* [PATCH v3 0/5] Support the FEAT_HDBSS introduced in Armv9.5
@ 2026-02-25 4:04 Tian Zheng
2026-02-25 4:04 ` [PATCH v3 1/5] arm64/sysreg: Add HDBSS related register information Tian Zheng
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Tian Zheng @ 2026-02-25 4:04 UTC (permalink / raw)
To: maz, oupton, catalin.marinas, corbet, pbonzini, will, zhengtian10
Cc: yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2,
linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose, leo.bras
This series of patches adds support for the Hardware Dirty state
tracking Structure (HDBSS) feature, introduced by the Arm architecture
in the DDI0601 (ID121123) version.
The HDBSS feature, identified as FEAT_HDBSS, is an extension to the
architecture that enhances the tracking of translation table
descriptors' dirty state. It uses hardware assistance to perform dirty
page tracking, aiming to significantly reduce the overhead of scanning
for dirty pages.
The purpose of this feature is to lower the overhead of live migration
for both the guest and the host, compared to existing approaches
(write-protection or searching stage 2 tables).
With these patches applied, userspace (such as QEMU) can use the
KVM_CAP_ARM_HW_DIRTY_STATE_TRACK capability to enable or disable the
HDBSS feature before and after live migration.
v2:
https://lore.kernel.org/linux-arm-kernel/20251121092342.3393318-1-zhengtian10@huawei.com/
v2->v3 changes:
- Remove the ARM64_HDBSS configuration option and ensure this feature
is only enabled in VHE mode.
- Move HDBSS-related variables to the arch-independent portion of the
kvm structure.
- Remove error messages during HDBSS enable/disable operations.
- Change HDBSS buffer flushing from handle_exit to vcpu_put,
check_vcpu_requests, and kvm_handle_guest_abort.
- Add fault handling for HDBSS including buffer full, external abort,
and general protection fault (GPF).
- Add support for a 4KB HDBSS buffer size, mapped to the value 0b0000.
- Add a second argument to the ioctl to turn HDBSS on or off.
Tian Zheng (1):
KVM: arm64: Document HDBSS ioctl
eillon (4):
arm64/sysreg: Add HDBSS related register information
KVM: arm64: Add support to set the DBM attr during memory abort
KVM: arm64: Add support for FEAT_HDBSS
KVM: arm64: Enable HDBSS support and handle HDBSSF events
Documentation/virt/kvm/api.rst | 16 +++++
arch/arm64/include/asm/cpufeature.h | 5 ++
arch/arm64/include/asm/esr.h | 7 ++
arch/arm64/include/asm/kvm_host.h | 17 +++++
arch/arm64/include/asm/kvm_mmu.h | 1 +
arch/arm64/include/asm/kvm_pgtable.h | 4 ++
arch/arm64/include/asm/sysreg.h | 11 +++
arch/arm64/kernel/cpufeature.c | 12 ++++
arch/arm64/kvm/arm.c | 102 +++++++++++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 6 ++
arch/arm64/kvm/hyp/vhe/switch.c | 19 +++++
arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++
arch/arm64/kvm/reset.c | 3 +
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 29 ++++++++
include/uapi/linux/kvm.h | 1 +
tools/include/uapi/linux/kvm.h | 1 +
17 files changed, 305 insertions(+)
--
2.33.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v3 1/5] arm64/sysreg: Add HDBSS related register information
2026-02-25 4:04 [PATCH v3 0/5] Support the FEAT_HDBSS introduced in Armv9.5 Tian Zheng
@ 2026-02-25 4:04 ` Tian Zheng
2026-02-25 4:04 ` [PATCH v3 2/5] KVM: arm64: Add support to set the DBM attr during memory abort Tian Zheng
` (3 subsequent siblings)
4 siblings, 0 replies; 24+ messages in thread
From: Tian Zheng @ 2026-02-25 4:04 UTC (permalink / raw)
To: maz, oupton, catalin.marinas, corbet, pbonzini, will, zhengtian10
Cc: yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2,
linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose, leo.bras
From: eillon <yezhenyu2@huawei.com>
The ARM architecture added the HDBSS feature and descriptions of the
related registers (HDBSSBR_EL2/HDBSSPROD_EL2) in the DDI0601 (ID121123)
version. Add them to Linux.
Signed-off-by: eillon <yezhenyu2@huawei.com>
Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
---
arch/arm64/include/asm/esr.h | 2 ++
arch/arm64/tools/sysreg | 29 +++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 7e86d400864e..81c17320a588 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -160,6 +160,8 @@
#define ESR_ELx_CM (UL(1) << ESR_ELx_CM_SHIFT)
/* ISS2 field definitions for Data Aborts */
+#define ESR_ELx_HDBSSF_SHIFT (11)
+#define ESR_ELx_HDBSSF (UL(1) << ESR_ELx_HDBSSF_SHIFT)
#define ESR_ELx_TnD_SHIFT (10)
#define ESR_ELx_TnD (UL(1) << ESR_ELx_TnD_SHIFT)
#define ESR_ELx_TagAccess_SHIFT (9)
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 9d1c21108057..e166ab322de2 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -4528,6 +4528,35 @@ Sysreg GCSPR_EL2 3 4 2 5 1
Fields GCSPR_ELx
EndSysreg
+Sysreg HDBSSBR_EL2 3 4 2 3 2
+Res0 63:56
+Field 55:12 BADDR
+Res0 11:4
+Enum 3:0 SZ
+ 0b0000 4KB
+ 0b0001 8KB
+ 0b0010 16KB
+ 0b0011 32KB
+ 0b0100 64KB
+ 0b0101 128KB
+ 0b0110 256KB
+ 0b0111 512KB
+ 0b1000 1MB
+ 0b1001 2MB
+EndEnum
+EndSysreg
+
+Sysreg HDBSSPROD_EL2 3 4 2 3 3
+Res0 63:32
+Enum 31:26 FSC
+ 0b000000 OK
+ 0b010000 ExternalAbort
+ 0b101000 GPF
+EndEnum
+Res0 25:19
+Field 18:0 INDEX
+EndSysreg
+
Sysreg DACR32_EL2 3 4 3 0 0
Res0 63:32
Field 31:30 D15
--
2.33.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v3 2/5] KVM: arm64: Add support to set the DBM attr during memory abort
2026-02-25 4:04 [PATCH v3 0/5] Support the FEAT_HDBSS introduced in Armv9.5 Tian Zheng
2026-02-25 4:04 ` [PATCH v3 1/5] arm64/sysreg: Add HDBSS related register information Tian Zheng
@ 2026-02-25 4:04 ` Tian Zheng
2026-02-25 4:04 ` [PATCH v3 3/5] KVM: arm64: Add support for FEAT_HDBSS Tian Zheng
` (2 subsequent siblings)
4 siblings, 0 replies; 24+ messages in thread
From: Tian Zheng @ 2026-02-25 4:04 UTC (permalink / raw)
To: maz, oupton, catalin.marinas, corbet, pbonzini, will, zhengtian10
Cc: yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2,
linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose, leo.bras
From: eillon <yezhenyu2@huawei.com>
Add support for setting the DBM (Dirty Bit Modifier) attribute in
stage-2 PTEs during user_mem_abort(). This bit, introduced in ARMv8.1,
enables hardware to automatically promote write-clean pages to
write-dirty, preventing the guest from trapping to EL2 due to missing
write permissions.
Signed-off-by: eillon <yezhenyu2@huawei.com>
Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
---
arch/arm64/include/asm/kvm_pgtable.h | 4 ++++
arch/arm64/kvm/hyp/pgtable.c | 6 ++++++
2 files changed, 10 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index c201168f2857..d0f280972a7a 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -93,6 +93,8 @@ typedef u64 kvm_pte_t;
#define KVM_PTE_LEAF_ATTR_HI_S2_XN GENMASK(54, 53)
+#define KVM_PTE_LEAF_ATTR_HI_S2_DBM BIT(51)
+
#define KVM_PTE_LEAF_ATTR_HI_S1_GP BIT(50)
#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
@@ -248,6 +250,7 @@ enum kvm_pgtable_stage2_flags {
* @KVM_PGTABLE_PROT_R: Read permission.
* @KVM_PGTABLE_PROT_DEVICE: Device attributes.
* @KVM_PGTABLE_PROT_NORMAL_NC: Normal noncacheable attributes.
+ * @KVM_PGTABLE_PROT_DBM: Dirty bit management attribute.
* @KVM_PGTABLE_PROT_SW0: Software bit 0.
* @KVM_PGTABLE_PROT_SW1: Software bit 1.
* @KVM_PGTABLE_PROT_SW2: Software bit 2.
@@ -263,6 +266,7 @@ enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_DEVICE = BIT(4),
KVM_PGTABLE_PROT_NORMAL_NC = BIT(5),
+ KVM_PGTABLE_PROT_DBM = BIT(6),
KVM_PGTABLE_PROT_SW0 = BIT(55),
KVM_PGTABLE_PROT_SW1 = BIT(56),
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0e4ddd28ef5d..5b4c46d8dc74 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -739,6 +739,9 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
if (prot & KVM_PGTABLE_PROT_W)
attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
+ if (prot & KVM_PGTABLE_PROT_DBM)
+ attr |= KVM_PTE_LEAF_ATTR_HI_S2_DBM;
+
if (!kvm_lpa2_is_enabled())
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
@@ -1361,6 +1364,9 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
if (prot & KVM_PGTABLE_PROT_W)
set |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
+ if (prot & KVM_PGTABLE_PROT_DBM)
+ set |= KVM_PTE_LEAF_ATTR_HI_S2_DBM;
+
ret = stage2_set_xn_attr(prot, &xn);
if (ret)
return ret;
--
2.33.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v3 3/5] KVM: arm64: Add support for FEAT_HDBSS
2026-02-25 4:04 [PATCH v3 0/5] Support the FEAT_HDBSS introduced in Armv9.5 Tian Zheng
2026-02-25 4:04 ` [PATCH v3 1/5] arm64/sysreg: Add HDBSS related register information Tian Zheng
2026-02-25 4:04 ` [PATCH v3 2/5] KVM: arm64: Add support to set the DBM attr during memory abort Tian Zheng
@ 2026-02-25 4:04 ` Tian Zheng
2026-02-25 4:04 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Tian Zheng
2026-02-25 4:04 ` [PATCH v3 5/5] KVM: arm64: Document HDBSS ioctl Tian Zheng
4 siblings, 0 replies; 24+ messages in thread
From: Tian Zheng @ 2026-02-25 4:04 UTC (permalink / raw)
To: maz, oupton, catalin.marinas, corbet, pbonzini, will, zhengtian10
Cc: yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2,
linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose, leo.bras
From: eillon <yezhenyu2@huawei.com>
Armv9.5 introduces the Hardware Dirty state tracking Structure (HDBSS)
feature, indicated by ID_AA64MMFR1_EL1.HAFDBS == 0b0100. A CPU
capability is added to report the feature.
Add the KVM_CAP_ARM_HW_DIRTY_STATE_TRACK capability and the basic
framework for arm64 HDBSS support. Since the HDBSS buffer size is
configurable and cannot be determined at KVM initialization time, an
ioctl interface is required.
Actually exposing the new capability to user space happens in a later
patch.
Signed-off-by: eillon <yezhenyu2@huawei.com>
Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
---
arch/arm64/include/asm/cpufeature.h | 5 +++++
arch/arm64/kernel/cpufeature.c | 12 ++++++++++++
arch/arm64/tools/cpucaps | 1 +
include/uapi/linux/kvm.h | 1 +
tools/include/uapi/linux/kvm.h | 1 +
5 files changed, 20 insertions(+)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 4de51f8d92cb..dcc2e2cad5ad 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -856,6 +856,11 @@ static inline bool system_supports_haft(void)
return cpus_have_final_cap(ARM64_HAFT);
}
+static inline bool system_supports_hdbss(void)
+{
+ return cpus_have_final_cap(ARM64_HAS_HDBSS);
+}
+
static __always_inline bool system_supports_mpam(void)
{
return alternative_has_cap_unlikely(ARM64_MPAM);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index c31f8e17732a..348b0afffc3e 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2124,6 +2124,11 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
}
+static bool has_vhe_hdbss(const struct arm64_cpu_capabilities *entry, int cope)
+{
+ return is_kernel_in_hyp_mode() && has_cpuid_feature(entry, cope);
+}
+
bool cpu_supports_bbml2_noabort(void)
{
/*
@@ -2759,6 +2764,13 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
ARM64_CPUID_FIELDS(ID_AA64MMFR1_EL1, HAFDBS, HAFT)
},
#endif
+ {
+ .desc = "Hardware Dirty state tracking structure (HDBSS)",
+ .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+ .capability = ARM64_HAS_HDBSS,
+ .matches = has_vhe_hdbss,
+ ARM64_CPUID_FIELDS(ID_AA64MMFR1_EL1, HAFDBS, HDBSS)
+ },
{
.desc = "CRC32 instructions",
.capability = ARM64_HAS_CRC32,
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 7261553b644b..f6ece5b85532 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -68,6 +68,7 @@ HAS_VA52
HAS_VIRT_HOST_EXTN
HAS_WFXT
HAS_XNX
+HAS_HDBSS
HAFT
HW_DBM
KVM_HVHE
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 65500f5db379..15ee42cdbd51 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -985,6 +985,7 @@ struct kvm_enable_cap {
#define KVM_CAP_ARM_SEA_TO_USER 245
#define KVM_CAP_S390_USER_OPEREXEC 246
#define KVM_CAP_S390_KEYOP 247
+#define KVM_CAP_ARM_HW_DIRTY_STATE_TRACK 248
struct kvm_irq_routing_irqchip {
__u32 irqchip;
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index dddb781b0507..93e0a1e14dc7 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -974,6 +974,7 @@ struct kvm_enable_cap {
#define KVM_CAP_GUEST_MEMFD_FLAGS 244
#define KVM_CAP_ARM_SEA_TO_USER 245
#define KVM_CAP_S390_USER_OPEREXEC 246
+#define KVM_CAP_ARM_HW_DIRTY_STATE_TRACK 248
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.33.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-02-25 4:04 [PATCH v3 0/5] Support the FEAT_HDBSS introduced in Armv9.5 Tian Zheng
` (2 preceding siblings ...)
2026-02-25 4:04 ` [PATCH v3 3/5] KVM: arm64: Add support for FEAT_HDBSS Tian Zheng
@ 2026-02-25 4:04 ` Tian Zheng
2026-02-25 17:46 ` Leonardo Bras
` (2 more replies)
2026-02-25 4:04 ` [PATCH v3 5/5] KVM: arm64: Document HDBSS ioctl Tian Zheng
4 siblings, 3 replies; 24+ messages in thread
From: Tian Zheng @ 2026-02-25 4:04 UTC (permalink / raw)
To: maz, oupton, catalin.marinas, corbet, pbonzini, will, zhengtian10
Cc: yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2,
linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose, leo.bras
From: eillon <yezhenyu2@huawei.com>
HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
migration. This feature is only supported in VHE mode.
Initially, S2 PTEs do not contain the DBM attribute. During migration,
write faults are handled by user_mem_abort(), which relaxes permissions
and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
writes no longer trap, as the hardware automatically transitions the
page from writable-clean to writable-dirty.
KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
enabled, the hardware observes the clean->dirty transition and records
the corresponding page into the HDBSS buffer.
During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
that check_vcpu_requests flushes the HDBSS buffer and propagates the
accumulated dirty information into the userspace-visible dirty bitmap.
Add fault handling for HDBSS including buffer full, external abort, and
general protection fault (GPF).
Signed-off-by: eillon <yezhenyu2@huawei.com>
Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
---
arch/arm64/include/asm/esr.h | 5 ++
arch/arm64/include/asm/kvm_host.h | 17 +++++
arch/arm64/include/asm/kvm_mmu.h | 1 +
arch/arm64/include/asm/sysreg.h | 11 ++++
arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
arch/arm64/kvm/reset.c | 3 +
8 files changed, 228 insertions(+)
diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 81c17320a588..2e6b679b5908 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -437,6 +437,11 @@
#ifndef __ASSEMBLER__
#include <asm/types.h>
+static inline bool esr_iss2_is_hdbssf(unsigned long esr)
+{
+ return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
+}
+
static inline unsigned long esr_brk_comment(unsigned long esr)
{
return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 5d5a3bbdb95e..57ee6b53e061 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -55,12 +55,17 @@
#define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
#define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
#define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
+#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
#define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
KVM_DIRTY_LOG_INITIALLY_SET)
#define KVM_HAVE_MMU_RWLOCK
+/* HDBSS entry field definitions */
+#define HDBSS_ENTRY_VALID BIT(0)
+#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
+
/*
* Mode of operation configurable with kvm-arm.mode early param.
* See Documentation/admin-guide/kernel-parameters.txt for more information.
@@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
u32 __attribute_const__ kvm_target_cpu(void);
void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
+void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
struct kvm_hyp_memcache {
phys_addr_t head;
@@ -405,6 +411,8 @@ struct kvm_arch {
* the associated pKVM instance in the hypervisor.
*/
struct kvm_protected_vm pkvm;
+
+ bool enable_hdbss;
};
struct kvm_vcpu_fault_info {
@@ -816,6 +824,12 @@ struct vcpu_reset_state {
bool reset;
};
+struct vcpu_hdbss_state {
+ phys_addr_t base_phys;
+ u32 size;
+ u32 next_index;
+};
+
struct vncr_tlb;
struct kvm_vcpu_arch {
@@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
/* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
struct vncr_tlb *vncr_tlb;
+
+ /* HDBSS registers info */
+ struct vcpu_hdbss_state hdbss;
};
/*
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index d968aca0461a..3fea8cfe8869 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
+void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
phys_addr_t kvm_mmu_get_httbr(void);
phys_addr_t kvm_get_idmap_vector(void);
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index f4436ecc630c..d11f4d0dd4e7 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -1039,6 +1039,17 @@
#define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
GCS_CAP_VALID_TOKEN)
+
+/*
+ * Definitions for the HDBSS feature
+ */
+#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
+
+#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
+ FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
+
+#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
+
/*
* Definitions for GICv5 instructions
*/
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 29f0326f7e00..d64da05e25c4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
}
+void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
+{
+ struct page *hdbss_pg;
+
+ hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
+ if (hdbss_pg)
+ __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
+
+ vcpu->arch.hdbss.size = 0;
+}
+
+static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
+ struct kvm_enable_cap *cap)
+{
+ unsigned long i;
+ struct kvm_vcpu *vcpu;
+ struct page *hdbss_pg = NULL;
+ __u64 size = cap->args[0];
+ bool enable = cap->args[1] ? true : false;
+
+ if (!system_supports_hdbss())
+ return -EINVAL;
+
+ if (size > HDBSS_MAX_SIZE)
+ return -EINVAL;
+
+ if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
+ return 0;
+
+ if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
+ return -EINVAL;
+
+ if (!enable) { /* Turn it off */
+ kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ /* Kick vcpus to flush hdbss buffer. */
+ kvm_vcpu_kick(vcpu);
+
+ kvm_arm_vcpu_free_hdbss(vcpu);
+ }
+
+ kvm->arch.enable_hdbss = false;
+
+ return 0;
+ }
+
+ /* Turn it on */
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
+ if (!hdbss_pg)
+ goto error_alloc;
+
+ vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
+ .base_phys = page_to_phys(hdbss_pg),
+ .size = size,
+ .next_index = 0,
+ };
+ }
+
+ kvm->arch.enable_hdbss = true;
+ kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
+
+ /*
+ * We should kick vcpus out of guest mode here to load new
+ * vtcr value to vtcr_el2 register when re-enter guest mode.
+ */
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_vcpu_kick(vcpu);
+
+ return 0;
+
+error_alloc:
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (vcpu->arch.hdbss.base_phys)
+ kvm_arm_vcpu_free_hdbss(vcpu);
+ }
+
+ return -ENOMEM;
+}
+
int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
struct kvm_enable_cap *cap)
{
@@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
break;
+ case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
+ mutex_lock(&kvm->lock);
+ r = kvm_cap_arm_enable_hdbss(kvm, cap);
+ mutex_unlock(&kvm->lock);
+ break;
default:
break;
}
@@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = kvm_supports_cacheable_pfnmap();
break;
+ case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
+ r = system_supports_hdbss();
+ break;
default:
r = 0;
}
@@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
if (kvm_dirty_ring_check_request(vcpu))
return 0;
+ if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
+ kvm_flush_hdbss_buffer(vcpu);
+
check_nested_vcpu_requests(vcpu);
}
@@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
{
+ /*
+ * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
+ * before reporting dirty_bitmap to userspace. Send a request with
+ * KVM_REQUEST_WAIT to flush buffer synchronously.
+ */
+ struct kvm_vcpu *vcpu;
+
+ if (!kvm->arch.enable_hdbss)
+ return;
+ kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
}
static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 9db3f11a4754..600cbc4f8ae9 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
local_irq_restore(flags);
}
+static void __load_hdbss(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+ u64 br_el2, prod_el2;
+
+ if (!kvm->arch.enable_hdbss)
+ return;
+
+ br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
+ prod_el2 = vcpu->arch.hdbss.next_index;
+
+ write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
+ write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
+
+ isb();
+}
+
void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
{
host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
@@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
__vcpu_load_switch_sysregs(vcpu);
__vcpu_load_activate_traps(vcpu);
__load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
+ __load_hdbss(vcpu);
}
void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
{
+ kvm_flush_hdbss_buffer(vcpu);
__vcpu_put_deactivate_traps(vcpu);
__vcpu_put_switch_sysregs(vcpu);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 070a01e53fcb..42b0710a16ce 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (writable)
prot |= KVM_PGTABLE_PROT_W;
+ if (writable && kvm->arch.enable_hdbss && logging_active)
+ prot |= KVM_PGTABLE_PROT_DBM;
+
if (exec_fault)
prot |= KVM_PGTABLE_PROT_X;
@@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
return 0;
}
+void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
+{
+ int idx, curr_idx;
+ u64 br_el2;
+ u64 *hdbss_buf;
+ struct kvm *kvm = vcpu->kvm;
+
+ if (!kvm->arch.enable_hdbss)
+ return;
+
+ curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
+ br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
+
+ /* Do nothing if the HDBSS buffer is empty or br_el2 is zero */
+ if (curr_idx == 0 || br_el2 == 0)
+ return;
+
+ hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
+ if (!hdbss_buf)
+ return;
+
+ guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
+ for (idx = 0; idx < curr_idx; idx++) {
+ u64 gpa;
+
+ gpa = hdbss_buf[idx];
+ if (!(gpa & HDBSS_ENTRY_VALID))
+ continue;
+
+ gpa &= HDBSS_ENTRY_IPA;
+ kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
+ }
+
+ /* reset HDBSS index */
+ write_sysreg_s(0, SYS_HDBSSPROD_EL2);
+ vcpu->arch.hdbss.next_index = 0;
+ isb();
+}
+
+static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
+{
+ u64 prod;
+ u64 fsc;
+
+ prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
+ fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
+
+ switch (fsc) {
+ case HDBSSPROD_EL2_FSC_OK:
+ /* Buffer full, which is reported as permission fault. */
+ kvm_flush_hdbss_buffer(vcpu);
+ return 1;
+ case HDBSSPROD_EL2_FSC_ExternalAbort:
+ case HDBSSPROD_EL2_FSC_GPF:
+ return -EFAULT;
+ default:
+ /* Unknown fault. */
+ WARN_ONCE(1,
+ "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
+ fsc, prod, vcpu->vcpu_id);
+ return -EFAULT;
+ }
+}
+
/**
* kvm_handle_guest_abort - handles all 2nd stage aborts
* @vcpu: the VCPU pointer
@@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
+ if (esr_iss2_is_hdbssf(esr))
+ return kvm_handle_hdbss_fault(vcpu);
+
if (esr_fsc_is_translation_fault(esr)) {
/* Beyond sanitised PARange (which is the IPA limit) */
if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 959532422d3a..c03a4b310b53 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
kfree(vcpu->arch.vncr_tlb);
kfree(vcpu->arch.ccsidr);
+
+ if (vcpu->kvm->arch.enable_hdbss)
+ kvm_arm_vcpu_free_hdbss(vcpu);
}
static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
--
2.33.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v3 5/5] KVM: arm64: Document HDBSS ioctl
2026-02-25 4:04 [PATCH v3 0/5] Support the FEAT_HDBSS introduced in Armv9.5 Tian Zheng
` (3 preceding siblings ...)
2026-02-25 4:04 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Tian Zheng
@ 2026-02-25 4:04 ` Tian Zheng
4 siblings, 0 replies; 24+ messages in thread
From: Tian Zheng @ 2026-02-25 4:04 UTC (permalink / raw)
To: maz, oupton, catalin.marinas, corbet, pbonzini, will, zhengtian10
Cc: yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2,
linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose, leo.bras
A new capability (KVM_CAP_ARM_HW_DIRTY_STATE_TRACK) provides a
mechanism for userspace to configure the HDBSS buffer size during live
migration, enabling hardware-assisted dirty page tracking.
Signed-off-by: eillon <yezhenyu2@huawei.com>
Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
---
Documentation/virt/kvm/api.rst | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index fc5736839edd..2b5531d40d02 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8896,6 +8896,22 @@ helpful if user space wants to emulate instructions which are not
This capability can be enabled dynamically even if VCPUs were already
created and are running.
+7.47 KVM_CAP_ARM_HW_DIRTY_STATE_TRACK
+-------------------------------------
+
+:Architectures: arm64
+:Type: VM
+:Parameters: args[0] is the allocation order determining HDBSS buffer size
+ args[1] is 0 to disable, 1 to enable HDBSS
+:Returns: 0 on success, negative value on failure
+
+Enables hardware-assisted dirty page tracking via the Hardware Dirty State
+Tracking Structure (HDBSS).
+
+When live migration is initiated, userspace can enable this feature via
+the KVM_ENABLE_CAP ioctl with KVM_CAP_ARM_HW_DIRTY_STATE_TRACK. KVM
+will allocate per-vCPU HDBSS buffers.
+
8. Other capabilities.
======================
--
2.33.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-02-25 4:04 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Tian Zheng
@ 2026-02-25 17:46 ` Leonardo Bras
2026-02-27 10:47 ` Tian Zheng
2026-03-04 15:40 ` Leonardo Bras
2026-03-25 18:05 ` Leonardo Bras
2 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2026-02-25 17:46 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
Hi Tian, eillon,
On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> From: eillon <yezhenyu2@huawei.com>
>
> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> migration. This feature is only supported in VHE mode.
I wonder if it would not be better to just use the feature when
available, instead of requiring userspace to enable it.
> Initially, S2 PTEs do not contain the DBM attribute. During migration,
> write faults are handled by user_mem_abort(), which relaxes permissions
> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> writes no longer trap, as the hardware automatically transitions the
> page from writable-clean to writable-dirty.
That way we have to actually take a fault for every write you do after
migration starts.
What if, instead, we set the DBM bit on every page faulted as writable
(or better, due to a write fault), so from the beginning of the VM we
know if memory is RO, WC (writable-clean) or WD (writable-dirty).
On top of that, we don't actually have to take faults when migration
starts, as HDBSS is tracking it all.
>
> KVM does not scan S2 page tables to consume DBM.
You mean dirty pages?
If so, it will actually use dirty-log or dirty-rings to track the dirty
pages, and not scan S2 on every iteration.
> Instead, when HDBSS is
> enabled, the hardware observes the clean->dirty transition and records
> the corresponding page into the HDBSS buffer.
>
> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> that check_vcpu_requests flushes the HDBSS buffer and propagates the
> accumulated dirty information into the userspace-visible dirty bitmap.
>
> Add fault handling for HDBSS including buffer full, external abort, and
> general protection fault (GPF).
>
> Signed-off-by: eillon <yezhenyu2@huawei.com>
> Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
> ---
> arch/arm64/include/asm/esr.h | 5 ++
> arch/arm64/include/asm/kvm_host.h | 17 +++++
> arch/arm64/include/asm/kvm_mmu.h | 1 +
> arch/arm64/include/asm/sysreg.h | 11 ++++
> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> arch/arm64/kvm/reset.c | 3 +
> 8 files changed, 228 insertions(+)
>
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index 81c17320a588..2e6b679b5908 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -437,6 +437,11 @@
> #ifndef __ASSEMBLER__
> #include <asm/types.h>
>
> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> +{
> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> +}
> +
> static inline unsigned long esr_brk_comment(unsigned long esr)
> {
> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5d5a3bbdb95e..57ee6b53e061 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -55,12 +55,17 @@
> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>
> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> KVM_DIRTY_LOG_INITIALLY_SET)
>
> #define KVM_HAVE_MMU_RWLOCK
>
> +/* HDBSS entry field definitions */
> +#define HDBSS_ENTRY_VALID BIT(0)
> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> +
> /*
> * Mode of operation configurable with kvm-arm.mode early param.
> * See Documentation/admin-guide/kernel-parameters.txt for more information.
> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> u32 __attribute_const__ kvm_target_cpu(void);
> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>
> struct kvm_hyp_memcache {
> phys_addr_t head;
> @@ -405,6 +411,8 @@ struct kvm_arch {
> * the associated pKVM instance in the hypervisor.
> */
> struct kvm_protected_vm pkvm;
> +
> + bool enable_hdbss;
> };
>
> struct kvm_vcpu_fault_info {
> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> bool reset;
> };
>
> +struct vcpu_hdbss_state {
> + phys_addr_t base_phys;
> + u32 size;
> + u32 next_index;
> +};
> +
> struct vncr_tlb;
>
> struct kvm_vcpu_arch {
> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>
> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> struct vncr_tlb *vncr_tlb;
> +
> + /* HDBSS registers info */
> + struct vcpu_hdbss_state hdbss;
> };
>
> /*
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index d968aca0461a..3fea8cfe8869 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>
> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>
> phys_addr_t kvm_mmu_get_httbr(void);
> phys_addr_t kvm_get_idmap_vector(void);
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index f4436ecc630c..d11f4d0dd4e7 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -1039,6 +1039,17 @@
>
> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> GCS_CAP_VALID_TOKEN)
> +
> +/*
> + * Definitions for the HDBSS feature
> + */
> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> +
> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> +
> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> +
Do we actually need the GENMASK above? Couldn't we just use
HDBSSBR_EL2_BADDR_MASK?
If the base address received from alloc_pages is not properly aligned, we
might end up writing into a different memory region than the one we allocated.
If you want to actually make sure the mem region is aligned, check just
after it's allocated, instead of silently masking it at this moment.
In any case, I wonder if we actually need the above defines, as it looks
like they could easily be replaced by what they expand to.
> /*
> * Definitions for GICv5 instructions
> */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 29f0326f7e00..d64da05e25c4 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> }
>
> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> +{
> + struct page *hdbss_pg;
> +
> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> + if (hdbss_pg)
> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> +
> + vcpu->arch.hdbss.size = 0;
> +}
> +
> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> + struct kvm_enable_cap *cap)
> +{
> + unsigned long i;
> + struct kvm_vcpu *vcpu;
> + struct page *hdbss_pg = NULL;
> + __u64 size = cap->args[0];
> + bool enable = cap->args[1] ? true : false;
> +
> + if (!system_supports_hdbss())
> + return -EINVAL;
> +
> + if (size > HDBSS_MAX_SIZE)
> + return -EINVAL;
> +
> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> + return 0;
> +
> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> + return -EINVAL;
> +
> + if (!enable) { /* Turn it off */
> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + /* Kick vcpus to flush hdbss buffer. */
> + kvm_vcpu_kick(vcpu);
> +
> + kvm_arm_vcpu_free_hdbss(vcpu);
> + }
> +
> + kvm->arch.enable_hdbss = false;
> +
> + return 0;
> + }
> +
> + /* Turn it on */
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> + if (!hdbss_pg)
> + goto error_alloc;
> +
> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> + .base_phys = page_to_phys(hdbss_pg),
> + .size = size,
> + .next_index = 0,
> + };
> + }
> +
> + kvm->arch.enable_hdbss = true;
> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> +
> + /*
> + * We should kick vcpus out of guest mode here to load new
> + * vtcr value to vtcr_el2 register when re-enter guest mode.
> + */
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + kvm_vcpu_kick(vcpu);
> +
> + return 0;
> +
> +error_alloc:
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (vcpu->arch.hdbss.base_phys)
> + kvm_arm_vcpu_free_hdbss(vcpu);
> + }
> +
> + return -ENOMEM;
> +}
> +
> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> struct kvm_enable_cap *cap)
> {
> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> r = 0;
> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> break;
> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> + mutex_lock(&kvm->lock);
> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> + mutex_unlock(&kvm->lock);
> + break;
If we prefer using an ioctl, I wonder if it would not be better to have an
arch-generic option that enables hw dirty-bit tracking, and all archs could
use it to implement their versions when available.
I guess any VMM would have a much easier time doing it once, than for every
arch they support.
> default:
> break;
> }
> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> r = kvm_supports_cacheable_pfnmap();
> break;
>
> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> + r = system_supports_hdbss();
> + break;
> default:
> r = 0;
> }
> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> if (kvm_dirty_ring_check_request(vcpu))
> return 0;
>
> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> + kvm_flush_hdbss_buffer(vcpu);
> +
> check_nested_vcpu_requests(vcpu);
> }
>
> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>
> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> {
> + /*
> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> + * before reporting dirty_bitmap to userspace. Send a request with
> + * KVM_REQUEST_WAIT to flush buffer synchronously.
> + */
> + struct kvm_vcpu *vcpu;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
>
> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> }
>
> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> index 9db3f11a4754..600cbc4f8ae9 100644
> --- a/arch/arm64/kvm/hyp/vhe/switch.c
> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> local_irq_restore(flags);
> }
>
> +static void __load_hdbss(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + u64 br_el2, prod_el2;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
> +
> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> + prod_el2 = vcpu->arch.hdbss.next_index;
> +
> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> +
> + isb();
> +}
> +
> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> {
> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> __vcpu_load_switch_sysregs(vcpu);
> __vcpu_load_activate_traps(vcpu);
> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> + __load_hdbss(vcpu);
> }
>
> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> {
> + kvm_flush_hdbss_buffer(vcpu);
> __vcpu_put_deactivate_traps(vcpu);
> __vcpu_put_switch_sysregs(vcpu);
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 070a01e53fcb..42b0710a16ce 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> if (writable)
> prot |= KVM_PGTABLE_PROT_W;
>
> + if (writable && kvm->arch.enable_hdbss && logging_active)
> + prot |= KVM_PGTABLE_PROT_DBM;
> +
> if (exec_fault)
> prot |= KVM_PGTABLE_PROT_X;
>
> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> +{
> + int idx, curr_idx;
> + u64 br_el2;
> + u64 *hdbss_buf;
> + struct kvm *kvm = vcpu->kvm;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
> +
> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> +
> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> + if (curr_idx == 0 || br_el2 == 0)
> + return;
> +
> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> + if (!hdbss_buf)
> + return;
> +
> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> + for (idx = 0; idx < curr_idx; idx++) {
> + u64 gpa;
> +
> + gpa = hdbss_buf[idx];
> + if (!(gpa & HDBSS_ENTRY_VALID))
> + continue;
> +
> + gpa &= HDBSS_ENTRY_IPA;
> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> + }
> +
> + /* reset HDBSS index */
> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> + vcpu->arch.hdbss.next_index = 0;
> + isb();
> +}
> +
> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> +{
> + u64 prod;
> + u64 fsc;
> +
> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> +
> + switch (fsc) {
> + case HDBSSPROD_EL2_FSC_OK:
> + /* Buffer full, which is reported as permission fault. */
> + kvm_flush_hdbss_buffer(vcpu);
> + return 1;
> + case HDBSSPROD_EL2_FSC_ExternalAbort:
> + case HDBSSPROD_EL2_FSC_GPF:
> + return -EFAULT;
> + default:
> + /* Unknown fault. */
> + WARN_ONCE(1,
> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> + fsc, prod, vcpu->vcpu_id);
> + return -EFAULT;
> + }
> +}
> +
> /**
> * kvm_handle_guest_abort - handles all 2nd stage aborts
> * @vcpu: the VCPU pointer
> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>
> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>
> + if (esr_iss2_is_hdbssf(esr))
> + return kvm_handle_hdbss_fault(vcpu);
> +
> if (esr_fsc_is_translation_fault(esr)) {
> /* Beyond sanitised PARange (which is the IPA limit) */
> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 959532422d3a..c03a4b310b53 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> kfree(vcpu->arch.vncr_tlb);
> kfree(vcpu->arch.ccsidr);
> +
> + if (vcpu->kvm->arch.enable_hdbss)
> + kvm_arm_vcpu_free_hdbss(vcpu);
> }
>
> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> --
> 2.33.0
>
Thx,
Leo
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-02-25 17:46 ` Leonardo Bras
@ 2026-02-27 10:47 ` Tian Zheng
2026-02-27 14:10 ` Leonardo Bras
0 siblings, 1 reply; 24+ messages in thread
From: Tian Zheng @ 2026-02-27 10:47 UTC (permalink / raw)
To: Leonardo Bras, Tian Zheng
Cc: maz, oupton, catalin.marinas, corbet, pbonzini, will, yuzenghui,
wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2, joey.gouly,
kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, skhan,
suzuki.poulose
On 2/26/2026 1:46 AM, Leonardo Bras wrote:
> Hi Tian, eillon,
>
> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>> From: eillon <yezhenyu2@huawei.com>
>>
>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
>> migration. This feature is only supported in VHE mode.
>
> I wonder if it would not be better just to use the feature if available,
> instead of needing to have userspace enabling it.
>
I agree. If we decide to make HDBSS automatically enabled, then
userspace would no longer need an explicit ioctl to turn it on. In that
case, the only userspace‑visible control we may still need is a
parameter to specify the HDBSS buffer size, with the kernel providing a
reasonable default (for example, 4 KB).
Under such a model, the workflow could be simplified to:
1. HDBSS is automatically enabled during KVM_SET_USER_MEMORY_REGION if
the feature is available.
2. HDBSS is automatically disabled when the source VM stops.
>> Initially, S2 PTEs don't contain the DBM attribute. During migration,
>> write faults are handled by user_mem_abort, which relaxes permissions
>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
>> writes no longer trap, as the hardware automatically transitions the page
>> from writable-clean to writable-dirty.
>
> That way we have to actually take a fault for every write you do after
> migration starts.
>
> What if, instead, we put the DBM bit in every memory faulted as writable
> (or better, due to a write fault), so from the beginning of the VM we
> know if memory is RO, WC (writable-clean) or WD (writable-dirty).
>
> On top of that, we don't actually have to take faults when migration
> starts, as HDBSS is tracking it all.
>
That makes sense. Pre‑setting DBM on writable mappings would avoid
taking write faults after migration starts; I will optimize this in the
next version.
>>
>> KVM does not scan S2 page tables to consume DBM.
>
> You mean dirty pages?
> If so, it will actually use dirty-log or dirty-rings to track the dirty
> pages, and not scan S2 on every iteration.
>
Sorry for the confusion — what I meant is that if I only added the DBM
bit, then relying on DBM for dirty‑page tracking would require scanning
the S2 page tables to find which PTEs have DBM set, and then updating
the dirty bitmap. That would obviously be expensive.
However, in the current patch series, DBM is used together with HDBSS.
With HDBSS enabled, the hardware directly tracks the writable‑clean ->
writable‑dirty transitions and pushes them to the HDBSS buffer, so we no longer
need to walk the S2 page tables at all. This is the main reason why
combining HDBSS with DBM provides a meaningful optimization.
I will clarify this more clearly in the next version.
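For readers following along, the consumption side described here (walking the
HDBSS buffer and marking pages dirty, as kvm_flush_hdbss_buffer does in the
patch) can be sketched as plain userspace C. This is only an illustration of
the entry layout used by the patch (bit 0 = valid, bits 55:12 = IPA, 4KB
pages); the helper name is made up for the example:

```c
#include <stdint.h>
#include <stdbool.h>

/* Field layout of one 64-bit HDBSS buffer entry, as in the patch. */
#define HDBSS_ENTRY_VALID	(1ULL << 0)
/* GENMASK_ULL(55, 12): bits 55..12 hold the IPA */
#define HDBSS_ENTRY_IPA		(((1ULL << 56) - 1) & ~((1ULL << 12) - 1))
#define HDBSS_PAGE_SHIFT	12	/* PAGE_SHIFT assumed to be 12 */

/*
 * Decode one entry: return true and store the guest frame number when the
 * entry is valid, mirroring the flush loop's "skip invalid, mask, shift".
 */
static bool hdbss_entry_to_gfn(uint64_t entry, uint64_t *gfn)
{
	if (!(entry & HDBSS_ENTRY_VALID))
		return false;

	*gfn = (entry & HDBSS_ENTRY_IPA) >> HDBSS_PAGE_SHIFT;
	return true;
}
```

In the kernel the resulting gfn is what gets passed to
kvm_vcpu_mark_page_dirty() under the mmu_lock.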
>> Instead, when HDBSS is
>> enabled, the hardware observes the clean->dirty transition and records
>> the corresponding page into the HDBSS buffer.
>>
>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
>> that check_vcpu_requests flushes the HDBSS buffer and propagates the
>> accumulated dirty information into the userspace-visible dirty bitmap.
>>
>> Add fault handling for HDBSS including buffer full, external abort, and
>> general protection fault (GPF).
>>
>> Signed-off-by: eillon <yezhenyu2@huawei.com>
>> Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
>> ---
>> arch/arm64/include/asm/esr.h | 5 ++
>> arch/arm64/include/asm/kvm_host.h | 17 +++++
>> arch/arm64/include/asm/kvm_mmu.h | 1 +
>> arch/arm64/include/asm/sysreg.h | 11 ++++
>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
>> arch/arm64/kvm/reset.c | 3 +
>> 8 files changed, 228 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index 81c17320a588..2e6b679b5908 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -437,6 +437,11 @@
>> #ifndef __ASSEMBLER__
>> #include <asm/types.h>
>>
>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
>> +{
>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
>> +}
>> +
>> static inline unsigned long esr_brk_comment(unsigned long esr)
>> {
>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 5d5a3bbdb95e..57ee6b53e061 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -55,12 +55,17 @@
>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>>
>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>> KVM_DIRTY_LOG_INITIALLY_SET)
>>
>> #define KVM_HAVE_MMU_RWLOCK
>>
>> +/* HDBSS entry field definitions */
>> +#define HDBSS_ENTRY_VALID BIT(0)
>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>> +
>> /*
>> * Mode of operation configurable with kvm-arm.mode early param.
>> * See Documentation/admin-guide/kernel-parameters.txt for more information.
>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
>> u32 __attribute_const__ kvm_target_cpu(void);
>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>>
>> struct kvm_hyp_memcache {
>> phys_addr_t head;
>> @@ -405,6 +411,8 @@ struct kvm_arch {
>> * the associated pKVM instance in the hypervisor.
>> */
>> struct kvm_protected_vm pkvm;
>> +
>> + bool enable_hdbss;
>> };
>>
>> struct kvm_vcpu_fault_info {
>> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
>> bool reset;
>> };
>>
>> +struct vcpu_hdbss_state {
>> + phys_addr_t base_phys;
>> + u32 size;
>> + u32 next_index;
>> +};
>> +
>> struct vncr_tlb;
>>
>> struct kvm_vcpu_arch {
>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>>
>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
>> struct vncr_tlb *vncr_tlb;
>> +
>> + /* HDBSS registers info */
>> + struct vcpu_hdbss_state hdbss;
>> };
>>
>> /*
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index d968aca0461a..3fea8cfe8869 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>
>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>>
>> phys_addr_t kvm_mmu_get_httbr(void);
>> phys_addr_t kvm_get_idmap_vector(void);
>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>> index f4436ecc630c..d11f4d0dd4e7 100644
>> --- a/arch/arm64/include/asm/sysreg.h
>> +++ b/arch/arm64/include/asm/sysreg.h
>> @@ -1039,6 +1039,17 @@
>>
>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
>> GCS_CAP_VALID_TOKEN)
>> +
>> +/*
>> + * Definitions for the HDBSS feature
>> + */
>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
>> +
>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
>> +
>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
>> +
>
> Do we actually need the GENMASK above? Could not we use just the
> HDBSSBR_EL2_BADDR_MASK?
>
> If the base address received in alloc_pages is not properly aligned, we
> might end up writing in some different memory region that we allocated on.
>
> If you want to actually make sure the mem region is aligned, check just
> after it's allocated, instead of silently masking it at this moment.
>
> In any case, I wonder if we actually need above defines, as it looks they
> could easily be replaced by what they do.
>
>
You're right; I will replace it with HDBSSBR_EL2_BADDR_MASK, and I will
add an explicit check to ensure that the physical address returned by
alloc_pages() is properly aligned.
I agree that some of them may not be necessary once the alignment is
validated. I will review them and simplify the definitions where
appropriate.
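As a sketch of the explicit check discussed above: HDBSSBR_EL2's BADDR field
appears to require the buffer base to be aligned to the buffer's size, i.e.
2^(12 + sz) bytes for size encoding sz (4KB maps to 0b0000 per the cover
letter). Assuming that reading of the register layout, the post-allocation
check could look like this (helper name is illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define HDBSS_BASE_SHIFT	12	/* 4KB granule, sz == 0 */

/*
 * Return true if base_phys is naturally aligned to the HDBSS buffer size,
 * so that masking with GENMASK(55, 12 + sz) would not silently drop bits.
 */
static bool hdbss_base_is_aligned(uint64_t base_phys, unsigned int sz)
{
	uint64_t align = 1ULL << (HDBSS_BASE_SHIFT + sz);

	return (base_phys & (align - 1)) == 0;
}
```

With such a check right after alloc_pages(), a misaligned allocation can be
rejected loudly instead of being masked into a different region.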
>> /*
>> * Definitions for GICv5 instructions
>> */
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 29f0326f7e00..d64da05e25c4 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>> }
>>
>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
>> +{
>> + struct page *hdbss_pg;
>> +
>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
>> + if (hdbss_pg)
>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
>> +
>> + vcpu->arch.hdbss.size = 0;
>> +}
>> +
>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
>> + struct kvm_enable_cap *cap)
>> +{
>> + unsigned long i;
>> + struct kvm_vcpu *vcpu;
>> + struct page *hdbss_pg = NULL;
>> + __u64 size = cap->args[0];
>> + bool enable = cap->args[1] ? true : false;
>> +
>> + if (!system_supports_hdbss())
>> + return -EINVAL;
>> +
>> + if (size > HDBSS_MAX_SIZE)
>> + return -EINVAL;
>> +
>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
>> + return 0;
>> +
>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
>> + return -EINVAL;
>> +
>> + if (!enable) { /* Turn it off */
>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
>> +
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + /* Kick vcpus to flush hdbss buffer. */
>> + kvm_vcpu_kick(vcpu);
>> +
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> + }
>> +
>> + kvm->arch.enable_hdbss = false;
>> +
>> + return 0;
>> + }
>> +
>> + /* Turn it on */
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
>> + if (!hdbss_pg)
>> + goto error_alloc;
>> +
>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
>> + .base_phys = page_to_phys(hdbss_pg),
>> + .size = size,
>> + .next_index = 0,
>> + };
>> + }
>> +
>> + kvm->arch.enable_hdbss = true;
>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
>> +
>> + /*
>> + * We should kick vcpus out of guest mode here to load new
>> + * vtcr value to vtcr_el2 register when re-enter guest mode.
>> + */
>> + kvm_for_each_vcpu(i, vcpu, kvm)
>> + kvm_vcpu_kick(vcpu);
>> +
>> + return 0;
>> +
>> +error_alloc:
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + if (vcpu->arch.hdbss.base_phys)
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> + }
>> +
>> + return -ENOMEM;
>> +}
>> +
>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>> struct kvm_enable_cap *cap)
>> {
>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>> r = 0;
>> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>> break;
>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>> + mutex_lock(&kvm->lock);
>> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
>> + mutex_unlock(&kvm->lock);
>> + break;
>
> If we prefer using a ioctl, I wonder if it would not be better to have a
> arch-generic option that enables hw dirty-bit tracking, and all archs could
> use it to implement their versions when available.
>
> I guess any VMM would have a much easier time doing it once, than for every
> arch they support.
>
I think that even if we eventually decide to enable HDBSS by default,
userspace will still need an ioctl to configure the HDBSS buffer size.
So an interface is required anyway.
Also, it makes sense to expose this as an arch‑generic capability rather
than an ARM‑specific one. I will rename the ioctl to something like
KVM_CAP_HW_DIRTY_STATE_TRACK, and each architecture can implement its
own hardware‑assisted dirty tracking when available.
I will update the interface in the next version.
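To make the userspace side of this interface concrete: under the ABI as
defined in this patch, a VMM fills args[0] with the buffer size encoding and
args[1] with the enable flag. The sketch below only shows that argument
packing; the struct here is a stand-in with the same layout as the kernel's
struct kvm_enable_cap, the helper name is invented, and the final capability
number (and any rename to KVM_CAP_HW_DIRTY_STATE_TRACK) is still under
discussion:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in mirroring the layout of struct kvm_enable_cap. */
struct hdbss_enable_cap {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t  pad[64];
};

/*
 * Pack the arguments for the HDBSS enable-cap call as this patch defines
 * them: args[0] = size encoding (0b0000 == 4KB), args[1] = enable/disable.
 */
static void hdbss_fill_cap(struct hdbss_enable_cap *cap, uint32_t cap_nr,
			   uint64_t size_enc, int enable)
{
	memset(cap, 0, sizeof(*cap));
	cap->cap = cap_nr;
	cap->args[0] = size_enc;
	cap->args[1] = enable ? 1 : 0;
}
```

The filled struct would then be handed to ioctl(vm_fd, KVM_ENABLE_CAP, &cap)
on the VM file descriptor.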
>> default:
>> break;
>> }
>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> r = kvm_supports_cacheable_pfnmap();
>> break;
>>
>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>> + r = system_supports_hdbss();
>> + break;
>> default:
>> r = 0;
>> }
>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>> if (kvm_dirty_ring_check_request(vcpu))
>> return 0;
>>
>> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>> + kvm_flush_hdbss_buffer(vcpu);
>> +
>> check_nested_vcpu_requests(vcpu);
>> }
>>
>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>
>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>> {
>> + /*
>> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
>> + * before reporting dirty_bitmap to userspace. Send a request with
>> + * KVM_REQUEST_WAIT to flush buffer synchronously.
>> + */
>> + struct kvm_vcpu *vcpu;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>>
>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
>> }
>>
>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
>> index 9db3f11a4754..600cbc4f8ae9 100644
>> --- a/arch/arm64/kvm/hyp/vhe/switch.c
>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
>> local_irq_restore(flags);
>> }
>>
>> +static void __load_hdbss(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm *kvm = vcpu->kvm;
>> + u64 br_el2, prod_el2;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>> +
>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>> + prod_el2 = vcpu->arch.hdbss.next_index;
>> +
>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
>> +
>> + isb();
>> +}
>> +
>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>> {
>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>> __vcpu_load_switch_sysregs(vcpu);
>> __vcpu_load_activate_traps(vcpu);
>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
>> + __load_hdbss(vcpu);
>> }
>>
>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
>> {
>> + kvm_flush_hdbss_buffer(vcpu);
>> __vcpu_put_deactivate_traps(vcpu);
>> __vcpu_put_switch_sysregs(vcpu);
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 070a01e53fcb..42b0710a16ce 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> if (writable)
>> prot |= KVM_PGTABLE_PROT_W;
>>
>> + if (writable && kvm->arch.enable_hdbss && logging_active)
>> + prot |= KVM_PGTABLE_PROT_DBM;
>> +
>> if (exec_fault)
>> prot |= KVM_PGTABLE_PROT_X;
>>
>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>> return 0;
>> }
>>
>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
>> +{
>> + int idx, curr_idx;
>> + u64 br_el2;
>> + u64 *hdbss_buf;
>> + struct kvm *kvm = vcpu->kvm;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>> +
>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>> +
>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
>> + if (curr_idx == 0 || br_el2 == 0)
>> + return;
>> +
>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
>> + if (!hdbss_buf)
>> + return;
>> +
>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
>> + for (idx = 0; idx < curr_idx; idx++) {
>> + u64 gpa;
>> +
>> + gpa = hdbss_buf[idx];
>> + if (!(gpa & HDBSS_ENTRY_VALID))
>> + continue;
>> +
>> + gpa &= HDBSS_ENTRY_IPA;
>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>> + }
>> +
>> + /* reset HDBSS index */
>> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>> + vcpu->arch.hdbss.next_index = 0;
>> + isb();
>> +}
>> +
>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>> +{
>> + u64 prod;
>> + u64 fsc;
>> +
>> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>> +
>> + switch (fsc) {
>> + case HDBSSPROD_EL2_FSC_OK:
>> + /* Buffer full, which is reported as permission fault. */
>> + kvm_flush_hdbss_buffer(vcpu);
>> + return 1;
>> + case HDBSSPROD_EL2_FSC_ExternalAbort:
>> + case HDBSSPROD_EL2_FSC_GPF:
>> + return -EFAULT;
>> + default:
>> + /* Unknown fault. */
>> + WARN_ONCE(1,
>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
>> + fsc, prod, vcpu->vcpu_id);
>> + return -EFAULT;
>> + }
>> +}
>> +
>> /**
>> * kvm_handle_guest_abort - handles all 2nd stage aborts
>> * @vcpu: the VCPU pointer
>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>
>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>
>> + if (esr_iss2_is_hdbssf(esr))
>> + return kvm_handle_hdbss_fault(vcpu);
>> +
>> if (esr_fsc_is_translation_fault(esr)) {
>> /* Beyond sanitised PARange (which is the IPA limit) */
>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index 959532422d3a..c03a4b310b53 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>> kfree(vcpu->arch.vncr_tlb);
>> kfree(vcpu->arch.ccsidr);
>> +
>> + if (vcpu->kvm->arch.enable_hdbss)
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> }
>>
>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>> --
>> 2.33.0
>>
>
> Thx,
> Leo
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-02-27 10:47 ` Tian Zheng
@ 2026-02-27 14:10 ` Leonardo Bras
2026-03-04 3:06 ` Tian Zheng
0 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2026-02-27 14:10 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose
On Fri, Feb 27, 2026 at 06:47:25PM +0800, Tian Zheng wrote:
>
>
> On 2/26/2026 1:46 AM, Leonardo Bras wrote:
> > Hi Tian, eillon,
> >
> > On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> > > From: eillon <yezhenyu2@huawei.com>
> > >
> > > HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> > > migration. This feature is only supported in VHE mode.
> >
> > I wonder if it would not be better just to use the feature if available,
> > instead of needing to have userspace enabling it.
> >
>
> I agree. If we decide to make HDBSS automatically enabled, then userspace
> would no longer need an explicit ioctl to turn it on. In that case, the only
> userspace‑visible control we may still need is a parameter to specify the
> HDBSS buffer size, with the kernel providing a reasonable default (for
> example, 4 KB).
>
> Under such a model, the workflow could be simplified to:
> 1. HDBSS is automatically enabled during KVM_SET_USER_MEMORY_REGION if the
> feature is available.
> 2. HDBSS is automatically disabled when the source VM stops.
>
I suggest we allocate the buffers and enable HDBSS during the first step of
live migration; this way we don't need to have this memory usage during the
lifetime of the VM, and we turn on HDBSS only when needed.
> > > Initially, S2 PTEs don't contain the DBM attribute. During migration,
> > > write faults are handled by user_mem_abort, which relaxes permissions
> > > and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> > > writes no longer trap, as the hardware automatically transitions the page
> > > from writable-clean to writable-dirty.
> >
> > That way we have to actually take a fault for every write you do after
> > migration starts.
> >
> > What if, instead, we put the DBM bit in every memory faulted as writable
> > (or better, due to a write fault), so from the beginning of the VM we
> > know if memory is RO, WC (writable-clean) or WD (writable-dirty).
> >
> > On top of that, we don't actually have to take faults when migration
> > starts, as HDBSS is tracking it all.
> >
>
> That makes sense. Pre‑setting DBM on writable mappings would avoid
> taking write faults after migration starts, I will optimize this in the
> next version.
>
> > >
> > > KVM does not scan S2 page tables to consume DBM.
> >
> > You mean dirty pages?
> > If so, it will actually use dirty-log or dirty-rings to track the dirty
> > pages, and not scan S2 on every iteration.
> >
>
> Sorry for the confusion — what I meant is that if I only added the DBM
> bit, then relying on DBM for dirty‑page tracking would require scanning
> the S2 page tables to find which PTEs have DBM set, and then updating
> the dirty bitmap. That would obviously be expensive.
>
> However, in the current patch series, DBM is used together with HDBSS.
> With HDBSS enabled, the hardware directly tracks the writable‑clean ->
> writable‑dirty transitions and push it to HDBSS buffer, so we no longer
> need to walk the S2 page tables at all. This is the main reason why
> combining HDBSS with DBM provides a meaningful optimization.
>
> I will clarify this more clearly in the next version.
Got it! Thanks for making it clear!
I would just mention that you are transferring the recorded data to the
dirty log, then.
>
> > > Instead, when HDBSS is
> > > enabled, the hardware observes the clean->dirty transition and records
> > > the corresponding page into the HDBSS buffer.
> > >
> > > During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> > > that check_vcpu_requests flushes the HDBSS buffer and propagates the
> > > accumulated dirty information into the userspace-visible dirty bitmap.
> > >
> > > Add fault handling for HDBSS including buffer full, external abort, and
> > > general protection fault (GPF).
> > >
> > > Signed-off-by: eillon <yezhenyu2@huawei.com>
> > > Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
> > > ---
> > > arch/arm64/include/asm/esr.h | 5 ++
> > > arch/arm64/include/asm/kvm_host.h | 17 +++++
> > > arch/arm64/include/asm/kvm_mmu.h | 1 +
> > > arch/arm64/include/asm/sysreg.h | 11 ++++
> > > arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> > > arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> > > arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> > > arch/arm64/kvm/reset.c | 3 +
> > > 8 files changed, 228 insertions(+)
> > >
> > > diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> > > index 81c17320a588..2e6b679b5908 100644
> > > --- a/arch/arm64/include/asm/esr.h
> > > +++ b/arch/arm64/include/asm/esr.h
> > > @@ -437,6 +437,11 @@
> > > #ifndef __ASSEMBLER__
> > > #include <asm/types.h>
> > >
> > > +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> > > +{
> > > + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> > > +}
> > > +
> > > static inline unsigned long esr_brk_comment(unsigned long esr)
> > > {
> > > return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 5d5a3bbdb95e..57ee6b53e061 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -55,12 +55,17 @@
> > > #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> > > #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> > > #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> > > +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
> > >
> > > #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> > > KVM_DIRTY_LOG_INITIALLY_SET)
> > >
> > > #define KVM_HAVE_MMU_RWLOCK
> > >
> > > +/* HDBSS entry field definitions */
> > > +#define HDBSS_ENTRY_VALID BIT(0)
> > > +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> > > +
> > > /*
> > > * Mode of operation configurable with kvm-arm.mode early param.
> > > * See Documentation/admin-guide/kernel-parameters.txt for more information.
> > > @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> > > u32 __attribute_const__ kvm_target_cpu(void);
> > > void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> > > void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
> > >
> > > struct kvm_hyp_memcache {
> > > phys_addr_t head;
> > > @@ -405,6 +411,8 @@ struct kvm_arch {
> > > * the associated pKVM instance in the hypervisor.
> > > */
> > > struct kvm_protected_vm pkvm;
> > > +
> > > + bool enable_hdbss;
> > > };
> > >
> > > struct kvm_vcpu_fault_info {
> > > @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> > > bool reset;
> > > };
> > >
> > > +struct vcpu_hdbss_state {
> > > + phys_addr_t base_phys;
> > > + u32 size;
> > > + u32 next_index;
> > > +};
> > > +
> > > struct vncr_tlb;
> > >
> > > struct kvm_vcpu_arch {
> > > @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
> > >
> > > /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> > > struct vncr_tlb *vncr_tlb;
> > > +
> > > + /* HDBSS registers info */
> > > + struct vcpu_hdbss_state hdbss;
> > > };
> > >
> > > /*
> > > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > > index d968aca0461a..3fea8cfe8869 100644
> > > --- a/arch/arm64/include/asm/kvm_mmu.h
> > > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > >
> > > int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > > int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
> > >
> > > phys_addr_t kvm_mmu_get_httbr(void);
> > > phys_addr_t kvm_get_idmap_vector(void);
> > > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> > > index f4436ecc630c..d11f4d0dd4e7 100644
> > > --- a/arch/arm64/include/asm/sysreg.h
> > > +++ b/arch/arm64/include/asm/sysreg.h
> > > @@ -1039,6 +1039,17 @@
> > >
> > > #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> > > GCS_CAP_VALID_TOKEN)
> > > +
> > > +/*
> > > + * Definitions for the HDBSS feature
> > > + */
> > > +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> > > +
> > > +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> > > + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> > > +
> > > +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> > > +
> >
> > Do we actually need the GENMASK above? Could not we use just the
> > HDBSSBR_EL2_BADDR_MASK?
> >
> > If the base address received in alloc_pages is not properly aligned, we
> > might end up writing in some different memory region that we allocated on.
> >
> > If you want to actually make sure the mem region is aligned, check just
> > after it's allocated, instead of silently masking it at this moment.
> >
> > In any case, I wonder if we actually need above defines, as it looks they
> > could easily be replaced by what they do.
> >
> >
>
> You're right, I will replace it with HDBSSBR_EL2_BADDR_MASK, and I will
> add an explicit check to ensure that the physical address returned by
> alloc_pages() is properly aligned.
I recommend checking whether the function used gives any guarantees with
respect to alignment, so that maybe we don't actually need the check.
>
> I agree that some of them may not be necessary once the alignment is
> validated. I will review them and simplify the definitions where
> appropriate.
Thanks!
>
> > > /*
> > > * Definitions for GICv5 instructions
> > > */
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 29f0326f7e00..d64da05e25c4 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> > > return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> > > }
> > >
> > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct page *hdbss_pg;
> > > +
> > > + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> > > + if (hdbss_pg)
> > > + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> > > +
> > > + vcpu->arch.hdbss.size = 0;
> > > +}
> > > +
> > > +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> > > + struct kvm_enable_cap *cap)
> > > +{
> > > + unsigned long i;
> > > + struct kvm_vcpu *vcpu;
> > > + struct page *hdbss_pg = NULL;
> > > + __u64 size = cap->args[0];
> > > + bool enable = cap->args[1] ? true : false;
> > > +
> > > + if (!system_supports_hdbss())
> > > + return -EINVAL;
> > > +
> > > + if (size > HDBSS_MAX_SIZE)
> > > + return -EINVAL;
> > > +
> > > + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> > > + return 0;
> > > +
> > > + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> > > + return -EINVAL;
> > > +
> > > + if (!enable) { /* Turn it off */
> > > + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> > > +
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + /* Kick vcpus to flush hdbss buffer. */
> > > + kvm_vcpu_kick(vcpu);
> > > +
> > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > + }
> > > +
> > > + kvm->arch.enable_hdbss = false;
> > > +
> > > + return 0;
> > > + }
> > > +
> > > + /* Turn it on */
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> > > + if (!hdbss_pg)
> > > + goto error_alloc;
> > > +
> > > + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> > > + .base_phys = page_to_phys(hdbss_pg),
> > > + .size = size,
> > > + .next_index = 0,
> > > + };
> > > + }
> > > +
> > > + kvm->arch.enable_hdbss = true;
> > > + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> > > +
> > > + /*
> > > + * We should kick vcpus out of guest mode here to load new
> > > + * vtcr value to vtcr_el2 register when re-enter guest mode.
> > > + */
> > > + kvm_for_each_vcpu(i, vcpu, kvm)
> > > + kvm_vcpu_kick(vcpu);
> > > +
> > > + return 0;
> > > +
> > > +error_alloc:
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + if (vcpu->arch.hdbss.base_phys)
> > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > + }
> > > +
> > > + return -ENOMEM;
> > > +}
> > > +
> > > int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > struct kvm_enable_cap *cap)
> > > {
> > > @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > r = 0;
> > > set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > > break;
> > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > + mutex_lock(&kvm->lock);
> > > + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> > > + mutex_unlock(&kvm->lock);
> > > + break;
> >
> > If we prefer using a ioctl, I wonder if it would not be better to have a
> > arch-generic option that enables hw dirty-bit tracking, and all archs could
> > use it to implement their versions when available.
> >
> > I guess any VMM would have a much easier time doing it once, than for every
> > arch they support.
> >
>
> I think that even if we eventually decide to enable HDBSS by default,
> userspace will still need an ioctl to configure the HDBSS buffer size.
> So an interface is required anyway.
That's a valid argument. But we could as well have those configured based
on VM memory size, or some other parameter of the VM.
Since it could be allocated just during the migration, we may have some
flexibility on size.
But sure, we could have a default value and let the user (optionally)
configure the per-vcpu HDBSS buffer size.
>
> Also, it makes sense to expose this as an arch‑generic capability rather
> than an ARM‑specific one. I will rename the ioctl to something like
> KVM_CAP_HW_DIRTY_STATE_TRACK, and each architecture can implement its
> own hardware‑assisted dirty tracking when available.
I wonder if we need a new capability for this, at all.
Couldn't we only use the feature when available?
>
> I will update the interface in the next version.
>
Thanks!
Leo
> > > default:
> > > break;
> > > }
> > > @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > r = kvm_supports_cacheable_pfnmap();
> > > break;
> > >
> > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > + r = system_supports_hdbss();
> > > + break;
> > > default:
> > > r = 0;
> > > }
> > > @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > if (kvm_dirty_ring_check_request(vcpu))
> > > return 0;
> > >
> > > + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> > > + kvm_flush_hdbss_buffer(vcpu);
> > > +
> > > check_nested_vcpu_requests(vcpu);
> > > }
> > >
> > > @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
> > >
> > > void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> > > {
> > > + /*
> > > + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> > > + * before reporting dirty_bitmap to userspace. Send a request with
> > > + * KVM_REQUEST_WAIT to flush buffer synchronously.
> > > + */
> > > + struct kvm_vcpu *vcpu;
> > > +
> > > + if (!kvm->arch.enable_hdbss)
> > > + return;
> > >
> > > + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> > > }
> > >
> > > static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> > > diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> > > index 9db3f11a4754..600cbc4f8ae9 100644
> > > --- a/arch/arm64/kvm/hyp/vhe/switch.c
> > > +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> > > @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> > > local_irq_restore(flags);
> > > }
> > >
> > > +static void __load_hdbss(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct kvm *kvm = vcpu->kvm;
> > > + u64 br_el2, prod_el2;
> > > +
> > > + if (!kvm->arch.enable_hdbss)
> > > + return;
> > > +
> > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > + prod_el2 = vcpu->arch.hdbss.next_index;
> > > +
> > > + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> > > + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> > > +
> > > + isb();
> > > +}
> > > +
> > > void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > {
> > > host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> > > @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > __vcpu_load_switch_sysregs(vcpu);
> > > __vcpu_load_activate_traps(vcpu);
> > > __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> > > + __load_hdbss(vcpu);
> > > }
> > >
> > > void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> > > {
> > > + kvm_flush_hdbss_buffer(vcpu);
> > > __vcpu_put_deactivate_traps(vcpu);
> > > __vcpu_put_switch_sysregs(vcpu);
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 070a01e53fcb..42b0710a16ce 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > if (writable)
> > > prot |= KVM_PGTABLE_PROT_W;
> > >
> > > + if (writable && kvm->arch.enable_hdbss && logging_active)
> > > + prot |= KVM_PGTABLE_PROT_DBM;
> > > +
> > > if (exec_fault)
> > > prot |= KVM_PGTABLE_PROT_X;
> > >
> > > @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > return 0;
> > > }
> > >
> > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> > > +{
> > > + int idx, curr_idx;
> > > + u64 br_el2;
> > > + u64 *hdbss_buf;
> > > + struct kvm *kvm = vcpu->kvm;
> > > +
> > > + if (!kvm->arch.enable_hdbss)
> > > + return;
> > > +
> > > + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > +
> > > + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> > > + if (curr_idx == 0 || br_el2 == 0)
> > > + return;
> > > +
> > > + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> > > + if (!hdbss_buf)
> > > + return;
> > > +
> > > + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> > > + for (idx = 0; idx < curr_idx; idx++) {
> > > + u64 gpa;
> > > +
> > > + gpa = hdbss_buf[idx];
> > > + if (!(gpa & HDBSS_ENTRY_VALID))
> > > + continue;
> > > +
> > > + gpa &= HDBSS_ENTRY_IPA;
> > > + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> > > + }
> > > +
> > > + /* reset HDBSS index */
> > > + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> > > + vcpu->arch.hdbss.next_index = 0;
> > > + isb();
> > > +}
> > > +
> > > +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> > > +{
> > > + u64 prod;
> > > + u64 fsc;
> > > +
> > > + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> > > + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> > > +
> > > + switch (fsc) {
> > > + case HDBSSPROD_EL2_FSC_OK:
> > > + /* Buffer full, which is reported as permission fault. */
> > > + kvm_flush_hdbss_buffer(vcpu);
> > > + return 1;
> > > + case HDBSSPROD_EL2_FSC_ExternalAbort:
> > > + case HDBSSPROD_EL2_FSC_GPF:
> > > + return -EFAULT;
> > > + default:
> > > + /* Unknown fault. */
> > > + WARN_ONCE(1,
> > > + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> > > + fsc, prod, vcpu->vcpu_id);
> > > + return -EFAULT;
> > > + }
> > > +}
> > > +
> > > /**
> > > * kvm_handle_guest_abort - handles all 2nd stage aborts
> > > * @vcpu: the VCPU pointer
> > > @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > >
> > > is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
> > >
> > > + if (esr_iss2_is_hdbssf(esr))
> > > + return kvm_handle_hdbss_fault(vcpu);
> > > +
> > > if (esr_fsc_is_translation_fault(esr)) {
> > > /* Beyond sanitised PARange (which is the IPA limit) */
> > > if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> > > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> > > index 959532422d3a..c03a4b310b53 100644
> > > --- a/arch/arm64/kvm/reset.c
> > > +++ b/arch/arm64/kvm/reset.c
> > > @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> > > free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> > > kfree(vcpu->arch.vncr_tlb);
> > > kfree(vcpu->arch.ccsidr);
> > > +
> > > + if (vcpu->kvm->arch.enable_hdbss)
> > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > }
> > >
> > > static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> > > --
> > > 2.33.0
> > >
> >
> > Thx,
> > Leo
> >
>
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-02-27 14:10 ` Leonardo Bras
@ 2026-03-04 3:06 ` Tian Zheng
2026-03-04 12:08 ` Leonardo Bras
0 siblings, 1 reply; 24+ messages in thread
From: Tian Zheng @ 2026-03-04 3:06 UTC (permalink / raw)
To: Leonardo Bras
Cc: maz, oupton, catalin.marinas, corbet, pbonzini, will, yuzenghui,
wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2, joey.gouly,
kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, skhan,
suzuki.poulose
Hi Leo,
On 2/27/2026 10:10 PM, Leonardo Bras wrote:
> On Fri, Feb 27, 2026 at 06:47:25PM +0800, Tian Zheng wrote:
>>
>> On 2/26/2026 1:46 AM, Leonardo Bras wrote:
>>> Hi Tian, eillon,
>>>
>>> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>>>> From: eillon <yezhenyu2@huawei.com>
>>>>
>>>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
>>>> migration. This feature is only supported in VHE mode.
>>> I wonder if it would not be better just to use the feature if available,
>>> instead of needing to have userspace enabling it.
>>>
>> I agree. If we decide to make HDBSS automatically enabled, then userspace
>> would no longer need an explicit ioctl to turn it on. In that case, the only
>> userspace‑visible control we may still need is a parameter to specify the
>> HDBSS buffer size, with the kernel providing a reasonable default (for
>> example, 4 KB).
>>
>> Under such a model, the workflow could be simplified to:
>> 1. HDBSS is automatically enabled during KVM_SET_USER_MEMORY_REGION if the
>> feature is available.
>> 2. HDBSS is automatically disabled when the source VM stops.
>>
> I suggest we allocate the buffers and enable HDBSS during the first step of
> live migration, this way we don't need o have this memory usage during the
> lifetime of the VM, and we turn on HDBSS only when needed.
Yes, we also think that enabling this feature and allocating the buffers
during the first step of live migration is the right approach.
>>>> Initially, S2 PTEs doesn't contain the DBM attribute. During migration,
>>>> write faults are handled by user_mem_abort, which relaxes permissions
>>>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
>>>> writes no longer trap, as the hardware automatically transitions the page
>>>> from writable-clean to writable-dirty.
>>> That way we have to actually take a fault for every write you do after
>>> migration starts.
>>>
>>> What if, instead, we put the DBM bit in every memory faulted as writable
>>> (or better, due to a write fault), so from the beginning of the VM we
>>> know if memory is RO, WC (writable-clean) or WD (writable-dirty).
>>>
>>> On top of that, we don't actually have to take faults when migration
>>> starts, as HDBSS is tracking it all.
>>>
>> That makes sense. Pre‑setting DBM on writable mappings would avoid
>> taking write faults after migration starts, I will optimize this in the
>> next version.
>>
>>>> KVM does not scan S2 page tables to consume DBM.
>>> You mean dirty pages?
>>> If so, it will actually use dirty-log or dirty-rings to track the dirty
>>> pages, and not scan S2 on every iteration.
>>>
>> Sorry for the confusion — what I meant is that if I only added the DBM
>> bit, then relying on DBM for dirty‑page tracking would require scanning
>> the S2 page tables to find which PTEs have DBM set, and then updating
>> the dirty bitmap. That would obviously be expensive.
>>
>> However, in the current patch series, DBM is used together with HDBSS.
>> With HDBSS enabled, the hardware directly tracks the writable‑clean ->
>> writable‑dirty transitions and push it to HDBSS buffer, so we no longer
>> need to walk the S2 page tables at all. This is the main reason why
>> combining HDBSS with DBM provides a meaningful optimization.
>>
>> I will clarify this more clearly in the next version.
> Got it! Thanks for making it clear!
> I would just mention that you are transfering the recorded data to the
> dirty log, then.
You're welcome! I'm glad the explanation is clear now.
>>>> Instead, when HDBSS is
>>>> enabled, the hardware observes the clean->dirty transition and records
>>>> the corresponding page into the HDBSS buffer.
>>>>
>>>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
>>>> that check_vcpu_requests flushes the HDBSS buffer and propagates the
>>>> accumulated dirty information into the userspace-visible dirty bitmap.
>>>>
>>>> Add fault handling for HDBSS including buffer full, external abort, and
>>>> general protection fault (GPF).
>>>>
>>>> Signed-off-by: eillon <yezhenyu2@huawei.com>
>>>> Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
>>>> ---
>>>> arch/arm64/include/asm/esr.h | 5 ++
>>>> arch/arm64/include/asm/kvm_host.h | 17 +++++
>>>> arch/arm64/include/asm/kvm_mmu.h | 1 +
>>>> arch/arm64/include/asm/sysreg.h | 11 ++++
>>>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
>>>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
>>>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
>>>> arch/arm64/kvm/reset.c | 3 +
>>>> 8 files changed, 228 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>>>> index 81c17320a588..2e6b679b5908 100644
>>>> --- a/arch/arm64/include/asm/esr.h
>>>> +++ b/arch/arm64/include/asm/esr.h
>>>> @@ -437,6 +437,11 @@
>>>> #ifndef __ASSEMBLER__
>>>> #include <asm/types.h>
>>>>
>>>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
>>>> +{
>>>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
>>>> +}
>>>> +
>>>> static inline unsigned long esr_brk_comment(unsigned long esr)
>>>> {
>>>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>> index 5d5a3bbdb95e..57ee6b53e061 100644
>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>> @@ -55,12 +55,17 @@
>>>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
>>>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
>>>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
>>>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>>>>
>>>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>>>> KVM_DIRTY_LOG_INITIALLY_SET)
>>>>
>>>> #define KVM_HAVE_MMU_RWLOCK
>>>>
>>>> +/* HDBSS entry field definitions */
>>>> +#define HDBSS_ENTRY_VALID BIT(0)
>>>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>>>> +
>>>> /*
>>>> * Mode of operation configurable with kvm-arm.mode early param.
>>>> * See Documentation/admin-guide/kernel-parameters.txt for more information.
>>>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
>>>> u32 __attribute_const__ kvm_target_cpu(void);
>>>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>>>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>>>>
>>>> struct kvm_hyp_memcache {
>>>> phys_addr_t head;
>>>> @@ -405,6 +411,8 @@ struct kvm_arch {
>>>> * the associated pKVM instance in the hypervisor.
>>>> */
>>>> struct kvm_protected_vm pkvm;
>>>> +
>>>> + bool enable_hdbss;
>>>> };
>>>>
>>>> struct kvm_vcpu_fault_info {
>>>> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
>>>> bool reset;
>>>> };
>>>>
>>>> +struct vcpu_hdbss_state {
>>>> + phys_addr_t base_phys;
>>>> + u32 size;
>>>> + u32 next_index;
>>>> +};
>>>> +
>>>> struct vncr_tlb;
>>>>
>>>> struct kvm_vcpu_arch {
>>>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>>>>
>>>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
>>>> struct vncr_tlb *vncr_tlb;
>>>> +
>>>> + /* HDBSS registers info */
>>>> + struct vcpu_hdbss_state hdbss;
>>>> };
>>>>
>>>> /*
>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>>> index d968aca0461a..3fea8cfe8869 100644
>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>>>
>>>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>>>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>>>>
>>>> phys_addr_t kvm_mmu_get_httbr(void);
>>>> phys_addr_t kvm_get_idmap_vector(void);
>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>> index f4436ecc630c..d11f4d0dd4e7 100644
>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>> @@ -1039,6 +1039,17 @@
>>>>
>>>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
>>>> GCS_CAP_VALID_TOKEN)
>>>> +
>>>> +/*
>>>> + * Definitions for the HDBSS feature
>>>> + */
>>>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
>>>> +
>>>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
>>>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
>>>> +
>>>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
>>>> +
>>> Do we actually need the GENMASK above? Could not we use just the
>>> HDBSSBR_EL2_BADDR_MASK?
>>>
>>> If the base address received in alloc_pages is not properly aligned, we
>>> might end up writing in some different memory region that we allocated on.
>>>
>>> If you want to actually make sure the mem region is aligned, check just
>>> after it's allocated, instead of silently masking it at this moment.
>>>
>>> In any case, I wonder if we actually need above defines, as it looks they
>>> could easily be replaced by what they do.
>>>
>>>
>> You're right, I will replace it with HDBSSBR_EL2_BADDR_MASK, and I will
>> add an explicit check to ensure that the physical address returned by
>> alloc_pages() is properly aligned.
> I recommend checking if the used function have any garantees in respect to
> alignment, so that maybe we may not actually need to check.
Ok, I will check that and confirm whether the function provides any
alignment guarantees.
>> I agree that some of them may not be necessary once the alignment is
>> validated. I will review them and simplify the definitions where
>> appropriate.
> Thanks!
>>>> /*
>>>> * Definitions for GICv5 instructions
>>>> */
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index 29f0326f7e00..d64da05e25c4 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>>>> }
>>>>
>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + struct page *hdbss_pg;
>>>> +
>>>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
>>>> + if (hdbss_pg)
>>>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
>>>> +
>>>> + vcpu->arch.hdbss.size = 0;
>>>> +}
>>>> +
>>>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
>>>> + struct kvm_enable_cap *cap)
>>>> +{
>>>> + unsigned long i;
>>>> + struct kvm_vcpu *vcpu;
>>>> + struct page *hdbss_pg = NULL;
>>>> + __u64 size = cap->args[0];
>>>> + bool enable = cap->args[1] ? true : false;
>>>> +
>>>> + if (!system_supports_hdbss())
>>>> + return -EINVAL;
>>>> +
>>>> + if (size > HDBSS_MAX_SIZE)
>>>> + return -EINVAL;
>>>> +
>>>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
>>>> + return 0;
>>>> +
>>>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
>>>> + return -EINVAL;
>>>> +
>>>> + if (!enable) { /* Turn it off */
>>>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
>>>> +
>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> + /* Kick vcpus to flush hdbss buffer. */
>>>> + kvm_vcpu_kick(vcpu);
>>>> +
>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>> + }
>>>> +
>>>> + kvm->arch.enable_hdbss = false;
>>>> +
>>>> + return 0;
>>>> + }
>>>> +
>>>> + /* Turn it on */
>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
>>>> + if (!hdbss_pg)
>>>> + goto error_alloc;
>>>> +
>>>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
>>>> + .base_phys = page_to_phys(hdbss_pg),
>>>> + .size = size,
>>>> + .next_index = 0,
>>>> + };
>>>> + }
>>>> +
>>>> + kvm->arch.enable_hdbss = true;
>>>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
>>>> +
>>>> + /*
>>>> + * We should kick vcpus out of guest mode here to load new
>>>> + * vtcr value to vtcr_el2 register when re-enter guest mode.
>>>> + */
>>>> + kvm_for_each_vcpu(i, vcpu, kvm)
>>>> + kvm_vcpu_kick(vcpu);
>>>> +
>>>> + return 0;
>>>> +
>>>> +error_alloc:
>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> + if (vcpu->arch.hdbss.base_phys)
>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>> + }
>>>> +
>>>> + return -ENOMEM;
>>>> +}
>>>> +
>>>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>> struct kvm_enable_cap *cap)
>>>> {
>>>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>> r = 0;
>>>> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>>>> break;
>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>> + mutex_lock(&kvm->lock);
>>>> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
>>>> + mutex_unlock(&kvm->lock);
>>>> + break;
>>> If we prefer using a ioctl, I wonder if it would not be better to have a
>>> arch-generic option that enables hw dirty-bit tracking, and all archs could
>>> use it to implement their versions when available.
>>>
>>> I guess any VMM would have a much easier time doing it once, than for every
>>> arch they support.
>>>
>> I think that even if we eventually decide to enable HDBSS by default,
>> userspace will still need an ioctl to configure the HDBSS buffer size.
>> So an interface is required anyway.
> That's a valid argument. But we could as well have those configured based
> on VM memory size, or other parameter from the VM.
> Since it could be allocated just during the migration, we may have some
> flexibility on size.
>
> But sure, we could have a default value and let user (optionally) configure
> the hdbss percpu bufsize.
That's a good idea! We can automatically determine an appropriate buffer
size when this feature is enabled during the first step of live migration,
and then we can remove the ioctl interface.
I will update this in the next version.
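(Aside: one way such an automatic sizing policy could look is sketched below. This is purely hypothetical — the name hdbss_order_for, the 1/64 coverage ratio, and the sizing heuristic are illustration-only assumptions, not part of this series; only the 4 KB..2 MB buffer range comes from the architecture.)

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define HDBSS_MIN_ORDER	0	/* 4 KB buffer */
#define HDBSS_MAX_ORDER	9	/* 2 MB buffer */

/*
 * Hypothetical sizing policy: one 8-byte HDBSS entry records one dirty
 * 4 KB page, so scale the per-vcpu buffer with guest memory size and
 * clamp it to the architectural 4 KB..2 MB range.
 */
static unsigned int hdbss_order_for(uint64_t guest_mem_bytes)
{
	uint64_t guest_pages = guest_mem_bytes >> PAGE_SHIFT;
	/* Aim to cover roughly 1/64 of guest pages per flush interval. */
	uint64_t bytes = (guest_pages / 64) * sizeof(uint64_t);
	unsigned int order = HDBSS_MIN_ORDER;

	while (order < HDBSS_MAX_ORDER &&
	       ((uint64_t)1 << (PAGE_SHIFT + order)) < bytes)
		order++;
	return order;
}
```

With this heuristic a 1 GiB guest would get an order-3 (32 KB) buffer and very large guests saturate at the 2 MB maximum.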
>> Also, it makes sense to expose this as an arch‑generic capability rather
>> than an ARM‑specific one. I will rename the ioctl to something like
>> KVM_CAP_HW_DIRTY_STATE_TRACK, and each architecture can implement its
>> own hardware‑assisted dirty tracking when available.
> I wonder if we need a new capability for this, at all.
> Couldn't we only use the feature when available?
>
If we decide to enable this feature entirely inside KVM, we could remove
this interface.
>> I will update the interface in the next version.
>>
> Thanks!
> Leo
>
>>>> default:
>>>> break;
>>>> }
>>>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>> r = kvm_supports_cacheable_pfnmap();
>>>> break;
>>>>
>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>> + r = system_supports_hdbss();
>>>> + break;
>>>> default:
>>>> r = 0;
>>>> }
>>>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>> if (kvm_dirty_ring_check_request(vcpu))
>>>> return 0;
>>>>
>>>> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>> +
>>>> check_nested_vcpu_requests(vcpu);
>>>> }
>>>>
>>>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>>>
>>>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>>>> {
>>>> + /*
>>>> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
>>>> + * before reporting dirty_bitmap to userspace. Send a request with
>>>> + * KVM_REQUEST_WAIT to flush buffer synchronously.
>>>> + */
>>>> + struct kvm_vcpu *vcpu;
>>>> +
>>>> + if (!kvm->arch.enable_hdbss)
>>>> + return;
>>>>
>>>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
>>>> }
>>>>
>>>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
>>>> index 9db3f11a4754..600cbc4f8ae9 100644
>>>> --- a/arch/arm64/kvm/hyp/vhe/switch.c
>>>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
>>>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
>>>> local_irq_restore(flags);
>>>> }
>>>>
>>>> +static void __load_hdbss(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + struct kvm *kvm = vcpu->kvm;
>>>> + u64 br_el2, prod_el2;
>>>> +
>>>> + if (!kvm->arch.enable_hdbss)
>>>> + return;
>>>> +
>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>> + prod_el2 = vcpu->arch.hdbss.next_index;
>>>> +
>>>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
>>>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
>>>> +
>>>> + isb();
>>>> +}
>>>> +
>>>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>> {
>>>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
>>>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>> __vcpu_load_switch_sysregs(vcpu);
>>>> __vcpu_load_activate_traps(vcpu);
>>>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
>>>> + __load_hdbss(vcpu);
>>>> }
>>>>
>>>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
>>>> {
>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>> __vcpu_put_deactivate_traps(vcpu);
>>>> __vcpu_put_switch_sysregs(vcpu);
>>>>
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index 070a01e53fcb..42b0710a16ce 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>> if (writable)
>>>> prot |= KVM_PGTABLE_PROT_W;
>>>>
>>>> + if (writable && kvm->arch.enable_hdbss && logging_active)
>>>> + prot |= KVM_PGTABLE_PROT_DBM;
>>>> +
>>>> if (exec_fault)
>>>> prot |= KVM_PGTABLE_PROT_X;
>>>>
>>>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>>>> return 0;
>>>> }
>>>>
>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + int idx, curr_idx;
>>>> + u64 br_el2;
>>>> + u64 *hdbss_buf;
>>>> + struct kvm *kvm = vcpu->kvm;
>>>> +
>>>> + if (!kvm->arch.enable_hdbss)
>>>> + return;
>>>> +
>>>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>> +
>>>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
>>>> + if (curr_idx == 0 || br_el2 == 0)
>>>> + return;
>>>> +
>>>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
>>>> + if (!hdbss_buf)
>>>> + return;
>>>> +
>>>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
>>>> + for (idx = 0; idx < curr_idx; idx++) {
>>>> + u64 gpa;
>>>> +
>>>> + gpa = hdbss_buf[idx];
>>>> + if (!(gpa & HDBSS_ENTRY_VALID))
>>>> + continue;
>>>> +
>>>> + gpa &= HDBSS_ENTRY_IPA;
>>>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>>>> + }
>>>> +
>>>> + /* reset HDBSS index */
>>>> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>>>> + vcpu->arch.hdbss.next_index = 0;
>>>> + isb();
>>>> +}
>>>> +
>>>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + u64 prod;
>>>> + u64 fsc;
>>>> +
>>>> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>>>> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>>>> +
>>>> + switch (fsc) {
>>>> + case HDBSSPROD_EL2_FSC_OK:
>>>> + /* Buffer full, which is reported as permission fault. */
>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>> + return 1;
>>>> + case HDBSSPROD_EL2_FSC_ExternalAbort:
>>>> + case HDBSSPROD_EL2_FSC_GPF:
>>>> + return -EFAULT;
>>>> + default:
>>>> + /* Unknown fault. */
>>>> + WARN_ONCE(1,
>>>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
>>>> + fsc, prod, vcpu->vcpu_id);
>>>> + return -EFAULT;
>>>> + }
>>>> +}
>>>> +
>>>> /**
>>>> * kvm_handle_guest_abort - handles all 2nd stage aborts
>>>> * @vcpu: the VCPU pointer
>>>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>>
>>>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>>>
>>>> + if (esr_iss2_is_hdbssf(esr))
>>>> + return kvm_handle_hdbss_fault(vcpu);
>>>> +
>>>> if (esr_fsc_is_translation_fault(esr)) {
>>>> /* Beyond sanitised PARange (which is the IPA limit) */
>>>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
>>>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>>>> index 959532422d3a..c03a4b310b53 100644
>>>> --- a/arch/arm64/kvm/reset.c
>>>> +++ b/arch/arm64/kvm/reset.c
>>>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>>>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>>>> kfree(vcpu->arch.vncr_tlb);
>>>> kfree(vcpu->arch.ccsidr);
>>>> +
>>>> + if (vcpu->kvm->arch.enable_hdbss)
>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>> }
>>>>
>>>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>>>> --
>>>> 2.33.0
>>>>
>>> Thx,
>>> Leo
>>>
Thx,
Tian
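(Archive note: the HDBSS entry layout consumed by kvm_flush_hdbss_buffer() in the patch quoted above — valid bit in bit 0, IPA in bits [55:12] — can be modelled in plain C. This is a userspace sketch of the decode step only, not the kernel code; GENMASK_ULL is open-coded here for self-containment.)

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT		12
#define HDBSS_ENTRY_VALID	(1ULL << 0)
/* Equivalent of the kernel's GENMASK_ULL(55, 12) */
#define HDBSS_ENTRY_IPA		((~0ULL >> (63 - 55)) & ~((1ULL << 12) - 1))

/*
 * Decode one HDBSS buffer entry: return the dirty guest page frame
 * number, or -1 if the valid bit is clear (entry must be skipped).
 */
static int64_t hdbss_entry_to_gfn(uint64_t entry)
{
	if (!(entry & HDBSS_ENTRY_VALID))
		return -1;
	return (int64_t)((entry & HDBSS_ENTRY_IPA) >> PAGE_SHIFT);
}
```

The flush loop in the patch does exactly this per entry and then feeds the gfn to kvm_vcpu_mark_page_dirty().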
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-04 3:06 ` Tian Zheng
@ 2026-03-04 12:08 ` Leonardo Bras
2026-03-05 7:37 ` Tian Zheng
0 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2026-03-04 12:08 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose
On Wed, Mar 04, 2026 at 11:06:46AM +0800, Tian Zheng wrote:
>
> Hi Leo,
>
>
> On 2/27/2026 10:10 PM, Leonardo Bras wrote:
> > On Fri, Feb 27, 2026 at 06:47:25PM +0800, Tian Zheng wrote:
> > >
> > > On 2/26/2026 1:46 AM, Leonardo Bras wrote:
> > > > Hi Tian, eillon,
> > > >
> > > > On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> > > > > From: eillon <yezhenyu2@huawei.com>
> > > > >
> > > > > HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> > > > > migration. This feature is only supported in VHE mode.
> > > > I wonder if it would not be better just to use the feature if available,
> > > > instead of needing to have userspace enabling it.
> > > >
> > > I agree. If we decide to make HDBSS automatically enabled, then userspace
> > > would no longer need an explicit ioctl to turn it on. In that case, the only
> > > userspace‑visible control we may still need is a parameter to specify the
> > > HDBSS buffer size, with the kernel providing a reasonable default (for
> > > example, 4 KB).
> > >
> > > Under such a model, the workflow could be simplified to:
> > > 1. HDBSS is automatically enabled during KVM_SET_USER_MEMORY_REGION if the
> > > feature is available.
> > > 2. HDBSS is automatically disabled when the source VM stops.
> > >
> > I suggest we allocate the buffers and enable HDBSS during the first step of
> > live migration, this way we don't need to have this memory usage during the
> > lifetime of the VM, and we turn on HDBSS only when needed.
>
>
> Yes, we also think that enabling this feature and allocating the buffers
> during the first step of live migration is the right approach.
>
>
> > > > > Initially, S2 PTEs don't contain the DBM attribute. During migration,
> > > > > write faults are handled by user_mem_abort, which relaxes permissions
> > > > > and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> > > > > writes no longer trap, as the hardware automatically transitions the page
> > > > > from writable-clean to writable-dirty.
> > > > That way we have to actually take a fault for every write you do after
> > > > migration starts.
> > > >
> > > > What if, instead, we put the DBM bit in every memory faulted as writable
> > > > (or better, due to a write fault), so from the beginning of the VM we
> > > > know if memory is RO, WC (writable-clean) or WD (writable-dirty).
> > > >
> > > > On top of that, we don't actually have to take faults when migration
> > > > starts, as HDBSS is tracking it all.
> > > >
> > > That makes sense. Pre‑setting DBM on writable mappings would avoid
> > > taking write faults after migration starts, I will optimize this in the
> > > next version.
> > >
> > > > > KVM does not scan S2 page tables to consume DBM.
> > > > You mean dirty pages?
> > > > If so, it will actually use dirty-log or dirty-rings to track the dirty
> > > > pages, and not scan S2 on every iteration.
> > > >
> > > Sorry for the confusion — what I meant is that if I only added the DBM
> > > bit, then relying on DBM for dirty‑page tracking would require scanning
> > > the S2 page tables to find which PTEs have DBM set, and then updating
> > > the dirty bitmap. That would obviously be expensive.
> > >
> > > However, in the current patch series, DBM is used together with HDBSS.
> > > With HDBSS enabled, the hardware directly tracks the writable‑clean ->
> > > writable‑dirty transitions and push it to HDBSS buffer, so we no longer
> > > need to walk the S2 page tables at all. This is the main reason why
> > > combining HDBSS with DBM provides a meaningful optimization.
> > >
> > > I will clarify this more clearly in the next version.
> > Got it! Thanks for making it clear!
> > I would just mention that you are transfering the recorded data to the
> > dirty log, then.
>
>
> You're welcome! I'm glad the explanation is clear now.
>
>
> > > > > Instead, when HDBSS is
> > > > > enabled, the hardware observes the clean->dirty transition and records
> > > > > the corresponding page into the HDBSS buffer.
> > > > >
> > > > > During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> > > > > that check_vcpu_requests flushes the HDBSS buffer and propagates the
> > > > > accumulated dirty information into the userspace-visible dirty bitmap.
> > > > >
> > > > > Add fault handling for HDBSS including buffer full, external abort, and
> > > > > general protection fault (GPF).
> > > > >
> > > > > Signed-off-by: eillon <yezhenyu2@huawei.com>
> > > > > Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
> > > > > ---
> > > > > arch/arm64/include/asm/esr.h | 5 ++
> > > > > arch/arm64/include/asm/kvm_host.h | 17 +++++
> > > > > arch/arm64/include/asm/kvm_mmu.h | 1 +
> > > > > arch/arm64/include/asm/sysreg.h | 11 ++++
> > > > > arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> > > > > arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> > > > > arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> > > > > arch/arm64/kvm/reset.c | 3 +
> > > > > 8 files changed, 228 insertions(+)
> > > > >
> > > > > diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> > > > > index 81c17320a588..2e6b679b5908 100644
> > > > > --- a/arch/arm64/include/asm/esr.h
> > > > > +++ b/arch/arm64/include/asm/esr.h
> > > > > @@ -437,6 +437,11 @@
> > > > > #ifndef __ASSEMBLER__
> > > > > #include <asm/types.h>
> > > > >
> > > > > +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> > > > > +{
> > > > > + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> > > > > +}
> > > > > +
> > > > > static inline unsigned long esr_brk_comment(unsigned long esr)
> > > > > {
> > > > > return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > > index 5d5a3bbdb95e..57ee6b53e061 100644
> > > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > > @@ -55,12 +55,17 @@
> > > > > #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> > > > > #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> > > > > #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> > > > > +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
> > > > >
> > > > > #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> > > > > KVM_DIRTY_LOG_INITIALLY_SET)
> > > > >
> > > > > #define KVM_HAVE_MMU_RWLOCK
> > > > >
> > > > > +/* HDBSS entry field definitions */
> > > > > +#define HDBSS_ENTRY_VALID BIT(0)
> > > > > +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> > > > > +
> > > > > /*
> > > > > * Mode of operation configurable with kvm-arm.mode early param.
> > > > > * See Documentation/admin-guide/kernel-parameters.txt for more information.
> > > > > @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> > > > > u32 __attribute_const__ kvm_target_cpu(void);
> > > > > void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> > > > > void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> > > > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
> > > > >
> > > > > struct kvm_hyp_memcache {
> > > > > phys_addr_t head;
> > > > > @@ -405,6 +411,8 @@ struct kvm_arch {
> > > > > * the associated pKVM instance in the hypervisor.
> > > > > */
> > > > > struct kvm_protected_vm pkvm;
> > > > > +
> > > > > + bool enable_hdbss;
> > > > > };
> > > > >
> > > > > struct kvm_vcpu_fault_info {
> > > > > @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> > > > > bool reset;
> > > > > };
> > > > >
> > > > > +struct vcpu_hdbss_state {
> > > > > + phys_addr_t base_phys;
> > > > > + u32 size;
> > > > > + u32 next_index;
> > > > > +};
> > > > > +
> > > > > struct vncr_tlb;
> > > > >
> > > > > struct kvm_vcpu_arch {
> > > > > @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
> > > > >
> > > > > /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> > > > > struct vncr_tlb *vncr_tlb;
> > > > > +
> > > > > + /* HDBSS registers info */
> > > > > + struct vcpu_hdbss_state hdbss;
> > > > > };
> > > > >
> > > > > /*
> > > > > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > > > > index d968aca0461a..3fea8cfe8869 100644
> > > > > --- a/arch/arm64/include/asm/kvm_mmu.h
> > > > > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > > > @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > > >
> > > > > int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > > > > int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> > > > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
> > > > >
> > > > > phys_addr_t kvm_mmu_get_httbr(void);
> > > > > phys_addr_t kvm_get_idmap_vector(void);
> > > > > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> > > > > index f4436ecc630c..d11f4d0dd4e7 100644
> > > > > --- a/arch/arm64/include/asm/sysreg.h
> > > > > +++ b/arch/arm64/include/asm/sysreg.h
> > > > > @@ -1039,6 +1039,17 @@
> > > > >
> > > > > #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> > > > > GCS_CAP_VALID_TOKEN)
> > > > > +
> > > > > +/*
> > > > > + * Definitions for the HDBSS feature
> > > > > + */
> > > > > +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> > > > > +
> > > > > +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> > > > > + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> > > > > +
> > > > > +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> > > > > +
> > > > Do we actually need the GENMASK above? Could not we use just the
> > > > HDBSSBR_EL2_BADDR_MASK?
> > > >
> > > > If the base address received in alloc_pages is not properly aligned, we
> > > > might end up writing in some different memory region that we allocated on.
> > > >
> > > > If you want to actually make sure the mem region is aligned, check just
> > > > after it's allocated, instead of silently masking it at this moment.
> > > >
> > > > In any case, I wonder if we actually need above defines, as it looks they
> > > > could easily be replaced by what they do.
> > > >
> > > >
> > > You're right, I will replace it with HDBSSBR_EL2_BADDR_MASK, and I will
> > > add an explicit check to ensure that the physical address returned by
> > > alloc_pages() is properly aligned.
> > I recommend checking if the used function have any garantees in respect to
> > alignment, so that maybe we may not actually need to check.
>
>
> Ok, I will check that and confirm whether the function provides any
> alignment guarantees.
>
>
See below comment
> > > I agree that some of them may not be necessary once the alignment is
> > > validated. I will review them and simplify the definitions where
> > > appropriate.
> > Thanks!
> > > > > /*
> > > > > * Definitions for GICv5 instructions
> > > > > */
> > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > > index 29f0326f7e00..d64da05e25c4 100644
> > > > > --- a/arch/arm64/kvm/arm.c
> > > > > +++ b/arch/arm64/kvm/arm.c
> > > > > @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> > > > > return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> > > > > }
> > > > >
> > > > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + struct page *hdbss_pg;
> > > > > +
> > > > > + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> > > > > + if (hdbss_pg)
> > > > > + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> > > > > +
> > > > > + vcpu->arch.hdbss.size = 0;
> > > > > +}
> > > > > +
> > > > > +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> > > > > + struct kvm_enable_cap *cap)
> > > > > +{
> > > > > + unsigned long i;
> > > > > + struct kvm_vcpu *vcpu;
> > > > > + struct page *hdbss_pg = NULL;
> > > > > + __u64 size = cap->args[0];
> > > > > + bool enable = cap->args[1] ? true : false;
> > > > > +
> > > > > + if (!system_supports_hdbss())
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (size > HDBSS_MAX_SIZE)
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> > > > > + return 0;
> > > > > +
> > > > > + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (!enable) { /* Turn it off */
> > > > > + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> > > > > +
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + /* Kick vcpus to flush hdbss buffer. */
> > > > > + kvm_vcpu_kick(vcpu);
> > > > > +
> > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > + }
> > > > > +
> > > > > + kvm->arch.enable_hdbss = false;
> > > > > +
> > > > > + return 0;
> > > > > + }
> > > > > +
> > > > > + /* Turn it on */
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
/**
* alloc_pages - Allocate pages.
* @gfp: GFP flags.
* @order: Power of two of number of pages to allocate.
*
* Allocate 1 << @order contiguous pages. The physical address of the
* first page is naturally aligned (eg an order-3 allocation will be aligned
* to a multiple of 8 * PAGE_SIZE bytes). The NUMA policy of the current
* process is honoured when in process context.
*
* Context: Can be called from any context, providing the appropriate GFP
* flags are used.
* Return: The page on success or NULL if allocation fails.
*/
It looks like we are safe from the aspect of alignment, according to the
documentation on alloc_pages.
I would rename the variable 'size' here, as it could be misleading even
though the ioctl docs state that it's the order.
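(To illustrate the masking concern raised earlier in this thread: if the base address were not naturally aligned for the given order, masking it with GENMASK(55, 12 + sz) when building HDBSSBR_EL2 would silently drop low bits and point the hardware at a different region. A small model, with GENMASK_ULL open-coded as an assumption for self-containment:)

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent of the kernel's GENMASK_ULL(h, l) */
#define GENMASK_ULL(h, l) \
	((~0ULL >> (63 - (h))) & (~0ULL << (l)))

/* Model of the BADDR field written to HDBSSBR_EL2 for order 'sz'. */
static uint64_t hdbssbr_baddr(uint64_t base_phys, unsigned int sz)
{
	return base_phys & GENMASK_ULL(55, 12 + sz);
}
```

A naturally aligned order-4 base survives the mask unchanged, while a merely page-aligned one is silently truncated — which is why the alignment guarantee documented for alloc_pages() matters here.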
> > > > > + if (!hdbss_pg)
> > > > > + goto error_alloc;
> > > > > +
> > > > > + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> > > > > + .base_phys = page_to_phys(hdbss_pg),
> > > > > + .size = size,
> > > > > + .next_index = 0,
> > > > > + };
> > > > > + }
> > > > > +
> > > > > + kvm->arch.enable_hdbss = true;
> > > > > + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> > > > > +
> > > > > + /*
> > > > > + * We should kick vcpus out of guest mode here to load new
> > > > > + * vtcr value to vtcr_el2 register when re-enter guest mode.
> > > > > + */
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm)
> > > > > + kvm_vcpu_kick(vcpu);
> > > > > +
> > > > > + return 0;
> > > > > +
> > > > > +error_alloc:
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + if (vcpu->arch.hdbss.base_phys)
> > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > + }
> > > > > +
> > > > > + return -ENOMEM;
> > > > > +}
> > > > > +
> > > > > int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > > > struct kvm_enable_cap *cap)
> > > > > {
> > > > > @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > > > r = 0;
> > > > > set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > > > > break;
> > > > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > > > + mutex_lock(&kvm->lock);
> > > > > + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> > > > > + mutex_unlock(&kvm->lock);
> > > > > + break;
> > > > If we prefer using a ioctl, I wonder if it would not be better to have a
> > > > arch-generic option that enables hw dirty-bit tracking, and all archs could
> > > > use it to implement their versions when available.
> > > >
> > > > I guess any VMM would have a much easier time doing it once, than for every
> > > > arch they support.
> > > >
> > > I think that even if we eventually decide to enable HDBSS by default,
> > > userspace will still need an ioctl to configure the HDBSS buffer size.
> > > So an interface is required anyway.
> > That's a valid argument. But we could as well have those configured based
> > on VM memory size, or other parameter from the VM.
> > Since it could be allocated just during the migration, we may have some
> > flexibility on size.
> >
> > But sure, we could have a default value and let user (optionally) configure
> > the hdbss percpu bufsize.
>
>
> That's a good idea! We can automatically determine an appropriate buffer
> size when this feature is enabled during the first step of live migration,
> and then we can remove the ioctl interface.
>
>
> I will update this in the next version.
>
>
> > > Also, it makes sense to expose this as an arch‑generic capability rather
> > > than an ARM‑specific one. I will rename the ioctl to something like
> > > KVM_CAP_HW_DIRTY_STATE_TRACK, and each architecture can implement its
> > > own hardware‑assisted dirty tracking when available.
> > I wonder if we need a new capability for this, at all.
> > Couldn't we only use the feature when available?
>
>
> If we decide to enable this feature entirely inside KVM, we could remove
> this interface.
>
I think that's the best option.
Thanks!
Leo
>
> > > I will update the interface in the next version.
> > >
> > Thanks!
> > Leo
> >
> > > > > default:
> > > > > break;
> > > > > }
> > > > > @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > > > r = kvm_supports_cacheable_pfnmap();
> > > > > break;
> > > > >
> > > > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > > > + r = system_supports_hdbss();
> > > > > + break;
> > > > > default:
> > > > > r = 0;
> > > > > }
> > > > > @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > > > if (kvm_dirty_ring_check_request(vcpu))
> > > > > return 0;
> > > > >
> > > > > + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > +
> > > > > check_nested_vcpu_requests(vcpu);
> > > > > }
> > > > >
> > > > > @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
> > > > >
> > > > > void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> > > > > {
> > > > > + /*
> > > > > + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> > > > > + * before reporting dirty_bitmap to userspace. Send a request with
> > > > > + * KVM_REQUEST_WAIT to flush buffer synchronously.
> > > > > + */
> > > > > + struct kvm_vcpu *vcpu;
> > > > > +
> > > > > + if (!kvm->arch.enable_hdbss)
> > > > > + return;
> > > > >
> > > > > + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> > > > > }
> > > > >
> > > > > static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> > > > > diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > index 9db3f11a4754..600cbc4f8ae9 100644
> > > > > --- a/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> > > > > local_irq_restore(flags);
> > > > > }
> > > > >
> > > > > +static void __load_hdbss(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + struct kvm *kvm = vcpu->kvm;
> > > > > + u64 br_el2, prod_el2;
> > > > > +
> > > > > + if (!kvm->arch.enable_hdbss)
> > > > > + return;
> > > > > +
> > > > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > > > + prod_el2 = vcpu->arch.hdbss.next_index;
> > > > > +
> > > > > + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> > > > > + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> > > > > +
> > > > > + isb();
> > > > > +}
> > > > > +
> > > > > void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > > > {
> > > > > host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> > > > > @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > > > __vcpu_load_switch_sysregs(vcpu);
> > > > > __vcpu_load_activate_traps(vcpu);
> > > > > __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> > > > > + __load_hdbss(vcpu);
> > > > > }
> > > > >
> > > > > void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> > > > > {
> > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > __vcpu_put_deactivate_traps(vcpu);
> > > > > __vcpu_put_switch_sysregs(vcpu);
> > > > >
> > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > > index 070a01e53fcb..42b0710a16ce 100644
> > > > > --- a/arch/arm64/kvm/mmu.c
> > > > > +++ b/arch/arm64/kvm/mmu.c
> > > > > @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > > > if (writable)
> > > > > prot |= KVM_PGTABLE_PROT_W;
> > > > >
> > > > > + if (writable && kvm->arch.enable_hdbss && logging_active)
> > > > > + prot |= KVM_PGTABLE_PROT_DBM;
> > > > > +
> > > > > if (exec_fault)
> > > > > prot |= KVM_PGTABLE_PROT_X;
> > > > >
> > > > > @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + int idx, curr_idx;
> > > > > + u64 br_el2;
> > > > > + u64 *hdbss_buf;
> > > > > + struct kvm *kvm = vcpu->kvm;
> > > > > +
> > > > > + if (!kvm->arch.enable_hdbss)
> > > > > + return;
> > > > > +
> > > > > + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> > > > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > > > +
> > > > > + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> > > > > + if (curr_idx == 0 || br_el2 == 0)
> > > > > + return;
> > > > > +
> > > > > + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> > > > > + if (!hdbss_buf)
> > > > > + return;
> > > > > +
> > > > > + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> > > > > + for (idx = 0; idx < curr_idx; idx++) {
> > > > > + u64 gpa;
> > > > > +
> > > > > + gpa = hdbss_buf[idx];
> > > > > + if (!(gpa & HDBSS_ENTRY_VALID))
> > > > > + continue;
> > > > > +
> > > > > + gpa &= HDBSS_ENTRY_IPA;
> > > > > + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> > > > > + }
> > > > > +
> > > > > + /* reset HDBSS index */
> > > > > + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> > > > > + vcpu->arch.hdbss.next_index = 0;
> > > > > + isb();
> > > > > +}
> > > > > +
> > > > > +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + u64 prod;
> > > > > + u64 fsc;
> > > > > +
> > > > > + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> > > > > + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> > > > > +
> > > > > + switch (fsc) {
> > > > > + case HDBSSPROD_EL2_FSC_OK:
> > > > > + /* Buffer full, which is reported as permission fault. */
> > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > + return 1;
> > > > > + case HDBSSPROD_EL2_FSC_ExternalAbort:
> > > > > + case HDBSSPROD_EL2_FSC_GPF:
> > > > > + return -EFAULT;
> > > > > + default:
> > > > > + /* Unknown fault. */
> > > > > + WARN_ONCE(1,
> > > > > + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> > > > > + fsc, prod, vcpu->vcpu_id);
> > > > > + return -EFAULT;
> > > > > + }
> > > > > +}
> > > > > +
> > > > > /**
> > > > > * kvm_handle_guest_abort - handles all 2nd stage aborts
> > > > > * @vcpu: the VCPU pointer
> > > > > @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > > >
> > > > > is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
> > > > >
> > > > > + if (esr_iss2_is_hdbssf(esr))
> > > > > + return kvm_handle_hdbss_fault(vcpu);
> > > > > +
> > > > > if (esr_fsc_is_translation_fault(esr)) {
> > > > > /* Beyond sanitised PARange (which is the IPA limit) */
> > > > > if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> > > > > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> > > > > index 959532422d3a..c03a4b310b53 100644
> > > > > --- a/arch/arm64/kvm/reset.c
> > > > > +++ b/arch/arm64/kvm/reset.c
> > > > > @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> > > > > free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> > > > > kfree(vcpu->arch.vncr_tlb);
> > > > > kfree(vcpu->arch.ccsidr);
> > > > > +
> > > > > + if (vcpu->kvm->arch.enable_hdbss)
> > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > }
> > > > >
> > > > > static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> > > > > --
> > > > > 2.33.0
> > > > >
> > > > Thx,
> > > > Leo
> > > >
>
> Thx,
>
> Tian
>
>
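(Archive note: the syndrome dispatch in kvm_handle_hdbss_fault() from the patch quoted above can be modelled as below. The FSC field position within HDBSSPROD_EL2 and the encoding values used here are illustration-only assumptions; consult the Arm register description for the real layout.)

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field placement and encodings, for illustration only. */
#define HDBSSPROD_FSC_SHIFT	26
#define HDBSSPROD_FSC_MASK	(0x3fULL << HDBSSPROD_FSC_SHIFT)
#define FSC_OK			0x00	/* buffer full */
#define FSC_EXTERNAL_ABORT	0x10
#define FSC_GPF			0x28

/*
 * Returns 1 when the buffer is merely full (flush and re-enter the
 * guest), or -14 (-EFAULT) for the fatal fault classes.
 */
static int handle_hdbss_fault(uint64_t prod)
{
	uint64_t fsc = (prod & HDBSSPROD_FSC_MASK) >> HDBSSPROD_FSC_SHIFT;

	switch (fsc) {
	case FSC_OK:
		return 1;
	case FSC_EXTERNAL_ABORT:
	case FSC_GPF:
	default:
		return -14;	/* -EFAULT */
	}
}
```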
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-02-25 4:04 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Tian Zheng
2026-02-25 17:46 ` Leonardo Bras
@ 2026-03-04 15:40 ` Leonardo Bras
2026-03-06 9:27 ` Tian Zheng
2026-03-25 18:05 ` Leonardo Bras
2 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2026-03-04 15:40 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
Hi Tian,
Few extra notes/questions below
On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> From: eillon <yezhenyu2@huawei.com>
>
> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> migration. This feature is only supported in VHE mode.
>
> Initially, S2 PTEs don't contain the DBM attribute. During migration,
> write faults are handled by user_mem_abort, which relaxes permissions
> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> writes no longer trap, as the hardware automatically transitions the page
> from writable-clean to writable-dirty.
>
> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
> enabled, the hardware observes the clean->dirty transition and records
> the corresponding page into the HDBSS buffer.
>
> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> that check_vcpu_requests flushes the HDBSS buffer and propagates the
> accumulated dirty information into the userspace-visible dirty bitmap.
>
> Add fault handling for HDBSS including buffer full, external abort, and
> general protection fault (GPF).
>
> Signed-off-by: eillon <yezhenyu2@huawei.com>
> Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
> ---
> arch/arm64/include/asm/esr.h | 5 ++
> arch/arm64/include/asm/kvm_host.h | 17 +++++
> arch/arm64/include/asm/kvm_mmu.h | 1 +
> arch/arm64/include/asm/sysreg.h | 11 ++++
> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> arch/arm64/kvm/reset.c | 3 +
> 8 files changed, 228 insertions(+)
>
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index 81c17320a588..2e6b679b5908 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -437,6 +437,11 @@
> #ifndef __ASSEMBLER__
> #include <asm/types.h>
>
> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> +{
> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> +}
> +
> static inline unsigned long esr_brk_comment(unsigned long esr)
> {
> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5d5a3bbdb95e..57ee6b53e061 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -55,12 +55,17 @@
> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>
> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> KVM_DIRTY_LOG_INITIALLY_SET)
>
> #define KVM_HAVE_MMU_RWLOCK
>
> +/* HDBSS entry field definitions */
> +#define HDBSS_ENTRY_VALID BIT(0)
> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> +
> /*
> * Mode of operation configurable with kvm-arm.mode early param.
> * See Documentation/admin-guide/kernel-parameters.txt for more information.
> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> u32 __attribute_const__ kvm_target_cpu(void);
> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>
> struct kvm_hyp_memcache {
> phys_addr_t head;
> @@ -405,6 +411,8 @@ struct kvm_arch {
> * the associated pKVM instance in the hypervisor.
> */
> struct kvm_protected_vm pkvm;
> +
> + bool enable_hdbss;
> };
>
> struct kvm_vcpu_fault_info {
> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> bool reset;
> };
>
> +struct vcpu_hdbss_state {
> + phys_addr_t base_phys;
> + u32 size;
> + u32 next_index;
> +};
> +
IIUC this is used once both on enable/disable and massively on
vcpu_put/get.
What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead?
That way we avoid having masking operations in put/get as well as any
possible error we may have formatting those.
The cost is doing those operations once for enable and once for disable,
which should be fine.
> struct vncr_tlb;
>
> struct kvm_vcpu_arch {
> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>
> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> struct vncr_tlb *vncr_tlb;
> +
> + /* HDBSS registers info */
> + struct vcpu_hdbss_state hdbss;
> };
>
> /*
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index d968aca0461a..3fea8cfe8869 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>
> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>
> phys_addr_t kvm_mmu_get_httbr(void);
> phys_addr_t kvm_get_idmap_vector(void);
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index f4436ecc630c..d11f4d0dd4e7 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -1039,6 +1039,17 @@
>
> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> GCS_CAP_VALID_TOKEN)
> +
> +/*
> + * Definitions for the HDBSS feature
> + */
> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> +
> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> +
> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> +
> /*
> * Definitions for GICv5 instructions
> */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 29f0326f7e00..d64da05e25c4 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> }
>
> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> +{
> + struct page *hdbss_pg;
> +
> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> + if (hdbss_pg)
> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> +
> + vcpu->arch.hdbss.size = 0;
> +}
> +
> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> + struct kvm_enable_cap *cap)
> +{
> + unsigned long i;
> + struct kvm_vcpu *vcpu;
> + struct page *hdbss_pg = NULL;
> + __u64 size = cap->args[0];
> + bool enable = cap->args[1] ? true : false;
> +
> + if (!system_supports_hdbss())
> + return -EINVAL;
> +
> + if (size > HDBSS_MAX_SIZE)
> + return -EINVAL;
> +
> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> + return 0;
> +
> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> + return -EINVAL;
> +
> + if (!enable) { /* Turn it off */
> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + /* Kick vcpus to flush hdbss buffer. */
> + kvm_vcpu_kick(vcpu);
> +
> + kvm_arm_vcpu_free_hdbss(vcpu);
> + }
> +
> + kvm->arch.enable_hdbss = false;
> +
> + return 0;
> + }
> +
> + /* Turn it on */
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> + if (!hdbss_pg)
> + goto error_alloc;
> +
> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> + .base_phys = page_to_phys(hdbss_pg),
> + .size = size,
> + .next_index = 0,
> + };
> + }
> +
> + kvm->arch.enable_hdbss = true;
> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> +
> + /*
> + * We should kick vcpus out of guest mode here to load new
> + * vtcr value to vtcr_el2 register when re-enter guest mode.
> + */
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + kvm_vcpu_kick(vcpu);
> +
> + return 0;
> +
> +error_alloc:
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (vcpu->arch.hdbss.base_phys)
> + kvm_arm_vcpu_free_hdbss(vcpu);
> + }
> +
> + return -ENOMEM;
> +}
> +
> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> struct kvm_enable_cap *cap)
> {
> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> r = 0;
> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> break;
> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> + mutex_lock(&kvm->lock);
> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> + mutex_unlock(&kvm->lock);
> + break;
> default:
> break;
> }
> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> r = kvm_supports_cacheable_pfnmap();
> break;
>
> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> + r = system_supports_hdbss();
> + break;
> default:
> r = 0;
> }
> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> if (kvm_dirty_ring_check_request(vcpu))
> return 0;
>
> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> + kvm_flush_hdbss_buffer(vcpu);
I am curious why we need a flush-hdbss request:
don't we have the flush function happening every time we run vcpu_put?
Oh, I see, you want to check if there is anything needed inside the inner
loop of vcpu_run, without having to vcpu_put. I think it makes sense.
But instead of having this on guest entry, doesn't it make more sense to
have it on guest exit? That way we flush (if needed) every time we exit the
guest, and instead of having a vcpu request, we just require a vcpu kick
and it should flush if needed.
Maybe have vcpu_put just save the registers, and add the flush before
handle_exit.
What do you think?
> +
> check_nested_vcpu_requests(vcpu);
> }
>
> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>
> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> {
> + /*
> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> + * before reporting dirty_bitmap to userspace. Send a request with
> + * KVM_REQUEST_WAIT to flush buffer synchronously.
> + */
> + struct kvm_vcpu *vcpu;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
>
> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> }
>
> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> index 9db3f11a4754..600cbc4f8ae9 100644
> --- a/arch/arm64/kvm/hyp/vhe/switch.c
> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> local_irq_restore(flags);
> }
>
> +static void __load_hdbss(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + u64 br_el2, prod_el2;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
> +
> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> + prod_el2 = vcpu->arch.hdbss.next_index;
> +
> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> +
> + isb();
> +}
> +
> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> {
> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> __vcpu_load_switch_sysregs(vcpu);
> __vcpu_load_activate_traps(vcpu);
> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> + __load_hdbss(vcpu);
> }
>
> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> {
> + kvm_flush_hdbss_buffer(vcpu);
> __vcpu_put_deactivate_traps(vcpu);
> __vcpu_put_switch_sysregs(vcpu);
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 070a01e53fcb..42b0710a16ce 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> if (writable)
> prot |= KVM_PGTABLE_PROT_W;
>
> + if (writable && kvm->arch.enable_hdbss && logging_active)
> + prot |= KVM_PGTABLE_PROT_DBM;
> +
> if (exec_fault)
> prot |= KVM_PGTABLE_PROT_X;
>
> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> +{
> + int idx, curr_idx;
> + u64 br_el2;
> + u64 *hdbss_buf;
> + struct kvm *kvm = vcpu->kvm;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
> +
> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> +
> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> + if (curr_idx == 0 || br_el2 == 0)
> + return;
> +
> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> + if (!hdbss_buf)
> + return;
> +
> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> + for (idx = 0; idx < curr_idx; idx++) {
> + u64 gpa;
> +
> + gpa = hdbss_buf[idx];
> + if (!(gpa & HDBSS_ENTRY_VALID))
> + continue;
> +
> + gpa &= HDBSS_ENTRY_IPA;
> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> + }
This will mark a page dirty in either the dirty_bitmap or the dirty_ring,
depending on which is in use.
Out of plain curiosity, have you planned / tested for the dirty-ring as
well, or just for dirty-bitmap?
> +
> + /* reset HDBSS index */
> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> + vcpu->arch.hdbss.next_index = 0;
> + isb();
> +}
> +
> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> +{
> + u64 prod;
> + u64 fsc;
> +
> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> +
> + switch (fsc) {
> + case HDBSSPROD_EL2_FSC_OK:
> + /* Buffer full, which is reported as permission fault. */
> + kvm_flush_hdbss_buffer(vcpu);
> + return 1;
Humm, flushing in a fault handler means hanging there, in IRQ context, for
a while.
Since we already deal with this on guest_exit (vcpu_put IIUC), why not just
return in a way the vcpu has to exit the inner loop and let it flush there
instead?
Thanks!
Leo
> + case HDBSSPROD_EL2_FSC_ExternalAbort:
> + case HDBSSPROD_EL2_FSC_GPF:
> + return -EFAULT;
> + default:
> + /* Unknown fault. */
> + WARN_ONCE(1,
> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> + fsc, prod, vcpu->vcpu_id);
> + return -EFAULT;
> + }
> +}
> +
> /**
> * kvm_handle_guest_abort - handles all 2nd stage aborts
> * @vcpu: the VCPU pointer
> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>
> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>
> + if (esr_iss2_is_hdbssf(esr))
> + return kvm_handle_hdbss_fault(vcpu);
> +
> if (esr_fsc_is_translation_fault(esr)) {
> /* Beyond sanitised PARange (which is the IPA limit) */
> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 959532422d3a..c03a4b310b53 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> kfree(vcpu->arch.vncr_tlb);
> kfree(vcpu->arch.ccsidr);
> +
> + if (vcpu->kvm->arch.enable_hdbss)
> + kvm_arm_vcpu_free_hdbss(vcpu);
> }
>
> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> --
> 2.33.0
>
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-04 12:08 ` Leonardo Bras
@ 2026-03-05 7:37 ` Tian Zheng
0 siblings, 0 replies; 24+ messages in thread
From: Tian Zheng @ 2026-03-05 7:37 UTC (permalink / raw)
To: Leonardo Bras
Cc: maz, oupton, catalin.marinas, corbet, pbonzini, will, yuzenghui,
wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2, joey.gouly,
kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, skhan,
suzuki.poulose
Hi Leo,
On 3/4/2026 8:08 PM, Leonardo Bras wrote:
> On Wed, Mar 04, 2026 at 11:06:46AM +0800, Tian Zheng wrote:
>> Hi Leo,
>>
>>
>> On 2/27/2026 10:10 PM, Leonardo Bras wrote:
>>> On Fri, Feb 27, 2026 at 06:47:25PM +0800, Tian Zheng wrote:
>>>> On 2/26/2026 1:46 AM, Leonardo Bras wrote:
>>>>> Hi Tian, eillon,
>>>>>
>>>>> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>>>>>> From: eillon <yezhenyu2@huawei.com>
>>>>>>
>>>>>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
>>>>>> migration. This feature is only supported in VHE mode.
>>>>> I wonder if it would not be better just to use the feature if available,
>>>>> instead of needing to have userspace enabling it.
>>>>>
>>>> I agree. If we decide to make HDBSS automatically enabled, then userspace
>>>> would no longer need an explicit ioctl to turn it on. In that case, the only
>>>> userspace‑visible control we may still need is a parameter to specify the
>>>> HDBSS buffer size, with the kernel providing a reasonable default (for
>>>> example, 4 KB).
>>>>
>>>> Under such a model, the workflow could be simplified to:
>>>> 1. HDBSS is automatically enabled during KVM_SET_USER_MEMORY_REGION if the
>>>> feature is available.
>>>> 2. HDBSS is automatically disabled when the source VM stops.
>>>>
>>> I suggest we allocate the buffers and enable HDBSS during the first step of
>>> live migration; this way we don't need to have this memory usage during the
>>> lifetime of the VM, and we turn on HDBSS only when needed.
>>
>> Yes, we also think that enabling this feature and allocating the buffers
>> during the first step of live migration is the right approach.
>>
>>
>>>>>> Initially, S2 PTEs don't contain the DBM attribute. During migration,
>>>>>> write faults are handled by user_mem_abort, which relaxes permissions
>>>>>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
>>>>>> writes no longer trap, as the hardware automatically transitions the page
>>>>>> from writable-clean to writable-dirty.
>>>>> That way we have to actually take a fault for every write you do after
>>>>> migration starts.
>>>>>
>>>>> What if, instead, we put the DBM bit in every memory faulted as writable
>>>>> (or better, due to a write fault), so from the beginning of the VM we
>>>>> know if memory is RO, WC (writable-clean) or WD (writable-dirty).
>>>>>
>>>>> On top of that, we don't actually have to take faults when migration
>>>>> starts, as HDBSS is tracking it all.
>>>>>
>>>> That makes sense. Pre‑setting DBM on writable mappings would avoid
>>>> taking write faults after migration starts, I will optimize this in the
>>>> next version.
>>>>
>>>>>> KVM does not scan S2 page tables to consume DBM.
>>>>> You mean dirty pages?
>>>>> If so, it will actually use dirty-log or dirty-rings to track the dirty
>>>>> pages, and not scan S2 on every iteration.
>>>>>
>>>> Sorry for the confusion — what I meant is that if I only added the DBM
>>>> bit, then relying on DBM for dirty‑page tracking would require scanning
>>>> the S2 page tables to find which PTEs have DBM set, and then updating
>>>> the dirty bitmap. That would obviously be expensive.
>>>>
>>>> However, in the current patch series, DBM is used together with HDBSS.
>>>> With HDBSS enabled, the hardware directly tracks the writable‑clean ->
>>>> writable‑dirty transitions and push it to HDBSS buffer, so we no longer
>>>> need to walk the S2 page tables at all. This is the main reason why
>>>> combining HDBSS with DBM provides a meaningful optimization.
>>>>
>>>> I will clarify this more clearly in the next version.
>>> Got it! Thanks for making it clear!
>>> I would just mention that you are transfering the recorded data to the
>>> dirty log, then.
>>
>> You're welcome! I'm glad the explanation is clear now.
>>
>>
>>>>>> Instead, when HDBSS is
>>>>>> enabled, the hardware observes the clean->dirty transition and records
>>>>>> the corresponding page into the HDBSS buffer.
>>>>>>
>>>>>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
>>>>>> that check_vcpu_requests flushes the HDBSS buffer and propagates the
>>>>>> accumulated dirty information into the userspace-visible dirty bitmap.
>>>>>>
>>>>>> Add fault handling for HDBSS including buffer full, external abort, and
>>>>>> general protection fault (GPF).
>>>>>>
>>>>>> Signed-off-by: eillon <yezhenyu2@huawei.com>
>>>>>> Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
>>>>>> ---
>>>>>> arch/arm64/include/asm/esr.h | 5 ++
>>>>>> arch/arm64/include/asm/kvm_host.h | 17 +++++
>>>>>> arch/arm64/include/asm/kvm_mmu.h | 1 +
>>>>>> arch/arm64/include/asm/sysreg.h | 11 ++++
>>>>>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
>>>>>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
>>>>>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
>>>>>> arch/arm64/kvm/reset.c | 3 +
>>>>>> 8 files changed, 228 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>>>>>> index 81c17320a588..2e6b679b5908 100644
>>>>>> --- a/arch/arm64/include/asm/esr.h
>>>>>> +++ b/arch/arm64/include/asm/esr.h
>>>>>> @@ -437,6 +437,11 @@
>>>>>> #ifndef __ASSEMBLER__
>>>>>> #include <asm/types.h>
>>>>>>
>>>>>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
>>>>>> +{
>>>>>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
>>>>>> +}
>>>>>> +
>>>>>> static inline unsigned long esr_brk_comment(unsigned long esr)
>>>>>> {
>>>>>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
>>>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>>>> index 5d5a3bbdb95e..57ee6b53e061 100644
>>>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>>>> @@ -55,12 +55,17 @@
>>>>>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
>>>>>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
>>>>>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
>>>>>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>>>>>>
>>>>>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>>>>>> KVM_DIRTY_LOG_INITIALLY_SET)
>>>>>>
>>>>>> #define KVM_HAVE_MMU_RWLOCK
>>>>>>
>>>>>> +/* HDBSS entry field definitions */
>>>>>> +#define HDBSS_ENTRY_VALID BIT(0)
>>>>>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>>>>>> +
>>>>>> /*
>>>>>> * Mode of operation configurable with kvm-arm.mode early param.
>>>>>> * See Documentation/admin-guide/kernel-parameters.txt for more information.
>>>>>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
>>>>>> u32 __attribute_const__ kvm_target_cpu(void);
>>>>>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>>>>>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>>>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>>>>>>
>>>>>> struct kvm_hyp_memcache {
>>>>>> phys_addr_t head;
>>>>>> @@ -405,6 +411,8 @@ struct kvm_arch {
>>>>>> * the associated pKVM instance in the hypervisor.
>>>>>> */
>>>>>> struct kvm_protected_vm pkvm;
>>>>>> +
>>>>>> + bool enable_hdbss;
>>>>>> };
>>>>>>
>>>>>> struct kvm_vcpu_fault_info {
>>>>>> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
>>>>>> bool reset;
>>>>>> };
>>>>>>
>>>>>> +struct vcpu_hdbss_state {
>>>>>> + phys_addr_t base_phys;
>>>>>> + u32 size;
>>>>>> + u32 next_index;
>>>>>> +};
>>>>>> +
>>>>>> struct vncr_tlb;
>>>>>>
>>>>>> struct kvm_vcpu_arch {
>>>>>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>>>>>>
>>>>>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
>>>>>> struct vncr_tlb *vncr_tlb;
>>>>>> +
>>>>>> + /* HDBSS registers info */
>>>>>> + struct vcpu_hdbss_state hdbss;
>>>>>> };
>>>>>>
>>>>>> /*
>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>>>>> index d968aca0461a..3fea8cfe8869 100644
>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>>>>>
>>>>>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>>>>>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>>>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>>>>>>
>>>>>> phys_addr_t kvm_mmu_get_httbr(void);
>>>>>> phys_addr_t kvm_get_idmap_vector(void);
>>>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>>>> index f4436ecc630c..d11f4d0dd4e7 100644
>>>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>>>> @@ -1039,6 +1039,17 @@
>>>>>>
>>>>>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
>>>>>> GCS_CAP_VALID_TOKEN)
>>>>>> +
>>>>>> +/*
>>>>>> + * Definitions for the HDBSS feature
>>>>>> + */
>>>>>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
>>>>>> +
>>>>>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
>>>>>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
>>>>>> +
>>>>>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
>>>>>> +
>>>>> Do we actually need the GENMASK above? Could not we use just the
>>>>> HDBSSBR_EL2_BADDR_MASK?
>>>>>
>>>>> If the base address received in alloc_pages is not properly aligned, we
>>>>> might end up writing in some different memory region that we allocated on.
>>>>>
>>>>> If you want to actually make sure the mem region is aligned, check just
>>>>> after it's allocated, instead of silently masking it at this moment.
>>>>>
>>>>> In any case, I wonder if we actually need above defines, as it looks they
>>>>> could easily be replaced by what they do.
>>>>>
>>>>>
>>>> You're right, I will replace it with HDBSSBR_EL2_BADDR_MASK, and I will
>>>> add an explicit check to ensure that the physical address returned by
>>>> alloc_pages() is properly aligned.
>>> I recommend checking whether the function used has any guarantees with
>>> respect to alignment, so that maybe we don't actually need to check.
>>
>> Ok, I will check that and confirm whether the function provides any
>> alignment guarantees.
>>
>>
> See below comment
>
>>>> I agree that some of them may not be necessary once the alignment is
>>>> validated. I will review them and simplify the definitions where
>>>> appropriate.
>>> Thanks!
>>>>>> /*
>>>>>> * Definitions for GICv5 instructions
>>>>>> */
>>>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>>>> index 29f0326f7e00..d64da05e25c4 100644
>>>>>> --- a/arch/arm64/kvm/arm.c
>>>>>> +++ b/arch/arm64/kvm/arm.c
>>>>>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>>>>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>>>>>> }
>>>>>>
>>>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + struct page *hdbss_pg;
>>>>>> +
>>>>>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
>>>>>> + if (hdbss_pg)
>>>>>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
>>>>>> +
>>>>>> + vcpu->arch.hdbss.size = 0;
>>>>>> +}
>>>>>> +
>>>>>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
>>>>>> + struct kvm_enable_cap *cap)
>>>>>> +{
>>>>>> + unsigned long i;
>>>>>> + struct kvm_vcpu *vcpu;
>>>>>> + struct page *hdbss_pg = NULL;
>>>>>> + __u64 size = cap->args[0];
>>>>>> + bool enable = cap->args[1] ? true : false;
>>>>>> +
>>>>>> + if (!system_supports_hdbss())
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (size > HDBSS_MAX_SIZE)
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
>>>>>> + return 0;
>>>>>> +
>>>>>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (!enable) { /* Turn it off */
>>>>>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
>>>>>> +
>>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> + /* Kick vcpus to flush hdbss buffer. */
>>>>>> + kvm_vcpu_kick(vcpu);
>>>>>> +
>>>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>> + }
>>>>>> +
>>>>>> + kvm->arch.enable_hdbss = false;
>>>>>> +
>>>>>> + return 0;
>>>>>> + }
>>>>>> +
>>>>>> + /* Turn it on */
>>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> /**
> * alloc_pages - Allocate pages.
> * @gfp: GFP flags.
> * @order: Power of two of number of pages to allocate.
> *
> * Allocate 1 << @order contiguous pages. The physical address of the
> * first page is naturally aligned (eg an order-3 allocation will be aligned
> * to a multiple of 8 * PAGE_SIZE bytes). The NUMA policy of the current
> * process is honoured when in process context.
> *
> * Context: Can be called from any context, providing the appropriate GFP
> * flags are used.
> * Return: The page on success or NULL if allocation fails.
> */
>
> It looks like we are safe from the aspect of alignment, according to the
> documentation on alloc_pages.
>
> I would rename the variable 'size' here; it could be misleading, even
> though the ioctl docs state that it's the order.
Sure, I will rename it from size to order. Thx!
>>>>>> + if (!hdbss_pg)
>>>>>> + goto error_alloc;
>>>>>> +
>>>>>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
>>>>>> + .base_phys = page_to_phys(hdbss_pg),
>>>>>> + .size = size,
>>>>>> + .next_index = 0,
>>>>>> + };
>>>>>> + }
>>>>>> +
>>>>>> + kvm->arch.enable_hdbss = true;
>>>>>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
>>>>>> +
>>>>>> + /*
>>>>>> + * We should kick vcpus out of guest mode here to load new
>>>>>> + * vtcr value to vtcr_el2 register when re-enter guest mode.
>>>>>> + */
>>>>>> + kvm_for_each_vcpu(i, vcpu, kvm)
>>>>>> + kvm_vcpu_kick(vcpu);
>>>>>> +
>>>>>> + return 0;
>>>>>> +
>>>>>> +error_alloc:
>>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> + if (vcpu->arch.hdbss.base_phys)
>>>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>> + }
>>>>>> +
>>>>>> + return -ENOMEM;
>>>>>> +}
>>>>>> +
>>>>>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>>> struct kvm_enable_cap *cap)
>>>>>> {
>>>>>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>>> r = 0;
>>>>>> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>>>>>> break;
>>>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>>>> + mutex_lock(&kvm->lock);
>>>>>> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
>>>>>> + mutex_unlock(&kvm->lock);
>>>>>> + break;
>>>>> If we prefer using a ioctl, I wonder if it would not be better to have a
>>>>> arch-generic option that enables hw dirty-bit tracking, and all archs could
>>>>> use it to implement their versions when available.
>>>>>
>>>>> I guess any VMM would have a much easier time doing it once, than for every
>>>>> arch they support.
>>>>>
>>>> I think that even if we eventually decide to enable HDBSS by default,
>>>> userspace will still need an ioctl to configure the HDBSS buffer size.
>>>> So an interface is required anyway.
>>> That's a valid argument. But we could as well have those configured based
>>> on VM memory size, or other parameter from the VM.
>>> Since it could be allocated just during the migration, we may have some
>>> flexibility on size.
>>>
>>> But sure, we could have a default value and let user (optionally) configure
>>> the hdbss percpu bufsize.
>>
>> That's a good idea! We can automatically determine an appropriate buffer
>> size when this feature is enabled during the first step of live migration,
>> and then we can remove the ioctl interface.
>>
>>
>> I will update this in the next version.
>>
>>
>>>> Also, it makes sense to expose this as an arch‑generic capability rather
>>>> than an ARM‑specific one. I will rename the ioctl to something like
>>>> KVM_CAP_HW_DIRTY_STATE_TRACK, and each architecture can implement its
>>>> own hardware‑assisted dirty tracking when available.
>>> I wonder if we need a new capability for this, at all.
>>> Couldn't we only use the feature when available?
>>
>> If we decide to enable this feature entirely inside KVM, we could remove
>> this interface.
>>
> I think that's the best option.
>
> Thanks!
> Leo
>
>>>> I will update the interface in the next version.
>>>>
>>> Thanks!
>>> Leo
>>>
>>>>>> default:
>>>>>> break;
>>>>>> }
>>>>>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>>> r = kvm_supports_cacheable_pfnmap();
>>>>>> break;
>>>>>>
>>>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>>>> + r = system_supports_hdbss();
>>>>>> + break;
>>>>>> default:
>>>>>> r = 0;
>>>>>> }
>>>>>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>>>> if (kvm_dirty_ring_check_request(vcpu))
>>>>>> return 0;
>>>>>>
>>>>>> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>>>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>>>> +
>>>>>> check_nested_vcpu_requests(vcpu);
>>>>>> }
>>>>>>
>>>>>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>>>>>
>>>>>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>>>>>> {
>>>>>> + /*
>>>>>> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
>>>>>> + * before reporting dirty_bitmap to userspace. Send a request with
>>>>>> + * KVM_REQUEST_WAIT to flush buffer synchronously.
>>>>>> + */
>>>>>> + struct kvm_vcpu *vcpu;
>>>>>> +
>>>>>> + if (!kvm->arch.enable_hdbss)
>>>>>> + return;
>>>>>>
>>>>>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
>>>>>> }
>>>>>>
>>>>>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>>>>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> index 9db3f11a4754..600cbc4f8ae9 100644
>>>>>> --- a/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
>>>>>> local_irq_restore(flags);
>>>>>> }
>>>>>>
>>>>>> +static void __load_hdbss(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + struct kvm *kvm = vcpu->kvm;
>>>>>> + u64 br_el2, prod_el2;
>>>>>> +
>>>>>> + if (!kvm->arch.enable_hdbss)
>>>>>> + return;
>>>>>> +
>>>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>>>> + prod_el2 = vcpu->arch.hdbss.next_index;
>>>>>> +
>>>>>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
>>>>>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
>>>>>> +
>>>>>> + isb();
>>>>>> +}
>>>>>> +
>>>>>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
>>>>>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>>>> __vcpu_load_switch_sysregs(vcpu);
>>>>>> __vcpu_load_activate_traps(vcpu);
>>>>>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
>>>>>> + __load_hdbss(vcpu);
>>>>>> }
>>>>>>
>>>>>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>>>> __vcpu_put_deactivate_traps(vcpu);
>>>>>> __vcpu_put_switch_sysregs(vcpu);
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>> index 070a01e53fcb..42b0710a16ce 100644
>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>> if (writable)
>>>>>> prot |= KVM_PGTABLE_PROT_W;
>>>>>>
>>>>>> + if (writable && kvm->arch.enable_hdbss && logging_active)
>>>>>> + prot |= KVM_PGTABLE_PROT_DBM;
>>>>>> +
>>>>>> if (exec_fault)
>>>>>> prot |= KVM_PGTABLE_PROT_X;
>>>>>>
>>>>>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + int idx, curr_idx;
>>>>>> + u64 br_el2;
>>>>>> + u64 *hdbss_buf;
>>>>>> + struct kvm *kvm = vcpu->kvm;
>>>>>> +
>>>>>> + if (!kvm->arch.enable_hdbss)
>>>>>> + return;
>>>>>> +
>>>>>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
>>>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>>>> +
>>>>>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
>>>>>> + if (curr_idx == 0 || br_el2 == 0)
>>>>>> + return;
>>>>>> +
>>>>>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
>>>>>> + if (!hdbss_buf)
>>>>>> + return;
>>>>>> +
>>>>>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
>>>>>> + for (idx = 0; idx < curr_idx; idx++) {
>>>>>> + u64 gpa;
>>>>>> +
>>>>>> + gpa = hdbss_buf[idx];
>>>>>> + if (!(gpa & HDBSS_ENTRY_VALID))
>>>>>> + continue;
>>>>>> +
>>>>>> + gpa &= HDBSS_ENTRY_IPA;
>>>>>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>>>>>> + }
>>>>>> +
>>>>>> + /* reset HDBSS index */
>>>>>> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>>>>>> + vcpu->arch.hdbss.next_index = 0;
>>>>>> + isb();
>>>>>> +}
>>>>>> +
>>>>>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + u64 prod;
>>>>>> + u64 fsc;
>>>>>> +
>>>>>> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>>>>>> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>>>>>> +
>>>>>> + switch (fsc) {
>>>>>> + case HDBSSPROD_EL2_FSC_OK:
>>>>>> + /* Buffer full, which is reported as permission fault. */
>>>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>>>> + return 1;
>>>>>> + case HDBSSPROD_EL2_FSC_ExternalAbort:
>>>>>> + case HDBSSPROD_EL2_FSC_GPF:
>>>>>> + return -EFAULT;
>>>>>> + default:
>>>>>> + /* Unknown fault. */
>>>>>> + WARN_ONCE(1,
>>>>>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
>>>>>> + fsc, prod, vcpu->vcpu_id);
>>>>>> + return -EFAULT;
>>>>>> + }
>>>>>> +}
>>>>>> +
>>>>>> /**
>>>>>> * kvm_handle_guest_abort - handles all 2nd stage aborts
>>>>>> * @vcpu: the VCPU pointer
>>>>>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>>>>
>>>>>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>>>>>
>>>>>> + if (esr_iss2_is_hdbssf(esr))
>>>>>> + return kvm_handle_hdbss_fault(vcpu);
>>>>>> +
>>>>>> if (esr_fsc_is_translation_fault(esr)) {
>>>>>> /* Beyond sanitised PARange (which is the IPA limit) */
>>>>>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
>>>>>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>>>>>> index 959532422d3a..c03a4b310b53 100644
>>>>>> --- a/arch/arm64/kvm/reset.c
>>>>>> +++ b/arch/arm64/kvm/reset.c
>>>>>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>>>>>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>>>>>> kfree(vcpu->arch.vncr_tlb);
>>>>>> kfree(vcpu->arch.ccsidr);
>>>>>> +
>>>>>> + if (vcpu->kvm->arch.enable_hdbss)
>>>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>> }
>>>>>>
>>>>>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>>>>>> --
>>>>>> 2.33.0
>>>>>>
>>>>> Thx,
>>>>> Leo
>>>>>
>> Thx,
>>
>> Tian
>>
>>
Thx,
Tian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-04 15:40 ` Leonardo Bras
@ 2026-03-06 9:27 ` Tian Zheng
2026-03-06 15:01 ` Leonardo Bras
0 siblings, 1 reply; 24+ messages in thread
From: Tian Zheng @ 2026-03-06 9:27 UTC (permalink / raw)
To: Leonardo Bras
Cc: maz, oupton, catalin.marinas, corbet, pbonzini, will, yuzenghui,
wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2, linuxarm,
joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose
Hi Leo,
On 3/4/2026 11:40 PM, Leonardo Bras wrote:
> Hi Tian,
>
> Few extra notes/questions below
>
> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>> From: eillon <yezhenyu2@huawei.com>
>>
>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
>> migration. This feature is only supported in VHE mode.
>>
>> Initially, S2 PTEs don't contain the DBM attribute. During migration,
>> write faults are handled by user_mem_abort, which relaxes permissions
>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
>> writes no longer trap, as the hardware automatically transitions the page
>> from writable-clean to writable-dirty.
>>
>> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
>> enabled, the hardware observes the clean->dirty transition and records
>> the corresponding page into the HDBSS buffer.
>>
>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
>> that check_vcpu_requests flushes the HDBSS buffer and propagates the
>> accumulated dirty information into the userspace-visible dirty bitmap.
>>
>> Add fault handling for HDBSS including buffer full, external abort, and
>> general protection fault (GPF).
>>
>> Signed-off-by: eillon <yezhenyu2@huawei.com>
>> Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
>> ---
>> arch/arm64/include/asm/esr.h | 5 ++
>> arch/arm64/include/asm/kvm_host.h | 17 +++++
>> arch/arm64/include/asm/kvm_mmu.h | 1 +
>> arch/arm64/include/asm/sysreg.h | 11 ++++
>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
>> arch/arm64/kvm/reset.c | 3 +
>> 8 files changed, 228 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index 81c17320a588..2e6b679b5908 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -437,6 +437,11 @@
>> #ifndef __ASSEMBLER__
>> #include <asm/types.h>
>>
>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
>> +{
>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
>> +}
>> +
>> static inline unsigned long esr_brk_comment(unsigned long esr)
>> {
>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 5d5a3bbdb95e..57ee6b53e061 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -55,12 +55,17 @@
>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>>
>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>> KVM_DIRTY_LOG_INITIALLY_SET)
>>
>> #define KVM_HAVE_MMU_RWLOCK
>>
>> +/* HDBSS entry field definitions */
>> +#define HDBSS_ENTRY_VALID BIT(0)
>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>> +
>> /*
>> * Mode of operation configurable with kvm-arm.mode early param.
>> * See Documentation/admin-guide/kernel-parameters.txt for more information.
>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
>> u32 __attribute_const__ kvm_target_cpu(void);
>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>>
>> struct kvm_hyp_memcache {
>> phys_addr_t head;
>> @@ -405,6 +411,8 @@ struct kvm_arch {
>> * the associated pKVM instance in the hypervisor.
>> */
>> struct kvm_protected_vm pkvm;
>> +
>> + bool enable_hdbss;
>> };
>>
>> struct kvm_vcpu_fault_info {
>> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
>> bool reset;
>> };
>>
>> +struct vcpu_hdbss_state {
>> + phys_addr_t base_phys;
>> + u32 size;
>> + u32 next_index;
>> +};
>> +
> IIUC this is used once both on enable/disable and massively on
> vcpu_put/get.
>
> What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead?
> That way we avoid having masking operations in put/get as well as any
> possible error we may have formatting those.
>
> The cost is doing those operations once for enable and once for disable,
> which should be fine.
Thanks for the suggestion. I actually started with storing the raw system
register values, as you proposed.

However, after discussing it with Oliver Upton in v1, we felt that keeping
the base address, size, and index as separate fields makes the state easier
to understand. Discussion link:
https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/

That's why I ended up changing the storage approach in the end.
>> struct vncr_tlb;
>>
>> struct kvm_vcpu_arch {
>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>>
>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
>> struct vncr_tlb *vncr_tlb;
>> +
>> + /* HDBSS registers info */
>> + struct vcpu_hdbss_state hdbss;
>> };
>>
>> /*
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index d968aca0461a..3fea8cfe8869 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>
>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>>
>> phys_addr_t kvm_mmu_get_httbr(void);
>> phys_addr_t kvm_get_idmap_vector(void);
>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>> index f4436ecc630c..d11f4d0dd4e7 100644
>> --- a/arch/arm64/include/asm/sysreg.h
>> +++ b/arch/arm64/include/asm/sysreg.h
>> @@ -1039,6 +1039,17 @@
>>
>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
>> GCS_CAP_VALID_TOKEN)
>> +
>> +/*
>> + * Definitions for the HDBSS feature
>> + */
>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
>> +
>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
>> +
>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
>> +
>> /*
>> * Definitions for GICv5 instructions
>> */
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 29f0326f7e00..d64da05e25c4 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>> }
>>
>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
>> +{
>> + struct page *hdbss_pg;
>> +
>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
>> + if (hdbss_pg)
>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
>> +
>> + vcpu->arch.hdbss.size = 0;
>> +}
>> +
>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
>> + struct kvm_enable_cap *cap)
>> +{
>> + unsigned long i;
>> + struct kvm_vcpu *vcpu;
>> + struct page *hdbss_pg = NULL;
>> + __u64 size = cap->args[0];
>> + bool enable = cap->args[1] ? true : false;
>> +
>> + if (!system_supports_hdbss())
>> + return -EINVAL;
>> +
>> + if (size > HDBSS_MAX_SIZE)
>> + return -EINVAL;
>> +
>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
>> + return 0;
>> +
>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
>> + return -EINVAL;
>> +
>> + if (!enable) { /* Turn it off */
>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
>> +
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + /* Kick vcpus to flush hdbss buffer. */
>> + kvm_vcpu_kick(vcpu);
>> +
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> + }
>> +
>> + kvm->arch.enable_hdbss = false;
>> +
>> + return 0;
>> + }
>> +
>> + /* Turn it on */
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
>> + if (!hdbss_pg)
>> + goto error_alloc;
>> +
>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
>> + .base_phys = page_to_phys(hdbss_pg),
>> + .size = size,
>> + .next_index = 0,
>> + };
>> + }
>> +
>> + kvm->arch.enable_hdbss = true;
>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
>> +
>> + /*
>> + * We should kick vcpus out of guest mode here to load new
>> + * vtcr value to vtcr_el2 register when re-enter guest mode.
>> + */
>> + kvm_for_each_vcpu(i, vcpu, kvm)
>> + kvm_vcpu_kick(vcpu);
>> +
>> + return 0;
>> +
>> +error_alloc:
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + if (vcpu->arch.hdbss.base_phys)
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> + }
>> +
>> + return -ENOMEM;
>> +}
>> +
>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>> struct kvm_enable_cap *cap)
>> {
>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>> r = 0;
>> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>> break;
>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>> + mutex_lock(&kvm->lock);
>> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
>> + mutex_unlock(&kvm->lock);
>> + break;
>> default:
>> break;
>> }
>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> r = kvm_supports_cacheable_pfnmap();
>> break;
>>
>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>> + r = system_supports_hdbss();
>> + break;
>> default:
>> r = 0;
>> }
>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>> if (kvm_dirty_ring_check_request(vcpu))
>> return 0;
>>
>> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>> + kvm_flush_hdbss_buffer(vcpu);
> I am curious on why we need a flush-hdbss request,
> Don't we have the flush function happening every time we run vcpu_put?
>
> Oh, I see, you want to check if there is anything needed inside the inner
> loop of vcpu_run, without having to vcpu_put. I think it makes sense.
>
> But instead of having this on guest entry, does not it make more sense to
> have it in guest exit? This way we flush every time (if needed) we exit the
> guest, and instead of having a vcpu request, we just require a vcpu kick
> and it should flush if needed.
>
> Maybe have vcpu_put just save the registers, and add the flush before
> handle_exit.
>
> What do you think?
Thank you for the feedback.

Indeed, in the initial version (v1), I placed the flush operation inside
handle_exit and used a vcpu_kick in kvm_arch_sync_dirty_log to trigger the
flush of the HDBSS buffer. However, during the review, Marc pointed out
that calling this function on every exit event is too frequent if it's not
always needed. Discussion link:
https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/

I agreed with his assessment. Therefore, in the current version, I've
separated the flush operation into more specific and less frequent points:
1. In vcpu_put.
2. During dirty log synchronization, by kicking the vCPU to trigger a
   request that flushes on its next exit.
3. When handling a specific HDBSSF event.

This ensures the flush happens only when necessary, avoiding the overhead
of doing it on every guest exit.
>> +
>> check_nested_vcpu_requests(vcpu);
>> }
>>
>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>
>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>> {
>> + /*
>> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
>> + * before reporting dirty_bitmap to userspace. Send a request with
>> + * KVM_REQUEST_WAIT to flush buffer synchronously.
>> + */
>> + struct kvm_vcpu *vcpu;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>>
>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
>> }
>>
>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
>> index 9db3f11a4754..600cbc4f8ae9 100644
>> --- a/arch/arm64/kvm/hyp/vhe/switch.c
>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
>> local_irq_restore(flags);
>> }
>>
>> +static void __load_hdbss(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm *kvm = vcpu->kvm;
>> + u64 br_el2, prod_el2;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>> +
>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>> + prod_el2 = vcpu->arch.hdbss.next_index;
>> +
>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
>> +
>> + isb();
>> +}
>> +
>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>> {
>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>> __vcpu_load_switch_sysregs(vcpu);
>> __vcpu_load_activate_traps(vcpu);
>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
>> + __load_hdbss(vcpu);
>> }
>>
>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
>> {
>> + kvm_flush_hdbss_buffer(vcpu);
>> __vcpu_put_deactivate_traps(vcpu);
>> __vcpu_put_switch_sysregs(vcpu);
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 070a01e53fcb..42b0710a16ce 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> if (writable)
>> prot |= KVM_PGTABLE_PROT_W;
>>
>> + if (writable && kvm->arch.enable_hdbss && logging_active)
>> + prot |= KVM_PGTABLE_PROT_DBM;
>> +
>> if (exec_fault)
>> prot |= KVM_PGTABLE_PROT_X;
>>
>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>> return 0;
>> }
>>
>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
>> +{
>> + int idx, curr_idx;
>> + u64 br_el2;
>> + u64 *hdbss_buf;
>> + struct kvm *kvm = vcpu->kvm;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>> +
>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>> +
>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
>> + if (curr_idx == 0 || br_el2 == 0)
>> + return;
>> +
>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
>> + if (!hdbss_buf)
>> + return;
>> +
>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
>> + for (idx = 0; idx < curr_idx; idx++) {
>> + u64 gpa;
>> +
>> + gpa = hdbss_buf[idx];
>> + if (!(gpa & HDBSS_ENTRY_VALID))
>> + continue;
>> +
>> + gpa &= HDBSS_ENTRY_IPA;
>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>> + }
> This will mark a page dirty for both dirty_bitmap or dirty_ring, depending
> on what is in use.
>
> Out of plain curiosity, have you planned / tested for the dirty-ring as
> well, or just for dirty-bitmap?
Currently, I have only tested this with dirty-bitmap mode.
I will test and ensure the HDBSS feature works correctly with dirty-ring
in the next version.
>> +
>> + /* reset HDBSS index */
>> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>> + vcpu->arch.hdbss.next_index = 0;
>> + isb();
>> +}
>> +
>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>> +{
>> + u64 prod;
>> + u64 fsc;
>> +
>> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>> +
>> + switch (fsc) {
>> + case HDBSSPROD_EL2_FSC_OK:
>> + /* Buffer full, which is reported as permission fault. */
>> + kvm_flush_hdbss_buffer(vcpu);
>> + return 1;
> Humm, flushing in a fault handler means hanging there, in IRQ context, for
> a while.
>
> Since we already deal with this on guest_exit (vcpu_put IIUC), why not just
> return in a way the vcpu has to exit the inner loop and let it flush there
> instead?
>
> Thanks!
> Leo
Thanks for the feedback.

If we flush on every guest exit (by moving the flush before handle_exit),
then we can indeed drop the flush from the fault handler and from vcpu_put.
However, given Marc's earlier concern about not imposing this overhead on
all vCPUs, I'd rather avoid flushing on every exit.

My current plan is to set a request bit in kvm_handle_hdbss_fault (via
kvm_make_request) and move the actual flush to the normal exit path, where
it can execute in a safe context. This also allows us to remove the flush
from the fault handler entirely.

Does that approach sound reasonable to you?
>> + case HDBSSPROD_EL2_FSC_ExternalAbort:
>> + case HDBSSPROD_EL2_FSC_GPF:
>> + return -EFAULT;
>> + default:
>> + /* Unknown fault. */
>> + WARN_ONCE(1,
>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
>> + fsc, prod, vcpu->vcpu_id);
>> + return -EFAULT;
>> + }
>> +}
>> +
>> /**
>> * kvm_handle_guest_abort - handles all 2nd stage aborts
>> * @vcpu: the VCPU pointer
>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>
>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>
>> + if (esr_iss2_is_hdbssf(esr))
>> + return kvm_handle_hdbss_fault(vcpu);
>> +
>> if (esr_fsc_is_translation_fault(esr)) {
>> /* Beyond sanitised PARange (which is the IPA limit) */
>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index 959532422d3a..c03a4b310b53 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>> kfree(vcpu->arch.vncr_tlb);
>> kfree(vcpu->arch.ccsidr);
>> +
>> + if (vcpu->kvm->arch.enable_hdbss)
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> }
>>
>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>> --
>> 2.33.0
Thanks!
Tian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-06 9:27 ` Tian Zheng
@ 2026-03-06 15:01 ` Leonardo Bras
2026-03-12 6:17 ` Tian Zheng
0 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2026-03-06 15:01 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
On Fri, Mar 06, 2026 at 05:27:58PM +0800, Tian Zheng wrote:
> Hi Leo,
>
> On 3/4/2026 11:40 PM, Leonardo Bras wrote:
> > Hi Tian,
> >
> > Few extra notes/questions below
> >
> > On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> > > From: eillon <yezhenyu2@huawei.com>
> > >
> > > HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> > > migration. This feature is only supported in VHE mode.
> > >
> > > Initially, S2 PTEs don't contain the DBM attribute. During migration,
> > > write faults are handled by user_mem_abort, which relaxes permissions
> > > and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> > > writes no longer trap, as the hardware automatically transitions the page
> > > from writable-clean to writable-dirty.
> > >
> > > KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
> > > enabled, the hardware observes the clean->dirty transition and records
> > > the corresponding page into the HDBSS buffer.
> > >
> > > During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> > > that check_vcpu_requests flushes the HDBSS buffer and propagates the
> > > accumulated dirty information into the userspace-visible dirty bitmap.
> > >
> > > Add fault handling for HDBSS including buffer full, external abort, and
> > > general protection fault (GPF).
> > >
> > > Signed-off-by: eillon <yezhenyu2@huawei.com>
> > > Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
> > > ---
> > > arch/arm64/include/asm/esr.h | 5 ++
> > > arch/arm64/include/asm/kvm_host.h | 17 +++++
> > > arch/arm64/include/asm/kvm_mmu.h | 1 +
> > > arch/arm64/include/asm/sysreg.h | 11 ++++
> > > arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> > > arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> > > arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> > > arch/arm64/kvm/reset.c | 3 +
> > > 8 files changed, 228 insertions(+)
> > >
> > > diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> > > index 81c17320a588..2e6b679b5908 100644
> > > --- a/arch/arm64/include/asm/esr.h
> > > +++ b/arch/arm64/include/asm/esr.h
> > > @@ -437,6 +437,11 @@
> > > #ifndef __ASSEMBLER__
> > > #include <asm/types.h>
> > >
> > > +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> > > +{
> > > + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> > > +}
> > > +
> > > static inline unsigned long esr_brk_comment(unsigned long esr)
> > > {
> > > return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 5d5a3bbdb95e..57ee6b53e061 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -55,12 +55,17 @@
> > > #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> > > #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> > > #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> > > +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
> > >
> > > #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> > > KVM_DIRTY_LOG_INITIALLY_SET)
> > >
> > > #define KVM_HAVE_MMU_RWLOCK
> > >
> > > +/* HDBSS entry field definitions */
> > > +#define HDBSS_ENTRY_VALID BIT(0)
> > > +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> > > +
> > > /*
> > > * Mode of operation configurable with kvm-arm.mode early param.
> > > * See Documentation/admin-guide/kernel-parameters.txt for more information.
> > > @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> > > u32 __attribute_const__ kvm_target_cpu(void);
> > > void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> > > void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
> > >
> > > struct kvm_hyp_memcache {
> > > phys_addr_t head;
> > > @@ -405,6 +411,8 @@ struct kvm_arch {
> > > * the associated pKVM instance in the hypervisor.
> > > */
> > > struct kvm_protected_vm pkvm;
> > > +
> > > + bool enable_hdbss;
> > > };
> > >
> > > struct kvm_vcpu_fault_info {
> > > @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> > > bool reset;
> > > };
> > >
> > > +struct vcpu_hdbss_state {
> > > + phys_addr_t base_phys;
> > > + u32 size;
> > > + u32 next_index;
> > > +};
> > > +
> > IIUC this is used once both on enable/disable and massively on
> > vcpu_put/get.
> >
> > What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead?
> > That way we avoid having masking operations in put/get as well as any
> > possible error we may have formatting those.
> >
> > The cost is doing those operations once for enable and once for disable,
> > which should be fine.
Hi Tian,
>
> Thanks for the suggestion. I actually started with storing the raw system
> register values, as you proposed.
>
> However, after discussing it with Oliver Upton in v1, we felt that keeping
> the base address, size, and index as separate fields makes the state easier
> to understand. Discussion link:
> https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/
>
> That's why I ended up changing the storage approach in the end.
>
Humm, FWIW I disagree with the above argument.

I would argue that vcpu_put should just save the registers, and not
actually know what they are about or how they are formatted at this point.
The responsibility of understanding the fields and their usage should be
in the code that actually consumes them.

IIUC, in kvm_vcpu_put_vhe and kvm_vcpu_load_vhe there are calls that do
more than only save the register as it is.
> > > struct vncr_tlb;
> > >
> > > struct kvm_vcpu_arch {
> > > @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
> > >
> > > /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> > > struct vncr_tlb *vncr_tlb;
> > > +
> > > + /* HDBSS registers info */
> > > + struct vcpu_hdbss_state hdbss;
> > > };
> > >
> > > /*
> > > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > > index d968aca0461a..3fea8cfe8869 100644
> > > --- a/arch/arm64/include/asm/kvm_mmu.h
> > > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > >
> > > int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > > int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
> > >
> > > phys_addr_t kvm_mmu_get_httbr(void);
> > > phys_addr_t kvm_get_idmap_vector(void);
> > > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> > > index f4436ecc630c..d11f4d0dd4e7 100644
> > > --- a/arch/arm64/include/asm/sysreg.h
> > > +++ b/arch/arm64/include/asm/sysreg.h
> > > @@ -1039,6 +1039,17 @@
> > >
> > > #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> > > GCS_CAP_VALID_TOKEN)
> > > +
> > > +/*
> > > + * Definitions for the HDBSS feature
> > > + */
> > > +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> > > +
> > > +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> > > + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> > > +
> > > +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> > > +
> > > /*
> > > * Definitions for GICv5 instructions
> > > */
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 29f0326f7e00..d64da05e25c4 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> > > return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> > > }
> > >
> > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct page *hdbss_pg;
> > > +
> > > + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> > > + if (hdbss_pg)
> > > + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> > > +
> > > + vcpu->arch.hdbss.size = 0;
> > > +}
> > > +
> > > +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> > > + struct kvm_enable_cap *cap)
> > > +{
> > > + unsigned long i;
> > > + struct kvm_vcpu *vcpu;
> > > + struct page *hdbss_pg = NULL;
> > > + __u64 size = cap->args[0];
> > > + bool enable = cap->args[1] ? true : false;
> > > +
> > > + if (!system_supports_hdbss())
> > > + return -EINVAL;
> > > +
> > > + if (size > HDBSS_MAX_SIZE)
> > > + return -EINVAL;
> > > +
> > > + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> > > + return 0;
> > > +
> > > + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> > > + return -EINVAL;
> > > +
> > > + if (!enable) { /* Turn it off */
> > > + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> > > +
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + /* Kick vcpus to flush hdbss buffer. */
> > > + kvm_vcpu_kick(vcpu);
> > > +
> > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > + }
> > > +
> > > + kvm->arch.enable_hdbss = false;
> > > +
> > > + return 0;
> > > + }
> > > +
> > > + /* Turn it on */
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> > > + if (!hdbss_pg)
> > > + goto error_alloc;
> > > +
> > > + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> > > + .base_phys = page_to_phys(hdbss_pg),
> > > + .size = size,
> > > + .next_index = 0,
> > > + };
> > > + }
> > > +
> > > + kvm->arch.enable_hdbss = true;
> > > + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> > > +
> > > + /*
> > > + * We should kick vcpus out of guest mode here to load new
> > > + * vtcr value to vtcr_el2 register when re-enter guest mode.
> > > + */
> > > + kvm_for_each_vcpu(i, vcpu, kvm)
> > > + kvm_vcpu_kick(vcpu);
> > > +
> > > + return 0;
> > > +
> > > +error_alloc:
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + if (vcpu->arch.hdbss.base_phys)
> > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > + }
> > > +
> > > + return -ENOMEM;
> > > +}
> > > +
> > > int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > struct kvm_enable_cap *cap)
> > > {
> > > @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > r = 0;
> > > set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > > break;
> > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > + mutex_lock(&kvm->lock);
> > > + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> > > + mutex_unlock(&kvm->lock);
> > > + break;
> > > default:
> > > break;
> > > }
> > > @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > r = kvm_supports_cacheable_pfnmap();
> > > break;
> > >
> > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > + r = system_supports_hdbss();
> > > + break;
> > > default:
> > > r = 0;
> > > }
> > > @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > if (kvm_dirty_ring_check_request(vcpu))
> > > return 0;
> > >
> > > + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> > > + kvm_flush_hdbss_buffer(vcpu);
> > I am curious why we need a flush-hdbss request.
> > Don't we have the flush function happening every time we run vcpu_put?
> >
> > Oh, I see, you want to check if there is anything needed inside the inner
> > loop of vcpu_run, without having to vcpu_put. I think it makes sense.
> >
> > But instead of having this on guest entry, doesn't it make more sense to
> > have it on guest exit? This way we flush every time (if needed) we exit the
> > guest, and instead of having a vcpu request, we just require a vcpu kick
> > and it should flush if needed.
> >
> > Maybe have vcpu_put just save the registers, and add the flush before
> > handle_exit.
> >
> > What do you think?
>
>
> Thank you for the feedback.
>
> Indeed, in the initial version (v1), I placed the flush operation inside
> handle_exit and used a vcpu_kick in kvm_arch_sync_dirty_log to trigger
> the flush of the HDBSS buffer.
>
> However, during the review, Marc pointed out that calling this function
> on every exit event is too frequent if it's not always needed.
>
> Discussion link:
> https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/
>
> I agreed with his assessment. Therefore, in the current version, I've
> separated the flush operation into more specific and less frequent points:
>
> 1. In vcpu_put.
>
> 2. During dirty log synchronization, by kicking the vCPU to trigger a
> request that flushes on its next exit.
>
> 3. When handling a specific HDBSSF event.
>
> This ensures the flush happens only when necessary, avoiding the
> overhead of doing it on every guest exit.
>
Fair enough, calling it every time you go in the inner loop may be too
much, even with a check to make sure it needs to run.
Having it as a request means you may do that sometimes without
leaving the inner loop. That could be useful if you want to use it with the
IRQ handler to deal with full buffer, or any error, as well as dealing with
a regular request in the 2nd case.
While I agree it's needed to run before leaving guest context (i.e. leaving
the inner loop), I am not really sure vcpu_put is the best place to put the
flushing. I may be wrong, but to me it looks more like a place to save
registers and context, as well as dropping refcounts or something like
that. I would not expect a flush to happen in vcpu_put if I was reading
the code.
Would it be too bad if we made it a call before vcpu_put, in
kvm_arch_vcpu_ioctl_run()?
> > > +
> > > check_nested_vcpu_requests(vcpu);
> > > }
> > >
> > > @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
> > >
> > > void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> > > {
> > > + /*
> > > + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> > > + * before reporting dirty_bitmap to userspace. Send a request with
> > > + * KVM_REQUEST_WAIT to flush buffer synchronously.
> > > + */
> > > + struct kvm_vcpu *vcpu;
> > > +
> > > + if (!kvm->arch.enable_hdbss)
> > > + return;
> > >
> > > + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> > > }
> > >
> > > static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> > > diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> > > index 9db3f11a4754..600cbc4f8ae9 100644
> > > --- a/arch/arm64/kvm/hyp/vhe/switch.c
> > > +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> > > @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> > > local_irq_restore(flags);
> > > }
> > >
> > > +static void __load_hdbss(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct kvm *kvm = vcpu->kvm;
> > > + u64 br_el2, prod_el2;
> > > +
> > > + if (!kvm->arch.enable_hdbss)
> > > + return;
> > > +
> > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > + prod_el2 = vcpu->arch.hdbss.next_index;
> > > +
> > > + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> > > + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> > > +
> > > + isb();
> > > +}
> > > +
> > > void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > {
> > > host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> > > @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > __vcpu_load_switch_sysregs(vcpu);
> > > __vcpu_load_activate_traps(vcpu);
> > > __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> > > + __load_hdbss(vcpu);
> > > }
> > >
> > > void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> > > {
> > > + kvm_flush_hdbss_buffer(vcpu);
> > > __vcpu_put_deactivate_traps(vcpu);
> > > __vcpu_put_switch_sysregs(vcpu);
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 070a01e53fcb..42b0710a16ce 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > if (writable)
> > > prot |= KVM_PGTABLE_PROT_W;
> > >
> > > + if (writable && kvm->arch.enable_hdbss && logging_active)
> > > + prot |= KVM_PGTABLE_PROT_DBM;
> > > +
> > > if (exec_fault)
> > > prot |= KVM_PGTABLE_PROT_X;
> > >
> > > @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > return 0;
> > > }
> > >
> > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> > > +{
> > > + int idx, curr_idx;
> > > + u64 br_el2;
> > > + u64 *hdbss_buf;
> > > + struct kvm *kvm = vcpu->kvm;
> > > +
> > > + if (!kvm->arch.enable_hdbss)
> > > + return;
> > > +
> > > + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > +
> > > + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> > > + if (curr_idx == 0 || br_el2 == 0)
> > > + return;
> > > +
> > > + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> > > + if (!hdbss_buf)
> > > + return;
> > > +
> > > + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> > > + for (idx = 0; idx < curr_idx; idx++) {
> > > + u64 gpa;
> > > +
> > > + gpa = hdbss_buf[idx];
> > > + if (!(gpa & HDBSS_ENTRY_VALID))
> > > + continue;
> > > +
> > > + gpa &= HDBSS_ENTRY_IPA;
> > > + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> > > + }
> > This will mark a page dirty in either the dirty_bitmap or the dirty_ring,
> > depending on what is in use.
> >
> > Out of plain curiosity, have you planned / tested for the dirty-ring as
> > well, or just for dirty-bitmap?
>
> Currently, I have only tested this with dirty-bitmap mode.
>
> I will test and ensure the HDBSS feature works correctly with dirty-ring
> in the next version.
>
Thanks!
> > > +
> > > + /* reset HDBSS index */
> > > + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> > > + vcpu->arch.hdbss.next_index = 0;
> > > + isb();
> > > +}
> > > +
> > > +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> > > +{
> > > + u64 prod;
> > > + u64 fsc;
> > > +
> > > + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> > > + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> > > +
> > > + switch (fsc) {
> > > + case HDBSSPROD_EL2_FSC_OK:
> > > + /* Buffer full, which is reported as permission fault. */
> > > + kvm_flush_hdbss_buffer(vcpu);
> > > + return 1;
> > Humm, flushing in a fault handler means hanging there, in IRQ context, for
> > a while.
> >
> > Since we already deal with this on guest_exit (vcpu_put IIUC), why not just
> > return in a way the vcpu has to exit the inner loop and let it flush there
> > instead?
> >
> > Thanks!
> > Leo
>
>
> Thanks for the feedback.
>
> If we flush on every guest exit (by moving the flush before handle_exit),
> then we can indeed drop the flush from the fault handler and from vcpu_put.
>
> However, given Marc's earlier concern about not imposing this overhead
> on all vCPUs, I'd rather avoid flushing on every exit.
>
> My current plan is to set a request bit in kvm_handle_hdbss_fault (via
> kvm_make_request), and move the actual flush to the normal exit path,
> where it can execute in a safe context. This also allows us to remove
> the flush from the fault handler entirely.
>
> Does that approach sound reasonable to you?
>
Yes, I think it looks much better, as the fault will cause the guest to
exit, and it can run the flush on its way back in.
Thanks!
Leo
> > > + case HDBSSPROD_EL2_FSC_ExternalAbort:
> > > + case HDBSSPROD_EL2_FSC_GPF:
> > > + return -EFAULT;
> > > + default:
> > > + /* Unknown fault. */
> > > + WARN_ONCE(1,
> > > + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> > > + fsc, prod, vcpu->vcpu_id);
> > > + return -EFAULT;
> > > + }
> > > +}
> > > +
> > > /**
> > > * kvm_handle_guest_abort - handles all 2nd stage aborts
> > > * @vcpu: the VCPU pointer
> > > @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > >
> > > is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
> > >
> > > + if (esr_iss2_is_hdbssf(esr))
> > > + return kvm_handle_hdbss_fault(vcpu);
> > > +
> > > if (esr_fsc_is_translation_fault(esr)) {
> > > /* Beyond sanitised PARange (which is the IPA limit) */
> > > if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> > > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> > > index 959532422d3a..c03a4b310b53 100644
> > > --- a/arch/arm64/kvm/reset.c
> > > +++ b/arch/arm64/kvm/reset.c
> > > @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> > > free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> > > kfree(vcpu->arch.vncr_tlb);
> > > kfree(vcpu->arch.ccsidr);
> > > +
> > > + if (vcpu->kvm->arch.enable_hdbss)
> > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > }
> > >
> > > static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> > > --
> > > 2.33.0
>
> Thanks!
>
> Tian
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-06 15:01 ` Leonardo Bras
@ 2026-03-12 6:17 ` Tian Zheng
2026-03-12 12:06 ` Leonardo Bras
0 siblings, 1 reply; 24+ messages in thread
From: Tian Zheng @ 2026-03-12 6:17 UTC (permalink / raw)
To: Leonardo Bras
Cc: maz, oupton, catalin.marinas, corbet, pbonzini, will, yuzenghui,
wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2, linuxarm,
joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose
On 3/6/2026 11:01 PM, Leonardo Bras wrote:
> On Fri, Mar 06, 2026 at 05:27:58PM +0800, Tian Zheng wrote:
>> Hi Leo,
>>
>> On 3/4/2026 11:40 PM, Leonardo Bras wrote:
>>> Hi Tian,
>>>
>>> Few extra notes/questions below
>>>
>>> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>>>> From: eillon <yezhenyu2@huawei.com>
>>>>
>>>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
>>>> migration. This feature is only supported in VHE mode.
>>>>
>>>> Initially, S2 PTEs don't contain the DBM attribute. During migration,
>>>> write faults are handled by user_mem_abort, which relaxes permissions
>>>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
>>>> writes no longer trap, as the hardware automatically transitions the page
>>>> from writable-clean to writable-dirty.
>>>>
>>>> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
>>>> enabled, the hardware observes the clean->dirty transition and records
>>>> the corresponding page into the HDBSS buffer.
>>>>
>>>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
>>>> that check_vcpu_requests flushes the HDBSS buffer and propagates the
>>>> accumulated dirty information into the userspace-visible dirty bitmap.
>>>>
>>>> Add fault handling for HDBSS including buffer full, external abort, and
>>>> general protection fault (GPF).
>>>>
>>>> Signed-off-by: eillon <yezhenyu2@huawei.com>
>>>> Signed-off-by: Tian Zheng<zhengtian10@huawei.com>
>>>> ---
>>>> arch/arm64/include/asm/esr.h | 5 ++
>>>> arch/arm64/include/asm/kvm_host.h | 17 +++++
>>>> arch/arm64/include/asm/kvm_mmu.h | 1 +
>>>> arch/arm64/include/asm/sysreg.h | 11 ++++
>>>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
>>>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
>>>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
>>>> arch/arm64/kvm/reset.c | 3 +
>>>> 8 files changed, 228 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>>>> index 81c17320a588..2e6b679b5908 100644
>>>> --- a/arch/arm64/include/asm/esr.h
>>>> +++ b/arch/arm64/include/asm/esr.h
>>>> @@ -437,6 +437,11 @@
>>>> #ifndef __ASSEMBLER__
>>>> #include <asm/types.h>
>>>>
>>>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
>>>> +{
>>>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
>>>> +}
>>>> +
>>>> static inline unsigned long esr_brk_comment(unsigned long esr)
>>>> {
>>>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>> index 5d5a3bbdb95e..57ee6b53e061 100644
>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>> @@ -55,12 +55,17 @@
>>>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
>>>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
>>>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
>>>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>>>>
>>>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>>>> KVM_DIRTY_LOG_INITIALLY_SET)
>>>>
>>>> #define KVM_HAVE_MMU_RWLOCK
>>>>
>>>> +/* HDBSS entry field definitions */
>>>> +#define HDBSS_ENTRY_VALID BIT(0)
>>>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>>>> +
>>>> /*
>>>> * Mode of operation configurable with kvm-arm.mode early param.
>>>> * See Documentation/admin-guide/kernel-parameters.txt for more information.
>>>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
>>>> u32 __attribute_const__ kvm_target_cpu(void);
>>>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>>>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>>>>
>>>> struct kvm_hyp_memcache {
>>>> phys_addr_t head;
>>>> @@ -405,6 +411,8 @@ struct kvm_arch {
>>>> * the associated pKVM instance in the hypervisor.
>>>> */
>>>> struct kvm_protected_vm pkvm;
>>>> +
>>>> + bool enable_hdbss;
>>>> };
>>>>
>>>> struct kvm_vcpu_fault_info {
>>>> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
>>>> bool reset;
>>>> };
>>>>
>>>> +struct vcpu_hdbss_state {
>>>> + phys_addr_t base_phys;
>>>> + u32 size;
>>>> + u32 next_index;
>>>> +};
>>>> +
>>> IIUC this is used once both on enable/disable and massively on
>>> vcpu_put/get.
>>>
>>> What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead?
>>> That way we avoid having masking operations in put/get as well as any
>>> possible error we may have formatting those.
>>>
>>> The cost is doing those operations once for enable and once for disable,
>>> which should be fine.
> Hi Tian,
>
>>
>> Thanks for the suggestion. I actually started with storing the raw
>> system register values, as you proposed.
>>
>> However, after discussing it with Oliver Upton in v1, we felt that
>> keeping the base address, size, and index as separate fields makes the
>> state easier to understand.
>>
>> Discussion link:
>> https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/
>>
>> That's why I ended up changing the storage approach in the end.
>>
> Humm, FWIW I disagree with the above argument.
> I would argue that vcpu_put should save the registers, and not
> actually know what they are about or how they are formatted at this point.
>
> The responsibility for understanding its fields and their usage should
> be in the code that actually uses them.
>
> IIUC, kvm_vcpu_put_vhe and kvm_vcpu_load_vhe already call functions
> that do more than just save the registers as-is.
ok, thx! I'll update the struct to store only the raw register values.
>>>> struct vncr_tlb;
>>>>
>>>> struct kvm_vcpu_arch {
>>>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>>>>
>>>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
>>>> struct vncr_tlb *vncr_tlb;
>>>> +
>>>> + /* HDBSS registers info */
>>>> + struct vcpu_hdbss_state hdbss;
>>>> };
>>>>
>>>> /*
>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>>> index d968aca0461a..3fea8cfe8869 100644
>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>>>
>>>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>>>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>>>>
>>>> phys_addr_t kvm_mmu_get_httbr(void);
>>>> phys_addr_t kvm_get_idmap_vector(void);
>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>> index f4436ecc630c..d11f4d0dd4e7 100644
>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>> @@ -1039,6 +1039,17 @@
>>>>
>>>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
>>>> GCS_CAP_VALID_TOKEN)
>>>> +
>>>> +/*
>>>> + * Definitions for the HDBSS feature
>>>> + */
>>>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
>>>> +
>>>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
>>>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
>>>> +
>>>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
>>>> +
>>>> /*
>>>> * Definitions for GICv5 instructions
>>>> */
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index 29f0326f7e00..d64da05e25c4 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>>>> }
>>>>
>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + struct page *hdbss_pg;
>>>> +
>>>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
>>>> + if (hdbss_pg)
>>>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
>>>> +
>>>> + vcpu->arch.hdbss.size = 0;
>>>> +}
>>>> +
>>>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
>>>> + struct kvm_enable_cap *cap)
>>>> +{
>>>> + unsigned long i;
>>>> + struct kvm_vcpu *vcpu;
>>>> + struct page *hdbss_pg = NULL;
>>>> + __u64 size = cap->args[0];
>>>> + bool enable = cap->args[1] ? true : false;
>>>> +
>>>> + if (!system_supports_hdbss())
>>>> + return -EINVAL;
>>>> +
>>>> + if (size > HDBSS_MAX_SIZE)
>>>> + return -EINVAL;
>>>> +
>>>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
>>>> + return 0;
>>>> +
>>>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
>>>> + return -EINVAL;
>>>> +
>>>> + if (!enable) { /* Turn it off */
>>>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
>>>> +
>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> + /* Kick vcpus to flush hdbss buffer. */
>>>> + kvm_vcpu_kick(vcpu);
>>>> +
>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>> + }
>>>> +
>>>> + kvm->arch.enable_hdbss = false;
>>>> +
>>>> + return 0;
>>>> + }
>>>> +
>>>> + /* Turn it on */
>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
>>>> + if (!hdbss_pg)
>>>> + goto error_alloc;
>>>> +
>>>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
>>>> + .base_phys = page_to_phys(hdbss_pg),
>>>> + .size = size,
>>>> + .next_index = 0,
>>>> + };
>>>> + }
>>>> +
>>>> + kvm->arch.enable_hdbss = true;
>>>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
>>>> +
>>>> + /*
>>>> + * We should kick vcpus out of guest mode here to load new
>>>> + * vtcr value to vtcr_el2 register when re-enter guest mode.
>>>> + */
>>>> + kvm_for_each_vcpu(i, vcpu, kvm)
>>>> + kvm_vcpu_kick(vcpu);
>>>> +
>>>> + return 0;
>>>> +
>>>> +error_alloc:
>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> + if (vcpu->arch.hdbss.base_phys)
>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>> + }
>>>> +
>>>> + return -ENOMEM;
>>>> +}
>>>> +
>>>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>> struct kvm_enable_cap *cap)
>>>> {
>>>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>> r = 0;
>>>> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>>>> break;
>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>> + mutex_lock(&kvm->lock);
>>>> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
>>>> + mutex_unlock(&kvm->lock);
>>>> + break;
>>>> default:
>>>> break;
>>>> }
>>>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>> r = kvm_supports_cacheable_pfnmap();
>>>> break;
>>>>
>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>> + r = system_supports_hdbss();
>>>> + break;
>>>> default:
>>>> r = 0;
>>>> }
>>>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>> if (kvm_dirty_ring_check_request(vcpu))
>>>> return 0;
>>>>
>>>> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>>>> + kvm_flush_hdbss_buffer(vcpu);
>>> I am curious why we need a flush-hdbss request.
>>> Don't we have the flush function happening every time we run vcpu_put?
>>>
>>> Oh, I see, you want to check if there is anything needed inside the inner
>>> loop of vcpu_run, without having to vcpu_put. I think it makes sense.
>>>
>>> But instead of having this on guest entry, doesn't it make more sense to
>>> have it on guest exit? This way we flush every time (if needed) we exit the
>>> guest, and instead of having a vcpu request, we just require a vcpu kick
>>> and it should flush if needed.
>>>
>>> Maybe have vcpu_put just save the registers, and add the flush before
>>> handle_exit.
>>>
>>> What do you think?
>>
>> Thank you for the feedback.
>>
>> Indeed, in the initial version (v1), I placed the flush operation inside
>> handle_exit and used a vcpu_kick in kvm_arch_sync_dirty_log to trigger
>> the flush of the HDBSS buffer.
>>
>> However, during the review, Marc pointed out that calling this function
>> on every exit event is too frequent if it's not always needed.
>>
>> Discussion link:
>> https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/
>>
>> I agreed with his assessment. Therefore, in the current version, I've
>> separated the flush operation into more specific and less frequent points:
>>
>> 1. In vcpu_put.
>>
>> 2. During dirty log synchronization, by kicking the vCPU to trigger a
>> request that flushes on its next exit.
>>
>> 3. When handling a specific HDBSSF event.
>>
>> This ensures the flush happens only when necessary, avoiding the
>> overhead of doing it on every guest exit.
>>
> Fair enough, calling it every time you go in the inner loop may be too
> much, even with a check to make sure it needs to run.
>
> Having it as a request means you may do that sometimes without
> leaving the inner loop. That could be useful if you want to use it with the
> IRQ handler to deal with full buffer, or any error, as well as dealing with
> a regular request in the 2nd case.
>
> While I agree it's needed to run before leaving guest context (i.e. leaving
> the inner loop), I am not really sure vcpu_put is the best place to put the
> flushing. I may be wrong, but to me it looks more like a place to save
> registers and context, as well as dropping refcounts or something like
> that. I would not expect a flush to happen in vcpu_put if I was reading
> the code.
>
> Would it be too bad if we made it a call before vcpu_put, in
> kvm_arch_vcpu_ioctl_run()?
>
Thanks for the clarification. After looking again at the code paths, I
agree that kvm_vcpu_put_vhe() and kvm_arch_vcpu_put() are really meant to
be pure save/restore paths, so embedding HDBSS flushing there isn't ideal.
My remaining concern is that kvm_arch_vcpu_ioctl_run() doesn't cover the
case where the vCPU is scheduled out. In that case we still leave guest
context, but we don't return through the run loop, so a flush placed only
in ioctl_run() wouldn't run.
Any suggestions on where this should hook in? Would introducing a small
arch-specific "guest exit" helper, invoked from kvm_arch_vcpu_put(), be
acceptable?
Thanks!
Tian
>
>>>> +
>>>> check_nested_vcpu_requests(vcpu);
>>>> }
>>>>
>>>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>>>
>>>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>>>> {
>>>> + /*
>>>> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
>>>> + * before reporting dirty_bitmap to userspace. Send a request with
>>>> + * KVM_REQUEST_WAIT to flush buffer synchronously.
>>>> + */
>>>> + struct kvm_vcpu *vcpu;
>>>> +
>>>> + if (!kvm->arch.enable_hdbss)
>>>> + return;
>>>>
>>>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
>>>> }
>>>>
>>>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
>>>> index 9db3f11a4754..600cbc4f8ae9 100644
>>>> --- a/arch/arm64/kvm/hyp/vhe/switch.c
>>>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
>>>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
>>>> local_irq_restore(flags);
>>>> }
>>>>
>>>> +static void __load_hdbss(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + struct kvm *kvm = vcpu->kvm;
>>>> + u64 br_el2, prod_el2;
>>>> +
>>>> + if (!kvm->arch.enable_hdbss)
>>>> + return;
>>>> +
>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>> + prod_el2 = vcpu->arch.hdbss.next_index;
>>>> +
>>>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
>>>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
>>>> +
>>>> + isb();
>>>> +}
>>>> +
>>>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>> {
>>>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
>>>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>> __vcpu_load_switch_sysregs(vcpu);
>>>> __vcpu_load_activate_traps(vcpu);
>>>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
>>>> + __load_hdbss(vcpu);
>>>> }
>>>>
>>>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
>>>> {
>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>> __vcpu_put_deactivate_traps(vcpu);
>>>> __vcpu_put_switch_sysregs(vcpu);
>>>>
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index 070a01e53fcb..42b0710a16ce 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>> if (writable)
>>>> prot |= KVM_PGTABLE_PROT_W;
>>>>
>>>> + if (writable && kvm->arch.enable_hdbss && logging_active)
>>>> + prot |= KVM_PGTABLE_PROT_DBM;
>>>> +
>>>> if (exec_fault)
>>>> prot |= KVM_PGTABLE_PROT_X;
>>>>
>>>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>>>> return 0;
>>>> }
>>>>
>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + int idx, curr_idx;
>>>> + u64 br_el2;
>>>> + u64 *hdbss_buf;
>>>> + struct kvm *kvm = vcpu->kvm;
>>>> +
>>>> + if (!kvm->arch.enable_hdbss)
>>>> + return;
>>>> +
>>>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>> +
>>>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
>>>> + if (curr_idx == 0 || br_el2 == 0)
>>>> + return;
>>>> +
>>>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
>>>> + if (!hdbss_buf)
>>>> + return;
>>>> +
>>>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
>>>> + for (idx = 0; idx < curr_idx; idx++) {
>>>> + u64 gpa;
>>>> +
>>>> + gpa = hdbss_buf[idx];
>>>> + if (!(gpa & HDBSS_ENTRY_VALID))
>>>> + continue;
>>>> +
>>>> + gpa &= HDBSS_ENTRY_IPA;
>>>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>>>> + }
>>> This will mark a page dirty in either the dirty_bitmap or the dirty_ring,
>>> depending on what is in use.
>>>
>>> Out of plain curiosity, have you planned / tested for the dirty-ring as
>>> well, or just for dirty-bitmap?
>>
>> Currently, I have only tested this with dirty-bitmap mode.
>>
>> I will test and ensure the HDBSS feature works correctly with dirty-ring
>> in the next version.
>>
>>
> Thanks!
>
>>>> +
>>>> + /* reset HDBSS index */
>>>> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>>>> + vcpu->arch.hdbss.next_index = 0;
>>>> + isb();
>>>> +}
>>>> +
>>>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + u64 prod;
>>>> + u64 fsc;
>>>> +
>>>> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>>>> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>>>> +
>>>> + switch (fsc) {
>>>> + case HDBSSPROD_EL2_FSC_OK:
>>>> + /* Buffer full, which is reported as permission fault. */
>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>> + return 1;
>>> Humm, flushing in a fault handler means hanging there, in IRQ context, for
>>> a while.
>>>
>>> Since we already deal with this on guest_exit (vcpu_put IIUC), why not just
>>> return in a way the vcpu has to exit the inner loop and let it flush there
>>> instead?
>>>
>>> Thanks!
>>> Leo
>>
>> Thanks for the feedback.
>>
>> If we flush on every guest exit (by moving the flush before handle_exit),
>> then we can indeed drop the flush from the fault handler and from vcpu_put.
>>
>> However, given Marc's earlier concern about not imposing this overhead
>> on all vCPUs, I'd rather avoid flushing on every exit.
>>
>> My current plan is to set a request bit in kvm_handle_hdbss_fault (via
>> kvm_make_request), and move the actual flush to the normal exit path,
>> where it can execute in a safe context. This also allows us to remove
>> the flush from the fault handler entirely.
>>
>> Does that approach sound reasonable to you?
>>
> Yes, I think it looks much better, as the fault will cause the guest to exit,
> and it can run the flush on its way back in.
>
> Thanks!
> Leo
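[The request-bit plan agreed above can be sketched as a minimal userspace
model: the fault handler only queues work, and the flush runs later from the
request-processing path. REQ_FLUSH_HDBSS, vcpu_make_request() and
vcpu_check_request() are hypothetical stand-ins for KVM_REQ_FLUSH_HDBSS /
kvm_make_request() / kvm_check_request(); treat this as an illustration of
the control flow, not the actual KVM code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REQ_FLUSH_HDBSS (1u << 0)

struct vcpu {
	uint32_t requests;
	int flush_count;
};

static void vcpu_make_request(struct vcpu *v, uint32_t req)
{
	v->requests |= req;
}

static bool vcpu_check_request(struct vcpu *v, uint32_t req)
{
	if (!(v->requests & req))
		return false;
	v->requests &= ~req;	/* clear-on-check, like kvm_check_request() */
	return true;
}

/* Fault handler: no flushing here, just queue the work. */
static int handle_hdbss_buffer_full(struct vcpu *v)
{
	vcpu_make_request(v, REQ_FLUSH_HDBSS);
	return 1;		/* resume the guest via the run loop */
}

/* Request processing on the way back into the guest. */
static void check_vcpu_requests(struct vcpu *v)
{
	if (vcpu_check_request(v, REQ_FLUSH_HDBSS))
		v->flush_count++;	/* would call kvm_flush_hdbss_buffer() */
}
```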
>
>>>> + case HDBSSPROD_EL2_FSC_ExternalAbort:
>>>> + case HDBSSPROD_EL2_FSC_GPF:
>>>> + return -EFAULT;
>>>> + default:
>>>> + /* Unknown fault. */
>>>> + WARN_ONCE(1,
>>>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
>>>> + fsc, prod, vcpu->vcpu_id);
>>>> + return -EFAULT;
>>>> + }
>>>> +}
>>>> +
>>>> /**
>>>> * kvm_handle_guest_abort - handles all 2nd stage aborts
>>>> * @vcpu: the VCPU pointer
>>>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>>
>>>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>>>
>>>> + if (esr_iss2_is_hdbssf(esr))
>>>> + return kvm_handle_hdbss_fault(vcpu);
>>>> +
>>>> if (esr_fsc_is_translation_fault(esr)) {
>>>> /* Beyond sanitised PARange (which is the IPA limit) */
>>>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
>>>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>>>> index 959532422d3a..c03a4b310b53 100644
>>>> --- a/arch/arm64/kvm/reset.c
>>>> +++ b/arch/arm64/kvm/reset.c
>>>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>>>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>>>> kfree(vcpu->arch.vncr_tlb);
>>>> kfree(vcpu->arch.ccsidr);
>>>> +
>>>> + if (vcpu->kvm->arch.enable_hdbss)
>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>> }
>>>>
>>>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>>>> --
>>>> 2.33.0
>> Thanks!
>>
>> Tian
>>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-12 6:17 ` Tian Zheng
@ 2026-03-12 12:06 ` Leonardo Bras
2026-03-12 13:13 ` Tian Zheng
0 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2026-03-12 12:06 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
On Thu, Mar 12, 2026 at 02:17:41PM +0800, Tian Zheng wrote:
>
> On 3/6/2026 11:01 PM, Leonardo Bras wrote:
> > On Fri, Mar 06, 2026 at 05:27:58PM +0800, Tian Zheng wrote:
> > > Hi Leo,
> > >
> > > On 3/4/2026 11:40 PM, Leonardo Bras wrote:
> > > > Hi Tian,
> > > >
> > > > Few extra notes/questions below
> > > >
> > > > On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> > > > > From: eillon<yezhenyu2@huawei.com>
> > > > >
> > > > > HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> > > > > migration. This feature is only supported in VHE mode.
> > > > >
> > > > > Initially, S2 PTEs don't contain the DBM attribute. During migration,
> > > > > write faults are handled by user_mem_abort, which relaxes permissions
> > > > > and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> > > > > writes no longer trap, as the hardware automatically transitions the page
> > > > > from writable-clean to writable-dirty.
> > > > >
> > > > > KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
> > > > > enabled, the hardware observes the clean->dirty transition and records
> > > > > the corresponding page into the HDBSS buffer.
> > > > >
> > > > > During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> > > > > that check_vcpu_requests flushes the HDBSS buffer and propagates the
> > > > > accumulated dirty information into the userspace-visible dirty bitmap.
> > > > >
> > > > > Add fault handling for HDBSS including buffer full, external abort, and
> > > > > general protection fault (GPF).
> > > > >
> > > > > Signed-off-by: eillon<yezhenyu2@huawei.com>
> > > > > Signed-off-by: Tian Zheng<zhengtian10@huawei.com>
> > > > > ---
> > > > > arch/arm64/include/asm/esr.h | 5 ++
> > > > > arch/arm64/include/asm/kvm_host.h | 17 +++++
> > > > > arch/arm64/include/asm/kvm_mmu.h | 1 +
> > > > > arch/arm64/include/asm/sysreg.h | 11 ++++
> > > > > arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> > > > > arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> > > > > arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> > > > > arch/arm64/kvm/reset.c | 3 +
> > > > > 8 files changed, 228 insertions(+)
> > > > >
> > > > > diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> > > > > index 81c17320a588..2e6b679b5908 100644
> > > > > --- a/arch/arm64/include/asm/esr.h
> > > > > +++ b/arch/arm64/include/asm/esr.h
> > > > > @@ -437,6 +437,11 @@
> > > > > #ifndef __ASSEMBLER__
> > > > > #include <asm/types.h>
> > > > >
> > > > > +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> > > > > +{
> > > > > + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> > > > > +}
> > > > > +
> > > > > static inline unsigned long esr_brk_comment(unsigned long esr)
> > > > > {
> > > > > return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > > index 5d5a3bbdb95e..57ee6b53e061 100644
> > > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > > @@ -55,12 +55,17 @@
> > > > > #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> > > > > #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> > > > > #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> > > > > +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
> > > > >
> > > > > #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> > > > > KVM_DIRTY_LOG_INITIALLY_SET)
> > > > >
> > > > > #define KVM_HAVE_MMU_RWLOCK
> > > > >
> > > > > +/* HDBSS entry field definitions */
> > > > > +#define HDBSS_ENTRY_VALID BIT(0)
> > > > > +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> > > > > +
> > > > > /*
> > > > > * Mode of operation configurable with kvm-arm.mode early param.
> > > > > * See Documentation/admin-guide/kernel-parameters.txt for more information.
> > > > > @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> > > > > u32 __attribute_const__ kvm_target_cpu(void);
> > > > > void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> > > > > void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> > > > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
> > > > >
> > > > > struct kvm_hyp_memcache {
> > > > > phys_addr_t head;
> > > > > @@ -405,6 +411,8 @@ struct kvm_arch {
> > > > > * the associated pKVM instance in the hypervisor.
> > > > > */
> > > > > struct kvm_protected_vm pkvm;
> > > > > +
> > > > > + bool enable_hdbss;
> > > > > };
> > > > >
> > > > > struct kvm_vcpu_fault_info {
> > > > > @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> > > > > bool reset;
> > > > > };
> > > > >
> > > > > +struct vcpu_hdbss_state {
> > > > > + phys_addr_t base_phys;
> > > > > + u32 size;
> > > > > + u32 next_index;
> > > > > +};
> > > > > +
> > > > IIUC this is used once both on enable/disable and massively on
> > > > vcpu_put/get.
> > > >
> > > > What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead?
> > > > That way we avoid having masking operations in put/get as well as any
> > > > possible error we may have formatting those.
> > > >
> > > > The cost is doing those operations once for enable and once for disable,
> > > > which should be fine.
> > Hi Tian,
> >
> > >
> > > Thanks for the suggestion. I actually started with storing the raw
> > > system register values, as you proposed.
> > >
> > > However, after discussing it with Oliver Upton in v1, we felt that
> > > keeping the base address, size, and index as separate fields makes the
> > > state easier to understand.
> > >
> > > Discussion link:
> > > https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/
> > >
> > > That's why I ended up changing the storage approach in the end.
> > >
> > >
> > Humm, FWIW I disagree with the above argument.
> > I would argue that vcpu_put should save the registers, and not
> > actually know what they are about or how they are formatted at this point.
> >
> > The responsibility of understanding its fields and usage should be
> > in the code that actually uses it.
> >
> > IIUC on kvm_vcpu_put_vhe and kvm_vcpu_load_vhe there are calls to other
> > functions beyond just saving the register as it is.
>
>
> ok, thx! I'll update the struct to store only the raw register values.
Awesome :)
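[A minimal sketch of the raw-register storage agreed above: keep only the
HDBSSBR_EL2 / HDBSSPROD_EL2 values in the vcpu state and compose/extract
fields only where they are needed. SZ is taken as BR bits [3:0] and BADDR as
bits [55:12+SZ], following the patch's HDBSSBR_EL2() macro; the PROD INDEX
width used here is an assumption for illustration, not the architectural
definition.]

```c
#include <assert.h>
#include <stdint.h>

/* vcpu-side state: just the raw register images */
struct vcpu_hdbss_state {
	uint64_t br_el2;	/* raw HDBSSBR_EL2 */
	uint64_t prod_el2;	/* raw HDBSSPROD_EL2 */
};

#define BR_SZ_MASK	0xfULL		/* SZ field, bits [3:0] */
#define PROD_IDX_MASK	0x7ffffULL	/* assumed INDEX field width */

/* Compose a BR value from base address and size, mirroring HDBSSBR_EL2():
 * BADDR occupies bits [55:12+sz], SZ occupies bits [3:0]. */
static uint64_t hdbssbr(uint64_t baddr, uint64_t sz)
{
	uint64_t baddr_mask = ((1ULL << (56 - (12 + sz))) - 1) << (12 + sz);

	return (baddr & baddr_mask) | (sz & BR_SZ_MASK);
}

static uint64_t hdbssbr_sz(uint64_t br)
{
	return br & BR_SZ_MASK;
}

static uint64_t hdbssprod_idx(uint64_t prod)
{
	return prod & PROD_IDX_MASK;
}
```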
>
>
> > > > > struct vncr_tlb;
> > > > >
> > > > > struct kvm_vcpu_arch {
> > > > > @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
> > > > >
> > > > > /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> > > > > struct vncr_tlb *vncr_tlb;
> > > > > +
> > > > > + /* HDBSS registers info */
> > > > > + struct vcpu_hdbss_state hdbss;
> > > > > };
> > > > >
> > > > > /*
> > > > > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > > > > index d968aca0461a..3fea8cfe8869 100644
> > > > > --- a/arch/arm64/include/asm/kvm_mmu.h
> > > > > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > > > @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > > >
> > > > > int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > > > > int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> > > > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
> > > > >
> > > > > phys_addr_t kvm_mmu_get_httbr(void);
> > > > > phys_addr_t kvm_get_idmap_vector(void);
> > > > > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> > > > > index f4436ecc630c..d11f4d0dd4e7 100644
> > > > > --- a/arch/arm64/include/asm/sysreg.h
> > > > > +++ b/arch/arm64/include/asm/sysreg.h
> > > > > @@ -1039,6 +1039,17 @@
> > > > >
> > > > > #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> > > > > GCS_CAP_VALID_TOKEN)
> > > > > +
> > > > > +/*
> > > > > + * Definitions for the HDBSS feature
> > > > > + */
> > > > > +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> > > > > +
> > > > > +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> > > > > + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> > > > > +
> > > > > +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> > > > > +
> > > > > /*
> > > > > * Definitions for GICv5 instructions
> > > > > */
> > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > > index 29f0326f7e00..d64da05e25c4 100644
> > > > > --- a/arch/arm64/kvm/arm.c
> > > > > +++ b/arch/arm64/kvm/arm.c
> > > > > @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> > > > > return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> > > > > }
> > > > >
> > > > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + struct page *hdbss_pg;
> > > > > +
> > > > > + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> > > > > + if (hdbss_pg)
> > > > > + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> > > > > +
> > > > > + vcpu->arch.hdbss.size = 0;
> > > > > +}
> > > > > +
> > > > > +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> > > > > + struct kvm_enable_cap *cap)
> > > > > +{
> > > > > + unsigned long i;
> > > > > + struct kvm_vcpu *vcpu;
> > > > > + struct page *hdbss_pg = NULL;
> > > > > + __u64 size = cap->args[0];
> > > > > + bool enable = cap->args[1] ? true : false;
> > > > > +
> > > > > + if (!system_supports_hdbss())
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (size > HDBSS_MAX_SIZE)
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> > > > > + return 0;
> > > > > +
> > > > > + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (!enable) { /* Turn it off */
> > > > > + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> > > > > +
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + /* Kick vcpus to flush hdbss buffer. */
> > > > > + kvm_vcpu_kick(vcpu);
> > > > > +
> > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > + }
> > > > > +
> > > > > + kvm->arch.enable_hdbss = false;
> > > > > +
> > > > > + return 0;
> > > > > + }
> > > > > +
> > > > > + /* Turn it on */
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> > > > > + if (!hdbss_pg)
> > > > > + goto error_alloc;
> > > > > +
> > > > > + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> > > > > + .base_phys = page_to_phys(hdbss_pg),
> > > > > + .size = size,
> > > > > + .next_index = 0,
> > > > > + };
> > > > > + }
> > > > > +
> > > > > + kvm->arch.enable_hdbss = true;
> > > > > + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> > > > > +
> > > > > + /*
> > > > > + * We should kick vcpus out of guest mode here to load new
> > > > > + * vtcr value to vtcr_el2 register when re-enter guest mode.
> > > > > + */
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm)
> > > > > + kvm_vcpu_kick(vcpu);
> > > > > +
> > > > > + return 0;
> > > > > +
> > > > > +error_alloc:
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + if (vcpu->arch.hdbss.base_phys)
> > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > + }
> > > > > +
> > > > > + return -ENOMEM;
> > > > > +}
> > > > > +
> > > > > int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > > > struct kvm_enable_cap *cap)
> > > > > {
> > > > > @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > > > r = 0;
> > > > > set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > > > > break;
> > > > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > > > + mutex_lock(&kvm->lock);
> > > > > + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> > > > > + mutex_unlock(&kvm->lock);
> > > > > + break;
> > > > > default:
> > > > > break;
> > > > > }
> > > > > @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > > > r = kvm_supports_cacheable_pfnmap();
> > > > > break;
> > > > >
> > > > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > > > + r = system_supports_hdbss();
> > > > > + break;
> > > > > default:
> > > > > r = 0;
> > > > > }
> > > > > @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > > > if (kvm_dirty_ring_check_request(vcpu))
> > > > > return 0;
> > > > >
> > > > > + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > I am curious why we need a flush-hdbss request.
> > > > Don't we have the flush function happening every time we run vcpu_put?
> > > >
> > > > Oh, I see, you want to check if there is anything needed inside the inner
> > > > loop of vcpu_run, without having to vcpu_put. I think it makes sense.
> > > >
> > > > But instead of having this on guest entry, doesn't it make more sense to
> > > > have it in guest exit? This way we flush every time (if needed) we exit the
> > > > guest, and instead of having a vcpu request, we just require a vcpu kick
> > > > and it should flush if needed.
> > > >
> > > > Maybe have vcpu_put just save the registers, and add the flush before
> > > > handle_exit.
> > > >
> > > > What do you think?
> > >
> > > Thank you for the feedback.
> > >
> > > Indeed, in the initial version (v1), I placed the flush operation inside
> > > handle_exit and used a vcpu_kick in kvm_arch_sync_dirty_log to trigger
> > > the flush of the HDBSS buffer.
> > >
> > > However, during the review, Marc pointed out that calling this function
> > > on every exit event is too frequent if it's not always needed.
> > >
> > > Discussion link:
> > > https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/
> > >
> > > I agreed with his assessment. Therefore, in the current version, I've
> > > separated the flush operation into more specific and less frequent
> > > points:
> > >
> > > 1. In vcpu_put.
> > >
> > > 2. During dirty log synchronization, by kicking the vCPU to trigger a
> > >    request that flushes on its next exit.
> > >
> > > 3. When handling a specific HDBSSF event.
> > >
> > > This ensures the flush happens only when necessary, avoiding the
> > > overhead of doing it on every guest exit.
> > >
> > Fair enough, calling it every time you go in the inner loop may be too
> > much, even with a check to make sure it needs to run.
> >
> > Having it as a request means you may do that sometimes without
> > leaving the inner loop. That could be useful if you want to use it with the
> > IRQ handler to deal with a full buffer, or any error, as well as dealing
> > with a regular request in the 2nd case.
> >
> > While I agree it needs to run before leaving guest context (i.e. leaving
> > the inner loop), I am not really sure vcpu_put is the best place to put the
> > flushing. I may be wrong, but to me it looks more like a place to save
> > registers and context, as well as dropping refcounts or something like
> > that. I would not expect a flush happening in vcpu_put, if I was reading
> > the code.
> >
> > Would it be too bad if we had it in a call before vcpu_put, at
> > kvm_arch_vcpu_ioctl_run()?
> >
>
> Thanks for the clarification. After looking again at the code paths, I
> agree that kvm_vcpu_put_vhe() and kvm_arch_vcpu_put() are really meant to
> be pure save/restore paths, so embedding HDBSS flushing there isn't ideal.
>
> My remaining concern is that kvm_arch_vcpu_ioctl_run() doesn't cover the
> case where the vCPU is scheduled out. In that case we still leave guest
> context, but we don't return through the run loop, so a flush placed only
> in ioctl_run() wouldn't run.
>
> Any suggestions on where this should hook in?
You mention that on vcpu_put it works, right? Maybe it's worth tracking
down which vcpu_put callers it would make sense to flush before.
I found that vcpu_put is called only in vcpu_run, but its arch-specific
version is called in:
kvm_debug_handle_oslar : put and load in sequence, flush shouldn't be needed
kvm_emulate_nested_eret : same as above, not sure if nested will be supported
kvm_inject_nested : same as above
kvm_reset_vcpu : put and load in sequence, not needed [1]
kvm_sched_out : that runs on sched-out, which makes sense for us
vcpu_put : called only by vcpu_run, where we already planned to use it
Which brings us another benefit of not having it in vcpu_put: the flush
could otherwise end up running in functions that are supposed to be fast,
and keeping it outside vcpu_put lets us decide when we need it.
So, having the flush happen in kvm_arch_vcpu_ioctl_run() and
kvm_sched_out() should be enough, on top of the per-request approach you
mentioned before.
[1]: For kvm_reset_vcpu, I am not sure how often it happens, or whether
flushing there would be interesting. It seems to be called on init, where
there would be nothing to flush, and from a path that is restoring the
registers, so I think it should be safe not to flush here.
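[The placement suggested above can be sketched as a toy model: flush on the
two paths that actually leave guest context for a while, the run-loop exit
in kvm_arch_vcpu_ioctl_run() and the preemption path in kvm_sched_out(),
instead of burying the flush in vcpu_put. All names below are illustrative
stand-ins, not the real KVM hooks.]

```c
#include <assert.h>

struct vcpu {
	int pending_entries;	/* entries sitting in the HDBSS buffer */
	int flushes;
};

static void flush_hdbss(struct vcpu *v)
{
	if (v->pending_entries) {	/* cheap no-op when nothing to do */
		v->pending_entries = 0;
		v->flushes++;
	}
}

/* Preemption path: flush before the context is saved. */
static void sched_out(struct vcpu *v)
{
	flush_hdbss(v);
	/* ... then save registers via vcpu_put machinery ... */
}

/* Run-loop exit path: flush before returning to userspace. */
static void ioctl_run_exit(struct vcpu *v)
{
	flush_hdbss(v);
	/* ... then vcpu_put ... */
}
```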
> Would introducing a small arch-specific "guest exit" helper, invoked from
> kvm_arch_vcpu_put(), be acceptable?
>
IIUC that would be the same as the previous one: we would have the flush
happening inside a function that is supposed to save registers.
Thanks!
Leo
>
> Thanks!
>
> Tian
>
>
> >
> > > > > +
> > > > > check_nested_vcpu_requests(vcpu);
> > > > > }
> > > > >
> > > > > @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
> > > > >
> > > > > void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> > > > > {
> > > > > + /*
> > > > > + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> > > > > + * before reporting dirty_bitmap to userspace. Send a request with
> > > > > + * KVM_REQUEST_WAIT to flush buffer synchronously.
> > > > > + */
> > > > > + struct kvm_vcpu *vcpu;
> > > > > +
> > > > > + if (!kvm->arch.enable_hdbss)
> > > > > + return;
> > > > >
> > > > > + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> > > > > }
> > > > >
> > > > > static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> > > > > diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > index 9db3f11a4754..600cbc4f8ae9 100644
> > > > > --- a/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> > > > > local_irq_restore(flags);
> > > > > }
> > > > >
> > > > > +static void __load_hdbss(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + struct kvm *kvm = vcpu->kvm;
> > > > > + u64 br_el2, prod_el2;
> > > > > +
> > > > > + if (!kvm->arch.enable_hdbss)
> > > > > + return;
> > > > > +
> > > > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > > > + prod_el2 = vcpu->arch.hdbss.next_index;
> > > > > +
> > > > > + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> > > > > + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> > > > > +
> > > > > + isb();
> > > > > +}
> > > > > +
> > > > > void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > > > {
> > > > > host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> > > > > @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > > > __vcpu_load_switch_sysregs(vcpu);
> > > > > __vcpu_load_activate_traps(vcpu);
> > > > > __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> > > > > + __load_hdbss(vcpu);
> > > > > }
> > > > >
> > > > > void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> > > > > {
> > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > __vcpu_put_deactivate_traps(vcpu);
> > > > > __vcpu_put_switch_sysregs(vcpu);
> > > > >
> > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > > index 070a01e53fcb..42b0710a16ce 100644
> > > > > --- a/arch/arm64/kvm/mmu.c
> > > > > +++ b/arch/arm64/kvm/mmu.c
> > > > > @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > > > if (writable)
> > > > > prot |= KVM_PGTABLE_PROT_W;
> > > > >
> > > > > + if (writable && kvm->arch.enable_hdbss && logging_active)
> > > > > + prot |= KVM_PGTABLE_PROT_DBM;
> > > > > +
> > > > > if (exec_fault)
> > > > > prot |= KVM_PGTABLE_PROT_X;
> > > > >
> > > > > @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + int idx, curr_idx;
> > > > > + u64 br_el2;
> > > > > + u64 *hdbss_buf;
> > > > > + struct kvm *kvm = vcpu->kvm;
> > > > > +
> > > > > + if (!kvm->arch.enable_hdbss)
> > > > > + return;
> > > > > +
> > > > > + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> > > > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > > > +
> > > > > + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> > > > > + if (curr_idx == 0 || br_el2 == 0)
> > > > > + return;
> > > > > +
> > > > > + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> > > > > + if (!hdbss_buf)
> > > > > + return;
> > > > > +
> > > > > + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> > > > > + for (idx = 0; idx < curr_idx; idx++) {
> > > > > + u64 gpa;
> > > > > +
> > > > > + gpa = hdbss_buf[idx];
> > > > > + if (!(gpa & HDBSS_ENTRY_VALID))
> > > > > + continue;
> > > > > +
> > > > > + gpa &= HDBSS_ENTRY_IPA;
> > > > > + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> > > > > + }
> > > > This will mark a page dirty for both dirty_bitmap or dirty_ring, depending
> > > > on what is in use.
> > > >
> > > > Out of plain curiosity, have you planned / tested for the dirty-ring as
> > > > well, or just for dirty-bitmap?
> > >
> > > Currently, I have only tested this with dirty-bitmap mode.
> > >
> > >
> > > I will test and ensure the HDBSS feature works correctly with dirty-ring in
> > > the next version.
> > >
> > >
> > Thanks!
> >
> > > > > +
> > > > > + /* reset HDBSS index */
> > > > > + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> > > > > + vcpu->arch.hdbss.next_index = 0;
> > > > > + isb();
> > > > > +}
> > > > > +
> > > > > +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + u64 prod;
> > > > > + u64 fsc;
> > > > > +
> > > > > + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> > > > > + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> > > > > +
> > > > > + switch (fsc) {
> > > > > + case HDBSSPROD_EL2_FSC_OK:
> > > > > + /* Buffer full, which is reported as permission fault. */
> > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > + return 1;
> > > > Humm, flushing in a fault handler means hanging there, in IRQ context, for
> > > > a while.
> > > >
> > > > Since we already deal with this on guest_exit (vcpu_put IIUC), why not just
> > > > return in a way the vcpu has to exit the inner loop and let it flush there
> > > > instead?
> > > >
> > > > Thanks!
> > > > Leo
> > >
> > > Thanks for the feedback.
> > >
> > > If we flush on every guest exit (by moving the flush before handle_exit),
> > > then we can indeed drop the flush from the fault handler and from
> > > vcpu_put.
> > >
> > > However, given Marc's earlier concern about not imposing this overhead on
> > > all vCPUs, I'd rather avoid flushing on every exit.
> > >
> > > My current plan is to set a request bit in kvm_handle_hdbss_fault (via
> > > kvm_make_request), and move the actual flush to the normal exit path,
> > > where it can execute in a safe context. This also allows us to remove
> > > the flush from the fault handler entirely.
> > >
> > > Does that approach sound reasonable to you?
> > >
> > >
> > Yes, I think it looks much better, as the fault will cause the guest to
> > exit, and it can run the flush on its way back in.
> >
> > Thanks!
> > Leo
> >
> > > > > + case HDBSSPROD_EL2_FSC_ExternalAbort:
> > > > > + case HDBSSPROD_EL2_FSC_GPF:
> > > > > + return -EFAULT;
> > > > > + default:
> > > > > + /* Unknown fault. */
> > > > > + WARN_ONCE(1,
> > > > > + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> > > > > + fsc, prod, vcpu->vcpu_id);
> > > > > + return -EFAULT;
> > > > > + }
> > > > > +}
> > > > > +
> > > > > /**
> > > > > * kvm_handle_guest_abort - handles all 2nd stage aborts
> > > > > * @vcpu: the VCPU pointer
> > > > > @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > > >
> > > > > is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
> > > > >
> > > > > + if (esr_iss2_is_hdbssf(esr))
> > > > > + return kvm_handle_hdbss_fault(vcpu);
> > > > > +
> > > > > if (esr_fsc_is_translation_fault(esr)) {
> > > > > /* Beyond sanitised PARange (which is the IPA limit) */
> > > > > if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> > > > > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> > > > > index 959532422d3a..c03a4b310b53 100644
> > > > > --- a/arch/arm64/kvm/reset.c
> > > > > +++ b/arch/arm64/kvm/reset.c
> > > > > @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> > > > > free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> > > > > kfree(vcpu->arch.vncr_tlb);
> > > > > kfree(vcpu->arch.ccsidr);
> > > > > +
> > > > > + if (vcpu->kvm->arch.enable_hdbss)
> > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > }
> > > > >
> > > > > static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> > > > > --
> > > > > 2.33.0
> > > Thanks!
> > >
> > > Tian
> > >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-12 12:06 ` Leonardo Bras
@ 2026-03-12 13:13 ` Tian Zheng
2026-03-12 14:58 ` Leonardo Bras
0 siblings, 1 reply; 24+ messages in thread
From: Tian Zheng @ 2026-03-12 13:13 UTC (permalink / raw)
To: Leonardo Bras
Cc: maz, oupton, catalin.marinas, corbet, pbonzini, will, yuzenghui,
wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2, linuxarm,
joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose
On 3/12/2026 8:06 PM, Leonardo Bras wrote:
> On Thu, Mar 12, 2026 at 02:17:41PM +0800, Tian Zheng wrote:
>> On 3/6/2026 11:01 PM, Leonardo Bras wrote:
>>> On Fri, Mar 06, 2026 at 05:27:58PM +0800, Tian Zheng wrote:
>>>> Hi Leo,
>>>>
>>>> On 3/4/2026 11:40 PM, Leonardo Bras wrote:
>>>>> Hi Tian,
>>>>>
>>>>> Few extra notes/questions below
>>>>>
>>>>> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>>>>>> From: eillon<yezhenyu2@huawei.com>
>>>>>>
>>>>>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
>>>>>> migration. This feature is only supported in VHE mode.
>>>>>>
>>>>>> Initially, S2 PTEs don't contain the DBM attribute. During migration,
>>>>>> write faults are handled by user_mem_abort, which relaxes permissions
>>>>>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
>>>>>> writes no longer trap, as the hardware automatically transitions the page
>>>>>> from writable-clean to writable-dirty.
>>>>>>
>>>>>> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
>>>>>> enabled, the hardware observes the clean->dirty transition and records
>>>>>> the corresponding page into the HDBSS buffer.
>>>>>>
>>>>>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
>>>>>> that check_vcpu_requests flushes the HDBSS buffer and propagates the
>>>>>> accumulated dirty information into the userspace-visible dirty bitmap.
>>>>>>
>>>>>> Add fault handling for HDBSS including buffer full, external abort, and
>>>>>> general protection fault (GPF).
>>>>>>
>>>>>> Signed-off-by: eillon<yezhenyu2@huawei.com>
>>>>>> Signed-off-by: Tian Zheng<zhengtian10@huawei.com>
>>>>>> ---
>>>>>> arch/arm64/include/asm/esr.h | 5 ++
>>>>>> arch/arm64/include/asm/kvm_host.h | 17 +++++
>>>>>> arch/arm64/include/asm/kvm_mmu.h | 1 +
>>>>>> arch/arm64/include/asm/sysreg.h | 11 ++++
>>>>>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
>>>>>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
>>>>>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
>>>>>> arch/arm64/kvm/reset.c | 3 +
>>>>>> 8 files changed, 228 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>>>>>> index 81c17320a588..2e6b679b5908 100644
>>>>>> --- a/arch/arm64/include/asm/esr.h
>>>>>> +++ b/arch/arm64/include/asm/esr.h
>>>>>> @@ -437,6 +437,11 @@
>>>>>> #ifndef __ASSEMBLER__
>>>>>> #include <asm/types.h>
>>>>>>
>>>>>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
>>>>>> +{
>>>>>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
>>>>>> +}
>>>>>> +
>>>>>> static inline unsigned long esr_brk_comment(unsigned long esr)
>>>>>> {
>>>>>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
>>>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>>>> index 5d5a3bbdb95e..57ee6b53e061 100644
>>>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>>>> @@ -55,12 +55,17 @@
>>>>>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
>>>>>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
>>>>>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
>>>>>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>>>>>>
>>>>>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>>>>>> KVM_DIRTY_LOG_INITIALLY_SET)
>>>>>>
>>>>>> #define KVM_HAVE_MMU_RWLOCK
>>>>>>
>>>>>> +/* HDBSS entry field definitions */
>>>>>> +#define HDBSS_ENTRY_VALID BIT(0)
>>>>>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>>>>>> +
>>>>>> /*
>>>>>> * Mode of operation configurable with kvm-arm.mode early param.
>>>>>> * See Documentation/admin-guide/kernel-parameters.txt for more information.
>>>>>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
>>>>>> u32 __attribute_const__ kvm_target_cpu(void);
>>>>>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>>>>>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>>>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>>>>>>
>>>>>> struct kvm_hyp_memcache {
>>>>>> phys_addr_t head;
>>>>>> @@ -405,6 +411,8 @@ struct kvm_arch {
>>>>>> * the associated pKVM instance in the hypervisor.
>>>>>> */
>>>>>> struct kvm_protected_vm pkvm;
>>>>>> +
>>>>>> + bool enable_hdbss;
>>>>>> };
>>>>>>
>>>>>> struct kvm_vcpu_fault_info {
>>>>>> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
>>>>>> bool reset;
>>>>>> };
>>>>>>
>>>>>> +struct vcpu_hdbss_state {
>>>>>> + phys_addr_t base_phys;
>>>>>> + u32 size;
>>>>>> + u32 next_index;
>>>>>> +};
>>>>>> +
>>>>> IIUC this is used once each on enable/disable, and heavily on
>>>>> vcpu_put/get.
>>>>>
>>>>> What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead?
>>>>> That way we avoid having masking operations in put/get as well as any
>>>>> possible error we may have formatting those.
>>>>>
>>>>> The cost is doing those operations once for enable and once for disable,
>>>>> which should be fine.
>>> Hi Tian,
>>>
>>>> Thanks for the suggestion. I actually started with storing the raw system
>>>> register values, as you proposed.
>>>>
>>>> However, after discussing it with Oliver Upton in v1, we felt that keeping
>>>> the base address, size, and index as separate fields makes the state
>>>> easier to understand.
>>>>
>>>> Discussion link:
>>>> https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/
>>>>
>>>> That's why I ended up changing the storage approach in the end.
>>>>
>>> Humm, FWIW I disagree with the above argument.
>>> I would argue that vcpu_put should save the registers, and not
>>> actually know what they are about or how they are formatted at this point.
>>>
>>> The responsibility of understanding its fields and usage should be
>>> in the code that actually uses it.
>>>
>>> IIUC, kvm_vcpu_put_vhe and kvm_vcpu_load_vhe already call functions
>>> that do more than just save the register as-is.
>>
>> ok, thx! I'll update the struct to store only the raw register values.
> Awesome :)
>
>>
>>>>>> struct vncr_tlb;
>>>>>>
>>>>>> struct kvm_vcpu_arch {
>>>>>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>>>>>>
>>>>>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
>>>>>> struct vncr_tlb *vncr_tlb;
>>>>>> +
>>>>>> + /* HDBSS registers info */
>>>>>> + struct vcpu_hdbss_state hdbss;
>>>>>> };
>>>>>>
>>>>>> /*
>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>>>>> index d968aca0461a..3fea8cfe8869 100644
>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>>>>>
>>>>>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>>>>>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>>>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>>>>>>
>>>>>> phys_addr_t kvm_mmu_get_httbr(void);
>>>>>> phys_addr_t kvm_get_idmap_vector(void);
>>>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>>>> index f4436ecc630c..d11f4d0dd4e7 100644
>>>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>>>> @@ -1039,6 +1039,17 @@
>>>>>>
>>>>>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
>>>>>> GCS_CAP_VALID_TOKEN)
>>>>>> +
>>>>>> +/*
>>>>>> + * Definitions for the HDBSS feature
>>>>>> + */
>>>>>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
>>>>>> +
>>>>>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
>>>>>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
>>>>>> +
>>>>>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
>>>>>> +
>>>>>> /*
>>>>>> * Definitions for GICv5 instructions
>>>>>> */
>>>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>>>> index 29f0326f7e00..d64da05e25c4 100644
>>>>>> --- a/arch/arm64/kvm/arm.c
>>>>>> +++ b/arch/arm64/kvm/arm.c
>>>>>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>>>>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>>>>>> }
>>>>>>
>>>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + struct page *hdbss_pg;
>>>>>> +
>>>>>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
>>>>>> + if (hdbss_pg)
>>>>>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
>>>>>> +
>>>>>> + vcpu->arch.hdbss.size = 0;
>>>>>> +}
>>>>>> +
>>>>>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
>>>>>> + struct kvm_enable_cap *cap)
>>>>>> +{
>>>>>> + unsigned long i;
>>>>>> + struct kvm_vcpu *vcpu;
>>>>>> + struct page *hdbss_pg = NULL;
>>>>>> + __u64 size = cap->args[0];
>>>>>> + bool enable = cap->args[1] ? true : false;
>>>>>> +
>>>>>> + if (!system_supports_hdbss())
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (size > HDBSS_MAX_SIZE)
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
>>>>>> + return 0;
>>>>>> +
>>>>>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (!enable) { /* Turn it off */
>>>>>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
>>>>>> +
>>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> + /* Kick vcpus to flush hdbss buffer. */
>>>>>> + kvm_vcpu_kick(vcpu);
>>>>>> +
>>>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>> + }
>>>>>> +
>>>>>> + kvm->arch.enable_hdbss = false;
>>>>>> +
>>>>>> + return 0;
>>>>>> + }
>>>>>> +
>>>>>> + /* Turn it on */
>>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
>>>>>> + if (!hdbss_pg)
>>>>>> + goto error_alloc;
>>>>>> +
>>>>>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
>>>>>> + .base_phys = page_to_phys(hdbss_pg),
>>>>>> + .size = size,
>>>>>> + .next_index = 0,
>>>>>> + };
>>>>>> + }
>>>>>> +
>>>>>> + kvm->arch.enable_hdbss = true;
>>>>>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
>>>>>> +
>>>>>> + /*
>>>>>> + * We should kick vcpus out of guest mode here to load new
>>>>>> + * vtcr value to vtcr_el2 register when re-enter guest mode.
>>>>>> + */
>>>>>> + kvm_for_each_vcpu(i, vcpu, kvm)
>>>>>> + kvm_vcpu_kick(vcpu);
>>>>>> +
>>>>>> + return 0;
>>>>>> +
>>>>>> +error_alloc:
>>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> + if (vcpu->arch.hdbss.base_phys)
>>>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>> + }
>>>>>> +
>>>>>> + return -ENOMEM;
>>>>>> +}
>>>>>> +
>>>>>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>>> struct kvm_enable_cap *cap)
>>>>>> {
>>>>>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>>> r = 0;
>>>>>> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>>>>>> break;
>>>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>>>> + mutex_lock(&kvm->lock);
>>>>>> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
>>>>>> + mutex_unlock(&kvm->lock);
>>>>>> + break;
>>>>>> default:
>>>>>> break;
>>>>>> }
>>>>>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>>> r = kvm_supports_cacheable_pfnmap();
>>>>>> break;
>>>>>>
>>>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>>>> + r = system_supports_hdbss();
>>>>>> + break;
>>>>>> default:
>>>>>> r = 0;
>>>>>> }
>>>>>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>>>> if (kvm_dirty_ring_check_request(vcpu))
>>>>>> return 0;
>>>>>>
>>>>>> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>>>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>>> I am curious on why we need a flush-hdbss request,
>>>>> Don't we have the flush function happening every time we run vcpu_put?
>>>>>
>>>>> Oh, I see, you want to check if there is anything needed inside the inner
>>>>> loop of vcpu_run, without having to vcpu_put. I think it makes sense.
>>>>>
>>>>> But instead of having this on guest entry, does not it make more sense to
>>>>> have it in guest exit? This way we flush every time (if needed) we exit the
>>>>> guest, and instead of having a vcpu request, we just require a vcpu kick
>>>>> and it should flush if needed.
>>>>>
>>>>> Maybe have vcpu_put just save the registers, and add the flush before
>>>>> handle_exit.
>>>>>
>>>>> What do you think?
>>>> Thank you for the feedback.
>>>>
>>>> Indeed, in the initial version (v1), I placed the flush operation inside
>>>> handle_exit and used a vcpu_kick in kvm_arch_sync_dirty_log to trigger
>>>> the flush of the HDBSS buffer.
>>>>
>>>> However, during the review, Marc pointed out that calling this function
>>>> on every exit event is too frequent if it's not always needed.
>>>>
>>>> Discussion link:
>>>> https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/
>>>>
>>>> I agreed with his assessment. Therefore, in the current version, I've
>>>> separated the flush operation into more specific and less frequent points:
>>>>
>>>> 1. In vcpu_put.
>>>>
>>>> 2. During dirty log synchronization, by kicking the vCPU to trigger a
>>>> request that flushes on its next exit.
>>>>
>>>> 3. When handling a specific HDBSSF event.
>>>>
>>>> This ensures the flush happens only when necessary, avoiding the overhead
>>>> of doing it on every guest exit.
>>>>
>>> Fair enough, calling it every time you go through the inner loop may be
>>> too much, even with a check to make sure it needs to run.
>>>
>>> Having it as a request means you may do that sometimes without
>>> leaving the inner loop. That could be useful if you want to use it with
>>> the IRQ handler to deal with a full buffer, or any error, as well as
>>> dealing with a regular request in the 2nd case.
>>>
>>> While I agree it's needed to run before leaving guest context (i.e.
>>> leaving the inner loop), I am not really sure vcpu_put is the best place
>>> to put the flushing. I may be wrong, but to me it looks more like a place
>>> to save registers and context, as well as dropping refcounts or something
>>> like that. I would not expect a flush happening in vcpu_put, if I was
>>> reading the code.
>>>
>>> Would it be too bad if we had it in a call before vcpu_put, at
>>> kvm_arch_vcpu_ioctl_run()?
>>>
>> Thanks for the clarification. After looking again at the code paths, I
>> agree that kvm_vcpu_put_vhe() and kvm_arch_vcpu_put() are really meant to
>> be pure save/restore paths, so embedding HDBSS flushing there isn't ideal.
>>
>> My remaining concern is that kvm_arch_vcpu_ioctl_run() doesn't cover the
>> case where the vCPU is scheduled out. In that case we still leave guest
>> context, but we don't return through the run loop, so a flush placed only
>> in ioctl_run() wouldn't run.
>>
>> Any suggestions on where this should hook in?
> You mention that on vcpu_put it works, right? Maybe it's worth tracking
> down which vcpu_put users would make sense to flush before it's called.
>
> I found that vcpu_put is called only in vcpu_run, but its
> arch-specific version is called in:
>
> kvm_debug_handle_oslar : put and load in sequence, flush shouldn't be needed
> kvm_emulate_nested_eret : same as above, not sure if nested will be supported
> kvm_inject_nested : same as above
> kvm_reset_vcpu : put and load in sequence, not needed [1]
> kvm_sched_out : that's run on sched-out, that makes sense for us
> vcpu_put : called only by vcpu_run, where we already planned to use it
>
> Which brings us another benefit of not having that in vcpu_put: the flush
> could otherwise happen in functions that should run fast, and having
> it outside vcpu_put allows us to decide if we need it.
>
> So, having the flush happening in kvm_arch_vcpu_ioctl_run() and
> kvm_sched_out() should be enough, on top of the per-request you
> mentioned before.
>
> [1]: on vcpu_reset, I am not sure how often this happens, or whether
> flushing there would be interesting. It seems to be called on init, so it's
> not a case where there would be something to flush, and it's restoring the
> registers anyway. So I think it should be safe to not flush here; it's a
> question of 'maybe being interesting', which I am not sure about.
Got it, makes sense.
HDBSS doesn't apply to nested or other complex cases, so we only need to
flush in kvm_arch_vcpu_ioctl_run() and kvm_sched_out().
I'll update the code accordingly and test to make sure we haven't missed
any edge cases.
Thanks!
Tian
>> Would introducing a small arch-specific "guest exit" helper, invoked from
>> kvm_arch_vcpu_put(), be acceptable?
>>
> IIUC that would be the same as the previous one: we would have the flush
> happening inside a function that is supposed to save registers.
>
> Thanks!
> Leo
>
>
>> Thanks!
>>
>> Tian
>>
>>
>>>>>> +
>>>>>> check_nested_vcpu_requests(vcpu);
>>>>>> }
>>>>>>
>>>>>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>>>>>
>>>>>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>>>>>> {
>>>>>> + /*
>>>>>> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
>>>>>> + * before reporting dirty_bitmap to userspace. Send a request with
>>>>>> + * KVM_REQUEST_WAIT to flush buffer synchronously.
>>>>>> + */
>>>>>> + struct kvm_vcpu *vcpu;
>>>>>> +
>>>>>> + if (!kvm->arch.enable_hdbss)
>>>>>> + return;
>>>>>>
>>>>>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
>>>>>> }
>>>>>>
>>>>>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>>>>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> index 9db3f11a4754..600cbc4f8ae9 100644
>>>>>> --- a/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
>>>>>> local_irq_restore(flags);
>>>>>> }
>>>>>>
>>>>>> +static void __load_hdbss(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + struct kvm *kvm = vcpu->kvm;
>>>>>> + u64 br_el2, prod_el2;
>>>>>> +
>>>>>> + if (!kvm->arch.enable_hdbss)
>>>>>> + return;
>>>>>> +
>>>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>>>> + prod_el2 = vcpu->arch.hdbss.next_index;
>>>>>> +
>>>>>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
>>>>>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
>>>>>> +
>>>>>> + isb();
>>>>>> +}
>>>>>> +
>>>>>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
>>>>>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>>>> __vcpu_load_switch_sysregs(vcpu);
>>>>>> __vcpu_load_activate_traps(vcpu);
>>>>>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
>>>>>> + __load_hdbss(vcpu);
>>>>>> }
>>>>>>
>>>>>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>>>> __vcpu_put_deactivate_traps(vcpu);
>>>>>> __vcpu_put_switch_sysregs(vcpu);
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>> index 070a01e53fcb..42b0710a16ce 100644
>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>> if (writable)
>>>>>> prot |= KVM_PGTABLE_PROT_W;
>>>>>>
>>>>>> + if (writable && kvm->arch.enable_hdbss && logging_active)
>>>>>> + prot |= KVM_PGTABLE_PROT_DBM;
>>>>>> +
>>>>>> if (exec_fault)
>>>>>> prot |= KVM_PGTABLE_PROT_X;
>>>>>>
>>>>>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + int idx, curr_idx;
>>>>>> + u64 br_el2;
>>>>>> + u64 *hdbss_buf;
>>>>>> + struct kvm *kvm = vcpu->kvm;
>>>>>> +
>>>>>> + if (!kvm->arch.enable_hdbss)
>>>>>> + return;
>>>>>> +
>>>>>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
>>>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>>>> +
>>>>>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
>>>>>> + if (curr_idx == 0 || br_el2 == 0)
>>>>>> + return;
>>>>>> +
>>>>>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
>>>>>> + if (!hdbss_buf)
>>>>>> + return;
>>>>>> +
>>>>>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
>>>>>> + for (idx = 0; idx < curr_idx; idx++) {
>>>>>> + u64 gpa;
>>>>>> +
>>>>>> + gpa = hdbss_buf[idx];
>>>>>> + if (!(gpa & HDBSS_ENTRY_VALID))
>>>>>> + continue;
>>>>>> +
>>>>>> + gpa &= HDBSS_ENTRY_IPA;
>>>>>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>>>>>> + }
>>>>> This will mark a page dirty in either the dirty_bitmap or the
>>>>> dirty_ring, depending on what is in use.
>>>>>
>>>>> Out of plain curiosity, have you planned / tested for the dirty-ring as
>>>>> well, or just for the dirty-bitmap?
>>>> Currently, I have only tested this with dirty-bitmap mode.
>>>>
>>>> I will test and ensure the HDBSS feature works correctly with dirty-ring
>>>> in the next version.
>>>>
>>> Thanks!
>>>
>>>>>> +
>>>>>> + /* reset HDBSS index */
>>>>>> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>>>>>> + vcpu->arch.hdbss.next_index = 0;
>>>>>> + isb();
>>>>>> +}
>>>>>> +
>>>>>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + u64 prod;
>>>>>> + u64 fsc;
>>>>>> +
>>>>>> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>>>>>> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>>>>>> +
>>>>>> + switch (fsc) {
>>>>>> + case HDBSSPROD_EL2_FSC_OK:
>>>>>> + /* Buffer full, which is reported as permission fault. */
>>>>>> + kvm_flush_hdbss_buffer(vcpu);
>>>>>> + return 1;
>>>>> Humm, flushing in a fault handler means hanging there, in IRQ context, for
>>>>> a while.
>>>>>
>>>>> Since we already deal with this on guest exit (vcpu_put, IIUC), why not
>>>>> just return in a way that makes the vcpu exit the inner loop and let it
>>>>> flush there instead?
>>>>>
>>>>> Thanks!
>>>>> Leo
>>>> Thanks for the feedback.
>>>>
>>>> If we flush on every guest exit (by moving the flush before handle_exit),
>>>> then we can indeed drop the flush from the fault handler and from
>>>> vcpu_put.
>>>>
>>>> However, given Marc's earlier concern about not imposing this overhead on
>>>> all vCPUs, I'd rather avoid flushing on every exit.
>>>>
>>>> My current plan is to set a request bit in kvm_handle_hdbss_fault (via
>>>> kvm_make_request), and move the actual flush to the normal exit path,
>>>> where it can execute in a safe context. This also allows us to remove the
>>>> flush from the fault handler entirely.
>>>>
>>>> Does that approach sound reasonable to you?
>>>>
>>>>
>>> Yes, I think it looks much better, as the fault will cause the guest to
>>> exit, and it can run the flush on its way back in.
>>>
>>> Thanks!
>>> Leo
>>>
>>>>>> + case HDBSSPROD_EL2_FSC_ExternalAbort:
>>>>>> + case HDBSSPROD_EL2_FSC_GPF:
>>>>>> + return -EFAULT;
>>>>>> + default:
>>>>>> + /* Unknown fault. */
>>>>>> + WARN_ONCE(1,
>>>>>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
>>>>>> + fsc, prod, vcpu->vcpu_id);
>>>>>> + return -EFAULT;
>>>>>> + }
>>>>>> +}
>>>>>> +
>>>>>> /**
>>>>>> * kvm_handle_guest_abort - handles all 2nd stage aborts
>>>>>> * @vcpu: the VCPU pointer
>>>>>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>>>>
>>>>>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>>>>>
>>>>>> + if (esr_iss2_is_hdbssf(esr))
>>>>>> + return kvm_handle_hdbss_fault(vcpu);
>>>>>> +
>>>>>> if (esr_fsc_is_translation_fault(esr)) {
>>>>>> /* Beyond sanitised PARange (which is the IPA limit) */
>>>>>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
>>>>>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>>>>>> index 959532422d3a..c03a4b310b53 100644
>>>>>> --- a/arch/arm64/kvm/reset.c
>>>>>> +++ b/arch/arm64/kvm/reset.c
>>>>>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>>>>>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>>>>>> kfree(vcpu->arch.vncr_tlb);
>>>>>> kfree(vcpu->arch.ccsidr);
>>>>>> +
>>>>>> + if (vcpu->kvm->arch.enable_hdbss)
>>>>>> + kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>> }
>>>>>>
>>>>>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>>>>>> --
>>>>>> 2.33.0
>>>> Thanks!
>>>>
>>>> Tian
>>>>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-12 13:13 ` Tian Zheng
@ 2026-03-12 14:58 ` Leonardo Bras
0 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras @ 2026-03-12 14:58 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
On Thu, Mar 12, 2026 at 09:13:33PM +0800, Tian Zheng wrote:
>
> On 3/12/2026 8:06 PM, Leonardo Bras wrote:
> > On Thu, Mar 12, 2026 at 02:17:41PM +0800, Tian Zheng wrote:
> > > On 3/6/2026 11:01 PM, Leonardo Bras wrote:
> > > > On Fri, Mar 06, 2026 at 05:27:58PM +0800, Tian Zheng wrote:
> > > > > Hi Leo,
> > > > >
> > > > > On 3/4/2026 11:40 PM, Leonardo Bras wrote:
> > > > > > Hi Tian,
> > > > > >
> > > > > > Few extra notes/questions below
> > > > > >
> > > > > > On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> > > > > > > From: eillon<yezhenyu2@huawei.com>
> > > > > > >
> > > > > > > HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> > > > > > > migration. This feature is only supported in VHE mode.
> > > > > > >
> > > > > > > Initially, S2 PTEs don't contain the DBM attribute. During migration,
> > > > > > > write faults are handled by user_mem_abort, which relaxes permissions
> > > > > > > and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> > > > > > > writes no longer trap, as the hardware automatically transitions the page
> > > > > > > from writable-clean to writable-dirty.
> > > > > > >
> > > > > > > KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
> > > > > > > enabled, the hardware observes the clean->dirty transition and records
> > > > > > > the corresponding page into the HDBSS buffer.
> > > > > > >
> > > > > > > During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> > > > > > > that check_vcpu_requests flushes the HDBSS buffer and propagates the
> > > > > > > accumulated dirty information into the userspace-visible dirty bitmap.
> > > > > > >
> > > > > > > Add fault handling for HDBSS including buffer full, external abort, and
> > > > > > > general protection fault (GPF).
> > > > > > >
> > > > > > > Signed-off-by: eillon<yezhenyu2@huawei.com>
> > > > > > > Signed-off-by: Tian Zheng<zhengtian10@huawei.com>
> > > > > > > ---
> > > > > > > + mutex_unlock(&kvm->lock);
> > > > > > > + break;
> > > > > > > default:
> > > > > > > break;
> > > > > > > }
> > > > > > > @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > > > > > r = kvm_supports_cacheable_pfnmap();
> > > > > > > break;
> > > > > > >
> > > > > > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > > > > > + r = system_supports_hdbss();
> > > > > > > + break;
> > > > > > > default:
> > > > > > > r = 0;
> > > > > > > }
> > > > > > > @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > > > > > if (kvm_dirty_ring_check_request(vcpu))
> > > > > > > return 0;
> > > > > > >
> > > > > > > + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> > > > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > > I am curious why we need a flush-hdbss request.
> > > > > > Don't we already have the flush happening every time we run vcpu_put?
> > > > > >
> > > > > > Oh, I see, you want to check if there is anything needed inside the inner
> > > > > > loop of vcpu_run, without having to vcpu_put. I think it makes sense.
> > > > > >
> > > > > > But instead of having this on guest entry, doesn't it make more sense to
> > > > > > have it on guest exit? This way we flush (if needed) every time we exit
> > > > > > the guest, and instead of having a vcpu request, we just require a vcpu
> > > > > > kick and it should flush if needed.
> > > > > >
> > > > > > Maybe have vcpu_put just save the registers, and add the flush before
> > > > > > handle_exit.
> > > > > >
> > > > > > What do you think?
> > > > > Thank you for the feedback.
> > > > >
> > > > > Indeed, in the initial version (v1), I placed the flush operation inside
> > > > > handle_exit and used a vcpu_kick in kvm_arch_sync_dirty_log to trigger
> > > > > the flush of the HDBSS buffer.
> > > > >
> > > > > However, during the review, Marc pointed out that calling this function
> > > > > on every exit event is too frequent if it's not always needed.
> > > > >
> > > > > Discussion link:
> > > > > https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/
> > > > >
> > > > > I agreed with his assessment. Therefore, in the current version, I've
> > > > > separated the flush operation into more specific and less frequent points:
> > > > >
> > > > > 1. In vcpu_put.
> > > > >
> > > > > 2. During dirty log synchronization, by kicking the vCPU to trigger a
> > > > > request that flushes on its next exit.
> > > > >
> > > > > 3. When handling a specific HDBSSF event.
> > > > >
> > > > > This ensures the flush happens only when necessary, avoiding the
> > > > > overhead of doing it on every guest exit.
> > > > >
> > > > Fair enough, calling it every time you go through the inner loop may be
> > > > too much, even with a check to make sure it needs to run.
> > > >
> > > > Having it as a request means you may do that sometimes without
> > > > leaving the inner loop. That could be useful if you want to use it with the
> > > > IRQ handler to deal with a full buffer, or any error, as well as dealing
> > > > with a regular request in the 2nd case.
> > > >
> > > > While I agree it needs to run before leaving guest context (i.e. leaving
> > > > the inner loop), I am not really sure vcpu_put is the best place to put the
> > > > flushing. I may be wrong, but to me it looks more like a place to save
> > > > registers and context, as well as dropping refcounts or something like
> > > > that. I would not expect a flush to happen in vcpu_put, if I was reading
> > > > the code.
> > > >
> > > > Would it be too bad if we had it in a call before vcpu_put, at
> > > > kvm_arch_vcpu_ioctl_run()?
> > > >
> > > Thanks for the clarification. After looking again at the code paths, I
> > > agree that kvm_vcpu_put_vhe() and kvm_arch_vcpu_put() are really meant to
> > > be pure save/restore paths, so embedding HDBSS flushing there isn't ideal.
> > >
> > > My remaining concern is that kvm_arch_vcpu_ioctl_run() doesn't cover the
> > > case where the vCPU is scheduled out. In that case we still leave guest
> > > context, but we don't return through the run loop, so a flush placed only
> > > in ioctl_run() wouldn't run.
> > >
> > > Any suggestions on where this should hook in?
> > You mentioned that it works in vcpu_put, right? Maybe it's worth tracking
> > down which vcpu_put callers it would make sense to flush before.
> >
> > I found that vcpu_put is called only in vcpu_run, but its arch-specific
> > version is called in:
> >
> > kvm_debug_handle_oslar : put and load in sequence, flush shouldn't be needed
> > kvm_emulate_nested_eret : same as above; not sure if nested will be supported
> > kvm_inject_nested : same as above
> > kvm_reset_vcpu : put and load in sequence, not needed [1]
> > kvm_sched_out : that's run on sched-out, which makes sense for us
> > vcpu_put : called only by vcpu_run, where we already planned to use it
> >
> > Which brings another benefit of not having it in vcpu_put: the flush could
> > otherwise happen in functions that are supposed to run fast, and having it
> > outside vcpu_put lets us decide whether we need it.
> >
> > So, having the flush happen in kvm_arch_vcpu_ioctl_run() and
> > kvm_sched_out() should be enough, on top of the vcpu request you
> > mentioned before.
> >
> > [1]: For kvm_reset_vcpu, I'm not sure how often this happens, or whether
> > flushing there would be interesting as well. It seems to be called on init,
> > where there would be nothing to flush, and when restoring the registers.
> > So I think it's safe not to flush there; it's just a question of whether it
> > 'might be interesting', which I'm not sure about.
>
>
> Got it, makes sense.
>
> HDBSS doesn't apply to nested or other complex cases, so we only need to
> flush in kvm_arch_vcpu_ioctl_run() and kvm_sched_out().
>
> I'll update the code accordingly and test to make sure we haven't missed
> any edge cases.
>
> Thanks!
> Tian
>
Awesome! Thanks!
Leo
> > > Would introducing a small arch-specific "guest exit" helper, invoked
> > > from kvm_arch_vcpu_put(), be acceptable?
> > >
> > IIUC that would be the same as the previous one: we would have the flush
> > happening inside a function that is supposed to save registers.
> >
> > Thanks!
> > Leo
> >
> >
> > > Thanks!
> > >
> > > Tian
> > >
> > >
> > > > > > > +
> > > > > > > check_nested_vcpu_requests(vcpu);
> > > > > > > }
> > > > > > >
> > > > > > > @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
> > > > > > >
> > > > > > > void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> > > > > > > {
> > > > > > > + /*
> > > > > > > + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> > > > > > > + * before reporting dirty_bitmap to userspace. Send a request with
> > > > > > > + * KVM_REQUEST_WAIT to flush buffer synchronously.
> > > > > > > + */
> > > > > > > + struct kvm_vcpu *vcpu;
> > > > > > > +
> > > > > > > + if (!kvm->arch.enable_hdbss)
> > > > > > > + return;
> > > > > > >
> > > > > > > + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> > > > > > > }
> > > > > > >
> > > > > > > static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> > > > > > > diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > > > index 9db3f11a4754..600cbc4f8ae9 100644
> > > > > > > --- a/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > > > +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > > > @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> > > > > > > local_irq_restore(flags);
> > > > > > > }
> > > > > > >
> > > > > > > +static void __load_hdbss(struct kvm_vcpu *vcpu)
> > > > > > > +{
> > > > > > > + struct kvm *kvm = vcpu->kvm;
> > > > > > > + u64 br_el2, prod_el2;
> > > > > > > +
> > > > > > > + if (!kvm->arch.enable_hdbss)
> > > > > > > + return;
> > > > > > > +
> > > > > > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > > > > > + prod_el2 = vcpu->arch.hdbss.next_index;
> > > > > > > +
> > > > > > > + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> > > > > > > + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> > > > > > > +
> > > > > > > + isb();
> > > > > > > +}
> > > > > > > +
> > > > > > > void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > > > > > {
> > > > > > > host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> > > > > > > @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > > > > > > __vcpu_load_switch_sysregs(vcpu);
> > > > > > > __vcpu_load_activate_traps(vcpu);
> > > > > > > __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> > > > > > > + __load_hdbss(vcpu);
> > > > > > > }
> > > > > > >
> > > > > > > void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> > > > > > > {
> > > > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > > > __vcpu_put_deactivate_traps(vcpu);
> > > > > > > __vcpu_put_switch_sysregs(vcpu);
> > > > > > >
> > > > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > > > > index 070a01e53fcb..42b0710a16ce 100644
> > > > > > > --- a/arch/arm64/kvm/mmu.c
> > > > > > > +++ b/arch/arm64/kvm/mmu.c
> > > > > > > @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > > > > > if (writable)
> > > > > > > prot |= KVM_PGTABLE_PROT_W;
> > > > > > >
> > > > > > > + if (writable && kvm->arch.enable_hdbss && logging_active)
> > > > > > > + prot |= KVM_PGTABLE_PROT_DBM;
> > > > > > > +
> > > > > > > if (exec_fault)
> > > > > > > prot |= KVM_PGTABLE_PROT_X;
> > > > > > >
> > > > > > > @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> > > > > > > +{
> > > > > > > + int idx, curr_idx;
> > > > > > > + u64 br_el2;
> > > > > > > + u64 *hdbss_buf;
> > > > > > > + struct kvm *kvm = vcpu->kvm;
> > > > > > > +
> > > > > > > + if (!kvm->arch.enable_hdbss)
> > > > > > > + return;
> > > > > > > +
> > > > > > > + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> > > > > > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > > > > > +
> > > > > > > + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> > > > > > > + if (curr_idx == 0 || br_el2 == 0)
> > > > > > > + return;
> > > > > > > +
> > > > > > > + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> > > > > > > + if (!hdbss_buf)
> > > > > > > + return;
> > > > > > > +
> > > > > > > + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> > > > > > > + for (idx = 0; idx < curr_idx; idx++) {
> > > > > > > + u64 gpa;
> > > > > > > +
> > > > > > > + gpa = hdbss_buf[idx];
> > > > > > > + if (!(gpa & HDBSS_ENTRY_VALID))
> > > > > > > + continue;
> > > > > > > +
> > > > > > > + gpa &= HDBSS_ENTRY_IPA;
> > > > > > > + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> > > > > > > + }
> > > > > > This will mark a page dirty in either the dirty_bitmap or the
> > > > > > dirty_ring, depending on which is in use.
> > > > > >
> > > > > > Out of plain curiosity, have you planned/tested for the dirty-ring as
> > > > > > well, or just for the dirty-bitmap?
> > > > > Currently, I have only tested this with dirty-bitmap mode.
> > > > >
> > > > > I will test and ensure the HDBSS feature works correctly with dirty-ring
> > > > > in the next version.
> > > > >
> > > > Thanks!
> > > >
> > > > > > > +
> > > > > > > + /* reset HDBSS index */
> > > > > > > + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> > > > > > > + vcpu->arch.hdbss.next_index = 0;
> > > > > > > + isb();
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> > > > > > > +{
> > > > > > > + u64 prod;
> > > > > > > + u64 fsc;
> > > > > > > +
> > > > > > > + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> > > > > > > + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> > > > > > > +
> > > > > > > + switch (fsc) {
> > > > > > > + case HDBSSPROD_EL2_FSC_OK:
> > > > > > > + /* Buffer full, which is reported as permission fault. */
> > > > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > > > + return 1;
> > > > > > Humm, flushing in a fault handler means hanging there, in IRQ context, for
> > > > > > a while.
> > > > > >
> > > > > > Since we already deal with this on guest_exit (vcpu_put IIUC), why not just
> > > > > > return in a way the vcpu has to exit the inner loop and let it flush there
> > > > > > instead?
> > > > > >
> > > > > > Thanks!
> > > > > > Leo
> > > > > Thanks for the feedback.
> > > > >
> > > > > If we flush on every guest exit (by moving the flush before handle_exit),
> > > > > then we can indeed drop the flush from the fault handler and from
> > > > > vcpu_put.
> > > > >
> > > > > However, given Marc's earlier concern about not imposing this overhead
> > > > > on all vCPUs, I'd rather avoid flushing on every exit.
> > > > >
> > > > > My current plan is to set a request bit in kvm_handle_hdbss_fault (via
> > > > > kvm_make_request), and move the actual flush to the normal exit path,
> > > > > where it can execute in a safe context. This also allows us to remove
> > > > > the flush from the fault handler entirely.
> > > > >
> > > > > Does that approach sound reasonable to you?
> > > > >
> > > > Yes, I think it looks much better, as the fault will cause the guest to
> > > > exit, and it can run the flush on its way back in.
> > > >
> > > > Thanks!
> > > > Leo
> > > >
> > > > > > > + case HDBSSPROD_EL2_FSC_ExternalAbort:
> > > > > > > + case HDBSSPROD_EL2_FSC_GPF:
> > > > > > > + return -EFAULT;
> > > > > > > + default:
> > > > > > > + /* Unknown fault. */
> > > > > > > + WARN_ONCE(1,
> > > > > > > + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> > > > > > > + fsc, prod, vcpu->vcpu_id);
> > > > > > > + return -EFAULT;
> > > > > > > + }
> > > > > > > +}
> > > > > > > +
> > > > > > > /**
> > > > > > > * kvm_handle_guest_abort - handles all 2nd stage aborts
> > > > > > > * @vcpu: the VCPU pointer
> > > > > > > @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > > > > >
> > > > > > > is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
> > > > > > >
> > > > > > > + if (esr_iss2_is_hdbssf(esr))
> > > > > > > + return kvm_handle_hdbss_fault(vcpu);
> > > > > > > +
> > > > > > > if (esr_fsc_is_translation_fault(esr)) {
> > > > > > > /* Beyond sanitised PARange (which is the IPA limit) */
> > > > > > > if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> > > > > > > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> > > > > > > index 959532422d3a..c03a4b310b53 100644
> > > > > > > --- a/arch/arm64/kvm/reset.c
> > > > > > > +++ b/arch/arm64/kvm/reset.c
> > > > > > > @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> > > > > > > free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> > > > > > > kfree(vcpu->arch.vncr_tlb);
> > > > > > > kfree(vcpu->arch.ccsidr);
> > > > > > > +
> > > > > > > + if (vcpu->kvm->arch.enable_hdbss)
> > > > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > > > }
> > > > > > >
> > > > > > > static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> > > > > > > --
> > > > > > > 2.33.0
> > > > > Thanks!
> > > > >
> > > > > Tian
> > > > >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-02-25 4:04 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Tian Zheng
2026-02-25 17:46 ` Leonardo Bras
2026-03-04 15:40 ` Leonardo Bras
@ 2026-03-25 18:05 ` Leonardo Bras
2026-03-25 18:20 ` [PATCH] arm64/kvm: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
` (2 more replies)
2 siblings, 3 replies; 24+ messages in thread
From: Leonardo Bras @ 2026-03-25 18:05 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
Hello Tian,
I am currently working on HACDBS enablement (which will be rebased on top of
this patchset), and since HACDBS and HDBSS are somewhat complementary, I will
occasionally come back with questions about issues I have faced myself on
that part. :)
(see below)
On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> From: eillon <yezhenyu2@huawei.com>
>
> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> migration. This feature is only supported in VHE mode.
>
> Initially, S2 PTEs doesn't contain the DBM attribute. During migration,
> write faults are handled by user_mem_abort, which relaxes permissions
> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> writes no longer trap, as the hardware automatically transitions the page
> from writable-clean to writable-dirty.
>
> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
> enabled, the hardware observes the clean->dirty transition and records
> the corresponding page into the HDBSS buffer.
>
> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> that check_vcpu_requests flushes the HDBSS buffer and propagates the
> accumulated dirty information into the userspace-visible dirty bitmap.
>
> Add fault handling for HDBSS including buffer full, external abort, and
> general protection fault (GPF).
>
> Signed-off-by: eillon <yezhenyu2@huawei.com>
> Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
> ---
> arch/arm64/include/asm/esr.h | 5 ++
> arch/arm64/include/asm/kvm_host.h | 17 +++++
> arch/arm64/include/asm/kvm_mmu.h | 1 +
> arch/arm64/include/asm/sysreg.h | 11 ++++
> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> arch/arm64/kvm/reset.c | 3 +
> 8 files changed, 228 insertions(+)
>
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index 81c17320a588..2e6b679b5908 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -437,6 +437,11 @@
> #ifndef __ASSEMBLER__
> #include <asm/types.h>
>
> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> +{
> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> +}
> +
> static inline unsigned long esr_brk_comment(unsigned long esr)
> {
> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5d5a3bbdb95e..57ee6b53e061 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -55,12 +55,17 @@
> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>
> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> KVM_DIRTY_LOG_INITIALLY_SET)
>
> #define KVM_HAVE_MMU_RWLOCK
>
> +/* HDBSS entry field definitions */
> +#define HDBSS_ENTRY_VALID BIT(0)
> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> +
> /*
> * Mode of operation configurable with kvm-arm.mode early param.
> * See Documentation/admin-guide/kernel-parameters.txt for more information.
> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> u32 __attribute_const__ kvm_target_cpu(void);
> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>
> struct kvm_hyp_memcache {
> phys_addr_t head;
> @@ -405,6 +411,8 @@ struct kvm_arch {
> * the associated pKVM instance in the hypervisor.
> */
> struct kvm_protected_vm pkvm;
> +
> + bool enable_hdbss;
> };
>
> struct kvm_vcpu_fault_info {
> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> bool reset;
> };
>
> +struct vcpu_hdbss_state {
> + phys_addr_t base_phys;
> + u32 size;
> + u32 next_index;
> +};
> +
> struct vncr_tlb;
>
> struct kvm_vcpu_arch {
> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>
> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> struct vncr_tlb *vncr_tlb;
> +
> + /* HDBSS registers info */
> + struct vcpu_hdbss_state hdbss;
> };
>
> /*
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index d968aca0461a..3fea8cfe8869 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>
> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>
> phys_addr_t kvm_mmu_get_httbr(void);
> phys_addr_t kvm_get_idmap_vector(void);
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index f4436ecc630c..d11f4d0dd4e7 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -1039,6 +1039,17 @@
>
> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> GCS_CAP_VALID_TOKEN)
> +
> +/*
> + * Definitions for the HDBSS feature
> + */
> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> +
> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> +
> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> +
> /*
> * Definitions for GICv5 instructions
> */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 29f0326f7e00..d64da05e25c4 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> }
>
> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> +{
> + struct page *hdbss_pg;
> +
> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> + if (hdbss_pg)
> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> +
> + vcpu->arch.hdbss.size = 0;
> +}
> +
> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> + struct kvm_enable_cap *cap)
> +{
> + unsigned long i;
> + struct kvm_vcpu *vcpu;
> + struct page *hdbss_pg = NULL;
> + __u64 size = cap->args[0];
> + bool enable = cap->args[1] ? true : false;
> +
> + if (!system_supports_hdbss())
> + return -EINVAL;
> +
> + if (size > HDBSS_MAX_SIZE)
> + return -EINVAL;
> +
> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> + return 0;
> +
> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> + return -EINVAL;
> +
> + if (!enable) { /* Turn it off */
> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + /* Kick vcpus to flush hdbss buffer. */
> + kvm_vcpu_kick(vcpu);
> +
> + kvm_arm_vcpu_free_hdbss(vcpu);
> + }
> +
> + kvm->arch.enable_hdbss = false;
> +
> + return 0;
> + }
> +
> + /* Turn it on */
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> + if (!hdbss_pg)
> + goto error_alloc;
> +
> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> + .base_phys = page_to_phys(hdbss_pg),
> + .size = size,
> + .next_index = 0,
> + };
> + }
> +
> + kvm->arch.enable_hdbss = true;
> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> +
> + /*
> + * We should kick vcpus out of guest mode here to load new
> + * vtcr value to vtcr_el2 register when re-enter guest mode.
> + */
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + kvm_vcpu_kick(vcpu);
> +
> + return 0;
> +
> +error_alloc:
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (vcpu->arch.hdbss.base_phys)
> + kvm_arm_vcpu_free_hdbss(vcpu);
> + }
> +
> + return -ENOMEM;
> +}
> +
> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> struct kvm_enable_cap *cap)
> {
> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> r = 0;
> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> break;
> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> + mutex_lock(&kvm->lock);
> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> + mutex_unlock(&kvm->lock);
> + break;
> default:
> break;
> }
> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> r = kvm_supports_cacheable_pfnmap();
> break;
>
> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> + r = system_supports_hdbss();
> + break;
> default:
> r = 0;
> }
> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> if (kvm_dirty_ring_check_request(vcpu))
> return 0;
>
> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> + kvm_flush_hdbss_buffer(vcpu);
> +
> check_nested_vcpu_requests(vcpu);
> }
>
> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>
> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> {
> + /*
> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> + * before reporting dirty_bitmap to userspace. Send a request with
> + * KVM_REQUEST_WAIT to flush buffer synchronously.
> + */
> + struct kvm_vcpu *vcpu;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
>
> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> }
>
> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> index 9db3f11a4754..600cbc4f8ae9 100644
> --- a/arch/arm64/kvm/hyp/vhe/switch.c
> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> local_irq_restore(flags);
> }
>
> +static void __load_hdbss(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + u64 br_el2, prod_el2;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
> +
> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> + prod_el2 = vcpu->arch.hdbss.next_index;
> +
> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> +
> + isb();
> +}
> +
I see in the code below you trust that the tracking happens at PAGE_SIZE
granularity (you track with PAGE_SHIFT).

That may be a problem when guest memory is backed by hugepages or transparent
huge pages. When we are using HDBSS there is no fault happening, so we have no
way of doing on-demand block splitting; we need to make use of eager block
splitting _before_ we start to track anything, or else we may end up with
different-sized pages in the HDBSS buffer, which is harder to deal with.

Suggestion: do the eager splitting before we enable HDBSS.

For this to happen, we have to enable the EAGER_SPLIT_CHUNK_SIZE capability,
which can only be enabled when all memslots are empty.

I suggest doing that at kvm_init_stage2_mmu(), checking if HDBSS is
available, in which case we set mmu->split_page_chunk_size to PAGE_SIZE.

I will send a patch you can put before this one to make sure it works :)
Thanks!
Leo
> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> {
> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> __vcpu_load_switch_sysregs(vcpu);
> __vcpu_load_activate_traps(vcpu);
> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> + __load_hdbss(vcpu);
> }
>
> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> {
> + kvm_flush_hdbss_buffer(vcpu);
> __vcpu_put_deactivate_traps(vcpu);
> __vcpu_put_switch_sysregs(vcpu);
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 070a01e53fcb..42b0710a16ce 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> if (writable)
> prot |= KVM_PGTABLE_PROT_W;
>
> + if (writable && kvm->arch.enable_hdbss && logging_active)
> + prot |= KVM_PGTABLE_PROT_DBM;
> +
> if (exec_fault)
> prot |= KVM_PGTABLE_PROT_X;
>
> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> +{
> + int idx, curr_idx;
> + u64 br_el2;
> + u64 *hdbss_buf;
> + struct kvm *kvm = vcpu->kvm;
> +
> + if (!kvm->arch.enable_hdbss)
> + return;
> +
> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> +
> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> + if (curr_idx == 0 || br_el2 == 0)
> + return;
> +
> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> + if (!hdbss_buf)
> + return;
> +
> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> + for (idx = 0; idx < curr_idx; idx++) {
> + u64 gpa;
> +
> + gpa = hdbss_buf[idx];
> + if (!(gpa & HDBSS_ENTRY_VALID))
> + continue;
> +
> + gpa &= HDBSS_ENTRY_IPA;
> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> + }
Here ^
> +
> + /* reset HDBSS index */
> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> + vcpu->arch.hdbss.next_index = 0;
> + isb();
> +}
> +
> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> +{
> + u64 prod;
> + u64 fsc;
> +
> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> +
> + switch (fsc) {
> + case HDBSSPROD_EL2_FSC_OK:
> + /* Buffer full, which is reported as permission fault. */
> + kvm_flush_hdbss_buffer(vcpu);
> + return 1;
> + case HDBSSPROD_EL2_FSC_ExternalAbort:
> + case HDBSSPROD_EL2_FSC_GPF:
> + return -EFAULT;
> + default:
> + /* Unknown fault. */
> + WARN_ONCE(1,
> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> + fsc, prod, vcpu->vcpu_id);
> + return -EFAULT;
> + }
> +}
> +
> /**
> * kvm_handle_guest_abort - handles all 2nd stage aborts
> * @vcpu: the VCPU pointer
> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>
> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>
> + if (esr_iss2_is_hdbssf(esr))
> + return kvm_handle_hdbss_fault(vcpu);
> +
> if (esr_fsc_is_translation_fault(esr)) {
> /* Beyond sanitised PARange (which is the IPA limit) */
> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 959532422d3a..c03a4b310b53 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> kfree(vcpu->arch.vncr_tlb);
> kfree(vcpu->arch.ccsidr);
> +
> + if (vcpu->kvm->arch.enable_hdbss)
> + kvm_arm_vcpu_free_hdbss(vcpu);
> }
>
> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> --
> 2.33.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
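For readers following the flush path quoted above, here is a small userspace C sketch of the drain loop in kvm_flush_hdbss_buffer(). The HDBSS_ENTRY_* layout (valid bit 0, IPA in bits 55:12) matches the patch; mark_gfn_dirty() is a hypothetical stand-in for kvm_vcpu_mark_page_dirty(), and no system registers are involved.

```c
#include <assert.h>
#include <stdint.h>

#define HDBSS_ENTRY_VALID   (1ULL << 0)
#define HDBSS_ENTRY_IPA     (((1ULL << 56) - 1) & ~((1ULL << 12) - 1)) /* bits 55:12 */
#define PAGE_SHIFT_SIM      12

/* Stand-in for kvm_vcpu_mark_page_dirty(): set bit @gfn in a byte bitmap. */
static void mark_gfn_dirty(uint8_t *bitmap, uint64_t gfn)
{
	bitmap[gfn / 8] |= (uint8_t)(1u << (gfn % 8));
}

/*
 * Userspace sketch of the drain loop in kvm_flush_hdbss_buffer(): walk
 * entries [0, prod_idx), skip invalid ones, and mark the recorded IPA's
 * page dirty. Returns the number of pages marked.
 */
static int flush_hdbss_sim(const uint64_t *buf, unsigned int prod_idx,
			   uint8_t *bitmap)
{
	int marked = 0;
	unsigned int i;

	for (i = 0; i < prod_idx; i++) {
		uint64_t entry = buf[i];

		if (!(entry & HDBSS_ENTRY_VALID))
			continue;

		mark_gfn_dirty(bitmap, (entry & HDBSS_ENTRY_IPA) >> PAGE_SHIFT_SIM);
		marked++;
	}
	return marked;
}
```

In the kernel the producer index comes from HDBSSPROD_EL2 and must be reset after the drain; this sketch only models the bitmap propagation step.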
* [PATCH] arm64/kvm: Enable eager hugepage splitting if HDBSS is available
2026-03-25 18:05 ` Leonardo Bras
@ 2026-03-25 18:20 ` Leonardo Bras
2026-03-27 7:40 ` Tian Zheng
2026-03-26 14:31 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Leonardo Bras
2026-03-27 7:35 ` Tian Zheng
2 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2026-03-25 18:20 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
FEAT_HDBSS speeds up guest memory dirty tracking by avoiding a page fault
and saving the entry in a tracking structure.
That may be a problem when guest memory is backed by hugepages or
transparent huge pages: since dirtied pages no longer fault, on-demand
hugepage splitting is not possible, and we must rely on eager hugepage
splitting instead.
So, at stage2 initialization, enable eager hugepage splitting with
chunk = PAGE_SIZE if the system supports HDBSS.
Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
arch/arm64/kvm/mmu.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 070a01e53fcb..bdfa72b7c073 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -993,22 +993,26 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
mmu->last_vcpu_ran = alloc_percpu(typeof(*mmu->last_vcpu_ran));
if (!mmu->last_vcpu_ran) {
err = -ENOMEM;
goto out_destroy_pgtable;
}
for_each_possible_cpu(cpu)
*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
- /* The eager page splitting is disabled by default */
- mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
+ /* The eager page splitting is disabled by default if system has no HDBSS */
+ if (system_supports_hacdbs())
+ mmu->split_page_chunk_size = PAGE_SIZE;
+ else
+ mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
+
mmu->split_page_cache.gfp_zero = __GFP_ZERO;
mmu->pgd_phys = __pa(pgt->pgd);
if (kvm_is_nested_s2_mmu(kvm, mmu))
kvm_init_nested_s2_mmu(mmu);
return 0;
out_destroy_pgtable:
--
2.53.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
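The granularity problem this patch addresses can be shown with a little arithmetic. The 4KB granule and 2MB stage-2 block size below are the common configuration, stated here as assumptions for the sketch: one HDBSS entry marks a single page-sized bitmap bit, while the whole block may have become writable-dirty.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SIM	12	/* 4KB base pages (assumed granule) */
#define PMD_SHIFT_SIM	21	/* 2MB stage-2 block with a 4KB granule */

/* Dirty-bitmap bits (PAGE_SIZE pages) covered by one block mapping. */
static unsigned int pages_per_block(unsigned int block_shift)
{
	return 1u << (block_shift - PAGE_SHIFT_SIM);
}

/* First gfn of the block containing @ipa. */
static uint64_t block_first_gfn(uint64_t ipa, unsigned int block_shift)
{
	return (ipa >> block_shift) << (block_shift - PAGE_SHIFT_SIM);
}

/* The single gfn an HDBSS entry recording @ipa would mark. */
static uint64_t entry_gfn(uint64_t ipa)
{
	return ipa >> PAGE_SHIFT_SIM;
}
```

With a 2MB block left unsplit, a write at IPA 0x203000 records only gfn 515, while pages 512..1023 of that block are all potentially dirty; eager splitting to PAGE_SIZE mappings before tracking starts avoids this mismatch.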
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-25 18:05 ` Leonardo Bras
2026-03-25 18:20 ` [PATCH] arm64/kvm: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
@ 2026-03-26 14:31 ` Leonardo Bras
2026-03-27 7:35 ` Tian Zheng
2 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras @ 2026-03-26 14:31 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
On Wed, Mar 25, 2026 at 06:05:26PM +0000, Leonardo Bras wrote:
> Hello Tian,
>
> I am currently working on HACDBS enablement (which will be rebased on top
> of this patchset) and, since HACDBS and HDBSS are somewhat complementary,
> I will sometimes raise questions about issues I have faced on that part. :)
>
> (see below)
>
> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> > From: eillon <yezhenyu2@huawei.com>
> >
> > HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> > migration. This feature is only supported in VHE mode.
> >
> > Initially, S2 PTEs don't contain the DBM attribute. During migration,
> > write faults are handled by user_mem_abort, which relaxes permissions
> > and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> > writes no longer trap, as the hardware automatically transitions the page
> > from writable-clean to writable-dirty.
> >
> > KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
> > enabled, the hardware observes the clean->dirty transition and records
> > the corresponding page into the HDBSS buffer.
> >
> > During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> > that check_vcpu_requests flushes the HDBSS buffer and propagates the
> > accumulated dirty information into the userspace-visible dirty bitmap.
> >
> > Add fault handling for HDBSS including buffer full, external abort, and
> > general protection fault (GPF).
> >
> > Signed-off-by: eillon <yezhenyu2@huawei.com>
> > Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
> > ---
> > arch/arm64/include/asm/esr.h | 5 ++
> > arch/arm64/include/asm/kvm_host.h | 17 +++++
> > arch/arm64/include/asm/kvm_mmu.h | 1 +
> > arch/arm64/include/asm/sysreg.h | 11 ++++
> > arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> > arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> > arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> > arch/arm64/kvm/reset.c | 3 +
> > 8 files changed, 228 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> > index 81c17320a588..2e6b679b5908 100644
> > --- a/arch/arm64/include/asm/esr.h
> > +++ b/arch/arm64/include/asm/esr.h
> > @@ -437,6 +437,11 @@
> > #ifndef __ASSEMBLER__
> > #include <asm/types.h>
> >
> > +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> > +{
> > + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> > +}
> > +
> > static inline unsigned long esr_brk_comment(unsigned long esr)
> > {
> > return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 5d5a3bbdb95e..57ee6b53e061 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -55,12 +55,17 @@
> > #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> > #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> > #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> > +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
> >
> > #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> > KVM_DIRTY_LOG_INITIALLY_SET)
> >
> > #define KVM_HAVE_MMU_RWLOCK
> >
> > +/* HDBSS entry field definitions */
> > +#define HDBSS_ENTRY_VALID BIT(0)
> > +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> > +
> > /*
> > * Mode of operation configurable with kvm-arm.mode early param.
> > * See Documentation/admin-guide/kernel-parameters.txt for more information.
> > @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> > u32 __attribute_const__ kvm_target_cpu(void);
> > void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> > void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
> >
> > struct kvm_hyp_memcache {
> > phys_addr_t head;
> > @@ -405,6 +411,8 @@ struct kvm_arch {
> > * the associated pKVM instance in the hypervisor.
> > */
> > struct kvm_protected_vm pkvm;
> > +
> > + bool enable_hdbss;
> > };
> >
> > struct kvm_vcpu_fault_info {
> > @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> > bool reset;
> > };
> >
> > +struct vcpu_hdbss_state {
> > + phys_addr_t base_phys;
> > + u32 size;
> > + u32 next_index;
> > +};
> > +
> > struct vncr_tlb;
> >
> > struct kvm_vcpu_arch {
> > @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
> >
> > /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> > struct vncr_tlb *vncr_tlb;
> > +
> > + /* HDBSS registers info */
> > + struct vcpu_hdbss_state hdbss;
> > };
> >
> > /*
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index d968aca0461a..3fea8cfe8869 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >
> > int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
> >
> > phys_addr_t kvm_mmu_get_httbr(void);
> > phys_addr_t kvm_get_idmap_vector(void);
> > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> > index f4436ecc630c..d11f4d0dd4e7 100644
> > --- a/arch/arm64/include/asm/sysreg.h
> > +++ b/arch/arm64/include/asm/sysreg.h
> > @@ -1039,6 +1039,17 @@
> >
> > #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> > GCS_CAP_VALID_TOKEN)
> > +
> > +/*
> > + * Definitions for the HDBSS feature
> > + */
> > +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> > +
> > +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> > + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> > +
> > +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> > +
> > /*
> > * Definitions for GICv5 instructions]
> > */
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 29f0326f7e00..d64da05e25c4 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> > return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> > }
> >
> > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> > +{
> > + struct page *hdbss_pg;
> > +
> > + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> > + if (hdbss_pg)
> > + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> > +
> > + vcpu->arch.hdbss.size = 0;
> > +}
> > +
> > +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> > + struct kvm_enable_cap *cap)
> > +{
> > + unsigned long i;
> > + struct kvm_vcpu *vcpu;
> > + struct page *hdbss_pg = NULL;
> > + __u64 size = cap->args[0];
> > + bool enable = cap->args[1] ? true : false;
> > +
> > + if (!system_supports_hdbss())
> > + return -EINVAL;
> > +
> > + if (size > HDBSS_MAX_SIZE)
> > + return -EINVAL;
> > +
> > + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> > + return 0;
> > +
> > + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> > + return -EINVAL;
> > +
> > + if (!enable) { /* Turn it off */
> > + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> > +
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + /* Kick vcpus to flush hdbss buffer. */
> > + kvm_vcpu_kick(vcpu);
> > +
> > + kvm_arm_vcpu_free_hdbss(vcpu);
> > + }
> > +
> > + kvm->arch.enable_hdbss = false;
> > +
> > + return 0;
> > + }
> > +
> > + /* Turn it on */
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> > + if (!hdbss_pg)
> > + goto error_alloc;
> > +
> > + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> > + .base_phys = page_to_phys(hdbss_pg),
> > + .size = size,
> > + .next_index = 0,
> > + };
> > + }
> > +
> > + kvm->arch.enable_hdbss = true;
> > + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> > +
> > + /*
> > + * We should kick vcpus out of guest mode here to load new
> > + * vtcr value to vtcr_el2 register when re-enter guest mode.
> > + */
> > + kvm_for_each_vcpu(i, vcpu, kvm)
> > + kvm_vcpu_kick(vcpu);
> > +
> > + return 0;
> > +
> > +error_alloc:
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + if (vcpu->arch.hdbss.base_phys)
> > + kvm_arm_vcpu_free_hdbss(vcpu);
> > + }
> > +
> > + return -ENOMEM;
> > +}
> > +
> > int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > struct kvm_enable_cap *cap)
> > {
> > @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > r = 0;
> > set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > break;
> > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > + mutex_lock(&kvm->lock);
> > + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> > + mutex_unlock(&kvm->lock);
> > + break;
> > default:
> > break;
> > }
> > @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > r = kvm_supports_cacheable_pfnmap();
> > break;
> >
> > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > + r = system_supports_hdbss();
> > + break;
> > default:
> > r = 0;
> > }
> > @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > if (kvm_dirty_ring_check_request(vcpu))
> > return 0;
> >
> > + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> > + kvm_flush_hdbss_buffer(vcpu);
> > +
> > check_nested_vcpu_requests(vcpu);
> > }
> >
> > @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
> >
> > void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> > {
> > + /*
> > + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> > + * before reporting dirty_bitmap to userspace. Send a request with
> > + * KVM_REQUEST_WAIT to flush buffer synchronously.
> > + */
> > + struct kvm_vcpu *vcpu;
> > +
> > + if (!kvm->arch.enable_hdbss)
> > + return;
> >
> > + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> > }
> >
> > static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> > diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> > index 9db3f11a4754..600cbc4f8ae9 100644
> > --- a/arch/arm64/kvm/hyp/vhe/switch.c
> > +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> > @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> > local_irq_restore(flags);
> > }
> >
> > +static void __load_hdbss(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm *kvm = vcpu->kvm;
> > + u64 br_el2, prod_el2;
> > +
> > + if (!kvm->arch.enable_hdbss)
> > + return;
> > +
> > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > + prod_el2 = vcpu->arch.hdbss.next_index;
> > +
> > + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> > + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> > +
> > + isb();
> > +}
> > +
>
> I see in the code below you trust that the tracking will happen with
> PAGE_SIZE granularity (you track with PAGE_SHIFT).
Oh, that was misleading :\
The mentioned routine is not wrong AFAICS, but without the patch I sent, if
the VMM does not manually enable eager splitting and guest memory is backed
by some sort of hugepages (even transparent ones, without noticing), it
could end up reporting just PAGE_SIZE instead of the whole hugepage size,
which breaks migration.
Does that make sense?
Thanks!
Leo
>
> That may be a problem when we have guest memory backed by hugepages or
> transparent huge pages.
>
> When we are using HDBSS, there is no fault happening, so we have no way of
> doing on-demand block splitting, so we need to make use of eager block
> splitting, _before_ we start to track anything, or else we may have
> different-sized pages in the HDBSS buffer, which is harder to deal with.
>
> Suggestion: do the eager splitting before we enable HDBSS.
>
> For this to happen, we have to enable the EAGER_SPLIT_CHUNK_SIZE
> capability, which can only be enabled when all memslots are empty.
>
> I suggest doing that at kvm_init_stage2_mmu(), checking if HDBSS is
> supported, in which case we set mmu->split_page_chunk_size to PAGE_SIZE.
>
> I will send a patch you can put before this one to make sure it works :)
>
> Thanks!
> Leo
>
>
> > void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > {
> > host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
> > @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
> > __vcpu_load_switch_sysregs(vcpu);
> > __vcpu_load_activate_traps(vcpu);
> > __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
> > + __load_hdbss(vcpu);
> > }
> >
> > void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
> > {
> > + kvm_flush_hdbss_buffer(vcpu);
> > __vcpu_put_deactivate_traps(vcpu);
> > __vcpu_put_switch_sysregs(vcpu);
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 070a01e53fcb..42b0710a16ce 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > if (writable)
> > prot |= KVM_PGTABLE_PROT_W;
> >
> > + if (writable && kvm->arch.enable_hdbss && logging_active)
> > + prot |= KVM_PGTABLE_PROT_DBM;
> > +
> > if (exec_fault)
> > prot |= KVM_PGTABLE_PROT_X;
> >
> > @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > return 0;
> > }
> >
> > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
> > +{
> > + int idx, curr_idx;
> > + u64 br_el2;
> > + u64 *hdbss_buf;
> > + struct kvm *kvm = vcpu->kvm;
> > +
> > + if (!kvm->arch.enable_hdbss)
> > + return;
> > +
> > + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
> > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > +
> > + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
> > + if (curr_idx == 0 || br_el2 == 0)
> > + return;
> > +
> > + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
> > + if (!hdbss_buf)
> > + return;
> > +
> > + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
> > + for (idx = 0; idx < curr_idx; idx++) {
> > + u64 gpa;
> > +
> > + gpa = hdbss_buf[idx];
> > + if (!(gpa & HDBSS_ENTRY_VALID))
> > + continue;
> > +
> > + gpa &= HDBSS_ENTRY_IPA;
> > + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> > + }
>
> Here ^
>
> > +
> > + /* reset HDBSS index */
> > + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
> > + vcpu->arch.hdbss.next_index = 0;
> > + isb();
> > +}
> > +
> > +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
> > +{
> > + u64 prod;
> > + u64 fsc;
> > +
> > + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
> > + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
> > +
> > + switch (fsc) {
> > + case HDBSSPROD_EL2_FSC_OK:
> > + /* Buffer full, which is reported as permission fault. */
> > + kvm_flush_hdbss_buffer(vcpu);
> > + return 1;
> > + case HDBSSPROD_EL2_FSC_ExternalAbort:
> > + case HDBSSPROD_EL2_FSC_GPF:
> > + return -EFAULT;
> > + default:
> > + /* Unknown fault. */
> > + WARN_ONCE(1,
> > + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
> > + fsc, prod, vcpu->vcpu_id);
> > + return -EFAULT;
> > + }
> > +}
> > +
> > /**
> > * kvm_handle_guest_abort - handles all 2nd stage aborts
> > * @vcpu: the VCPU pointer
> > @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >
> > is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
> >
> > + if (esr_iss2_is_hdbssf(esr))
> > + return kvm_handle_hdbss_fault(vcpu);
> > +
> > if (esr_fsc_is_translation_fault(esr)) {
> > /* Beyond sanitised PARange (which is the IPA limit) */
> > if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
> > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> > index 959532422d3a..c03a4b310b53 100644
> > --- a/arch/arm64/kvm/reset.c
> > +++ b/arch/arm64/kvm/reset.c
> > @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> > free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
> > kfree(vcpu->arch.vncr_tlb);
> > kfree(vcpu->arch.ccsidr);
> > +
> > + if (vcpu->kvm->arch.enable_hdbss)
> > + kvm_arm_vcpu_free_hdbss(vcpu);
> > }
> >
> > static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> > --
> > 2.33.0
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
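The fault dispatch in kvm_handle_hdbss_fault() quoted in this thread can be mirrored in userspace. The FSC field position and the ExternalAbort/GPF codes below are assumptions for illustration only (the kernel takes the real values from the generated sysreg definitions); the sketch shows just the three-way outcome.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout for this sketch: FSC in HDBSSPROD_EL2[31:26]. The
 * ExternalAbort (0b010000) and GPF (0b101000) codes follow the generic
 * fault status encodings; treat all of these values as illustrative.
 */
#define HDBSSPROD_FSC_SHIFT	26
#define HDBSSPROD_FSC_MASK	(0x3FULL << HDBSSPROD_FSC_SHIFT)

enum hdbss_fsc {
	HDBSS_FSC_OK		= 0x00,	/* buffer full, reported as permission fault */
	HDBSS_FSC_EXT_ABORT	= 0x10,
	HDBSS_FSC_GPF		= 0x28,
};

/*
 * Mirrors the dispatch in kvm_handle_hdbss_fault(): return 1 to resume
 * the guest after a flush, or -14 (-EFAULT) for abort/GPF/unknown codes.
 */
static int handle_hdbss_fault_sim(uint64_t prod)
{
	switch ((prod & HDBSSPROD_FSC_MASK) >> HDBSSPROD_FSC_SHIFT) {
	case HDBSS_FSC_OK:
		/* A real handler would flush the buffer here first. */
		return 1;
	case HDBSS_FSC_EXT_ABORT:
	case HDBSS_FSC_GPF:
	default:
		return -14;
	}
}
```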
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
2026-03-25 18:05 ` Leonardo Bras
2026-03-25 18:20 ` [PATCH] arm64/kvm: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
2026-03-26 14:31 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Leonardo Bras
@ 2026-03-27 7:35 ` Tian Zheng
2 siblings, 0 replies; 24+ messages in thread
From: Tian Zheng @ 2026-03-27 7:35 UTC (permalink / raw)
To: Leonardo Bras
Cc: maz, oupton, catalin.marinas, corbet, pbonzini, will, yuzenghui,
wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2, linuxarm,
joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose
On 3/26/2026 2:05 AM, Leonardo Bras wrote:
> Hello Tian,
>
> I am currently working on HACDBS enablement (which will be rebased on top
> of this patchset) and, since HACDBS and HDBSS are somewhat complementary,
> I will sometimes raise questions about issues I have faced on that part. :)
>
> (see below)
Of course! Happy to exchange ideas and learn together.
>
> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>> From: eillon <yezhenyu2@huawei.com>
>>
>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
>> migration. This feature is only supported in VHE mode.
>>
>> Initially, S2 PTEs don't contain the DBM attribute. During migration,
>> write faults are handled by user_mem_abort, which relaxes permissions
>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
>> writes no longer trap, as the hardware automatically transitions the page
>> from writable-clean to writable-dirty.
>>
>> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
>> enabled, the hardware observes the clean->dirty transition and records
>> the corresponding page into the HDBSS buffer.
>>
>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
>> that check_vcpu_requests flushes the HDBSS buffer and propagates the
>> accumulated dirty information into the userspace-visible dirty bitmap.
>>
>> Add fault handling for HDBSS including buffer full, external abort, and
>> general protection fault (GPF).
>>
>> Signed-off-by: eillon <yezhenyu2@huawei.com>
>> Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
>> ---
>> arch/arm64/include/asm/esr.h | 5 ++
>> arch/arm64/include/asm/kvm_host.h | 17 +++++
>> arch/arm64/include/asm/kvm_mmu.h | 1 +
>> arch/arm64/include/asm/sysreg.h | 11 ++++
>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
>> arch/arm64/kvm/reset.c | 3 +
>> 8 files changed, 228 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index 81c17320a588..2e6b679b5908 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -437,6 +437,11 @@
>> #ifndef __ASSEMBLER__
>> #include <asm/types.h>
>>
>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
>> +{
>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
>> +}
>> +
>> static inline unsigned long esr_brk_comment(unsigned long esr)
>> {
>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 5d5a3bbdb95e..57ee6b53e061 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -55,12 +55,17 @@
>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
>>
>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>> KVM_DIRTY_LOG_INITIALLY_SET)
>>
>> #define KVM_HAVE_MMU_RWLOCK
>>
>> +/* HDBSS entry field definitions */
>> +#define HDBSS_ENTRY_VALID BIT(0)
>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>> +
>> /*
>> * Mode of operation configurable with kvm-arm.mode early param.
>> * See Documentation/admin-guide/kernel-parameters.txt for more information.
>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
>> u32 __attribute_const__ kvm_target_cpu(void);
>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>>
>> struct kvm_hyp_memcache {
>> phys_addr_t head;
>> @@ -405,6 +411,8 @@ struct kvm_arch {
>> * the associated pKVM instance in the hypervisor.
>> */
>> struct kvm_protected_vm pkvm;
>> +
>> + bool enable_hdbss;
>> };
>>
>> struct kvm_vcpu_fault_info {
>> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
>> bool reset;
>> };
>>
>> +struct vcpu_hdbss_state {
>> + phys_addr_t base_phys;
>> + u32 size;
>> + u32 next_index;
>> +};
>> +
>> struct vncr_tlb;
>>
>> struct kvm_vcpu_arch {
>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>>
>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
>> struct vncr_tlb *vncr_tlb;
>> +
>> + /* HDBSS registers info */
>> + struct vcpu_hdbss_state hdbss;
>> };
>>
>> /*
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index d968aca0461a..3fea8cfe8869 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>
>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>>
>> phys_addr_t kvm_mmu_get_httbr(void);
>> phys_addr_t kvm_get_idmap_vector(void);
>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>> index f4436ecc630c..d11f4d0dd4e7 100644
>> --- a/arch/arm64/include/asm/sysreg.h
>> +++ b/arch/arm64/include/asm/sysreg.h
>> @@ -1039,6 +1039,17 @@
>>
>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
>> GCS_CAP_VALID_TOKEN)
>> +
>> +/*
>> + * Definitions for the HDBSS feature
>> + */
>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
>> +
>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
>> +
>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
>> +
>> /*
>> * Definitions for GICv5 instructions]
>> */
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 29f0326f7e00..d64da05e25c4 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>> }
>>
>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
>> +{
>> + struct page *hdbss_pg;
>> +
>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
>> + if (hdbss_pg)
>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
>> +
>> + vcpu->arch.hdbss.size = 0;
>> +}
>> +
>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
>> + struct kvm_enable_cap *cap)
>> +{
>> + unsigned long i;
>> + struct kvm_vcpu *vcpu;
>> + struct page *hdbss_pg = NULL;
>> + __u64 size = cap->args[0];
>> + bool enable = cap->args[1] ? true : false;
>> +
>> + if (!system_supports_hdbss())
>> + return -EINVAL;
>> +
>> + if (size > HDBSS_MAX_SIZE)
>> + return -EINVAL;
>> +
>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
>> + return 0;
>> +
>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
>> + return -EINVAL;
>> +
>> + if (!enable) { /* Turn it off */
>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
>> +
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + /* Kick vcpus to flush hdbss buffer. */
>> + kvm_vcpu_kick(vcpu);
>> +
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> + }
>> +
>> + kvm->arch.enable_hdbss = false;
>> +
>> + return 0;
>> + }
>> +
>> + /* Turn it on */
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
>> + if (!hdbss_pg)
>> + goto error_alloc;
>> +
>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
>> + .base_phys = page_to_phys(hdbss_pg),
>> + .size = size,
>> + .next_index = 0,
>> + };
>> + }
>> +
>> + kvm->arch.enable_hdbss = true;
>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
>> +
>> + /*
>> + * We should kick vcpus out of guest mode here to load new
>> + * vtcr value to vtcr_el2 register when re-enter guest mode.
>> + */
>> + kvm_for_each_vcpu(i, vcpu, kvm)
>> + kvm_vcpu_kick(vcpu);
>> +
>> + return 0;
>> +
>> +error_alloc:
>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>> + if (vcpu->arch.hdbss.base_phys)
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> + }
>> +
>> + return -ENOMEM;
>> +}
>> +
>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>> struct kvm_enable_cap *cap)
>> {
>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>> r = 0;
>> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>> break;
>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>> + mutex_lock(&kvm->lock);
>> + r = kvm_cap_arm_enable_hdbss(kvm, cap);
>> + mutex_unlock(&kvm->lock);
>> + break;
>> default:
>> break;
>> }
>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> r = kvm_supports_cacheable_pfnmap();
>> break;
>>
>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>> + r = system_supports_hdbss();
>> + break;
>> default:
>> r = 0;
>> }
>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>> if (kvm_dirty_ring_check_request(vcpu))
>> return 0;
>>
>> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>> + kvm_flush_hdbss_buffer(vcpu);
>> +
>> check_nested_vcpu_requests(vcpu);
>> }
>>
>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>
>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>> {
>> + /*
>> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
>> + * before reporting dirty_bitmap to userspace. Send a request with
>> + * KVM_REQUEST_WAIT to flush buffer synchronously.
>> + */
>> + struct kvm_vcpu *vcpu;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>>
>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
>> }
>>
>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
>> index 9db3f11a4754..600cbc4f8ae9 100644
>> --- a/arch/arm64/kvm/hyp/vhe/switch.c
>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
>> local_irq_restore(flags);
>> }
>>
>> +static void __load_hdbss(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm *kvm = vcpu->kvm;
>> + u64 br_el2, prod_el2;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>> +
>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>> + prod_el2 = vcpu->arch.hdbss.next_index;
>> +
>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
>> +
>> + isb();
>> +}
>> +
> I see in the code below you trust that the tracking will happen with
> PAGE_SIZE granularity (you track with PAGE_SHIFT).
>
> That may be a problem when we have guest memory backed by hugepages or
> transparent huge pages.
>
> When we are using HDBSS, there is no fault happening, so we have no way of
> doing on-demand block splitting, so we need to make use of eager block
> splitting, _before_ we start to track anything, or else we may have
> different-sized pages in the HDBSS buffer, which is harder to deal with.
>
> Suggestion: do the eager splitting before we enable HDBSS.
>
> For this to happen, we have to enable the EAGER_SPLIT_CHUNK_SIZE
> capability, which can only be enabled when all memslots are empty.
>
> I suggest doing that at kvm_init_stage2_mmu(), checking if HDBSS is
> supported, in which case we set mmu->split_page_chunk_size to PAGE_SIZE.
>
> I will send a patch you can put before this one to make sure it works :)
>
> Thanks!
> Leo
Hi Leo,

Thanks for the helpful suggestion. I had previously traced the
hugepage-splitting path during live migration and found that when
migration starts, enabling dirty logging triggers the splitting path. I
also tested HDBSS with traditional hugepages and haven't observed any
issues yet.

However, your concern is valid: there may be cases not covered,
especially when the VMM uses transparent hugepages. I'll integrate your
patch into the next version and run some tests.
For reference, here's the path I traced:
```
- userspace, e.g., QEMU
kvm_log_start
+-> kvm_section_update_flags
    +-> kvm_slot_update_flags
        |
        | // For each memory region, QEMU issues a KVM_SET_USER_MEMORY_REGION ioctl.
        | // Before issuing it, flags are updated to include KVM_MEM_LOG_DIRTY_PAGES.
        +-> kvm_mem_flags
        +-> kvm_set_user_memory_region // ioctl that enables dirty logging on the memslot

- KVM
KVM_SET_USER_MEMORY_REGION
+-> kvm_vm_ioctl_set_memory_region
    +-> kvm_set_memory_region / __kvm_set_memory_region
        +-> kvm_set_memslot
            +-> kvm_commit_memory_region
                +-> kvm_arch_commit_memory_region
                    +-> kvm_mmu_split_memory_region
                        // Splits Stage-2 hugepages/contiguous mappings into 4KB PTEs.
                        +-> kvm_mmu_split_huge_pages
                            +-> kvm_pgtable_stage2_split
```
Thanks again for the detailed explanation and for sending the patch.
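For readers following the quoted kvm_flush_hdbss_buffer() loop, the core of
the flush is a simple decode of each buffer entry into a dirty GFN. The
user-space sketch below models only that decode; the bit layout (valid flag
in bit 0, page-aligned IPA in bits [55:12]) is an illustrative assumption
standing in for the kernel's HDBSS_ENTRY_VALID/HDBSS_ENTRY_IPA definitions,
not the architectural encoding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's HDBSS entry masks. */
#define PAGE_SHIFT        12
#define HDBSS_ENTRY_VALID (1ULL << 0)
#define HDBSS_ENTRY_IPA   0x00FFFFFFFFFFF000ULL

/*
 * Walk the produced part of the buffer and collect the dirty GFNs,
 * skipping invalid entries, as the flush loop does before calling
 * kvm_vcpu_mark_page_dirty(). Returns the number of GFNs written.
 */
static size_t hdbss_collect_dirty(const uint64_t *buf, size_t prod_idx,
                                  uint64_t *gfns, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < prod_idx && n < max; i++) {
		if (!(buf[i] & HDBSS_ENTRY_VALID))
			continue;
		gfns[n++] = (buf[i] & HDBSS_ENTRY_IPA) >> PAGE_SHIFT;
	}
	return n;
}
```

Note this is also where the PAGE_SHIFT assumption Leo flagged lives: every
entry is treated as one base page, which is only safe once eager splitting
guarantees no larger mappings are being tracked.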
>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>> {
>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>> __vcpu_load_switch_sysregs(vcpu);
>> __vcpu_load_activate_traps(vcpu);
>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
>> + __load_hdbss(vcpu);
>> }
>>
>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
>> {
>> + kvm_flush_hdbss_buffer(vcpu);
>> __vcpu_put_deactivate_traps(vcpu);
>> __vcpu_put_switch_sysregs(vcpu);
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 070a01e53fcb..42b0710a16ce 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> if (writable)
>> prot |= KVM_PGTABLE_PROT_W;
>>
>> + if (writable && kvm->arch.enable_hdbss && logging_active)
>> + prot |= KVM_PGTABLE_PROT_DBM;
>> +
>> if (exec_fault)
>> prot |= KVM_PGTABLE_PROT_X;
>>
>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>> return 0;
>> }
>>
>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
>> +{
>> + int idx, curr_idx;
>> + u64 br_el2;
>> + u64 *hdbss_buf;
>> + struct kvm *kvm = vcpu->kvm;
>> +
>> + if (!kvm->arch.enable_hdbss)
>> + return;
>> +
>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>> +
>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
>> + if (curr_idx == 0 || br_el2 == 0)
>> + return;
>> +
>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
>> + if (!hdbss_buf)
>> + return;
>> +
>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
>> + for (idx = 0; idx < curr_idx; idx++) {
>> + u64 gpa;
>> +
>> + gpa = hdbss_buf[idx];
>> + if (!(gpa & HDBSS_ENTRY_VALID))
>> + continue;
>> +
>> + gpa &= HDBSS_ENTRY_IPA;
>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>> + }
> Here ^
Thanks!
Tian
>
>> +
>> + /* reset HDBSS index */
>> + write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>> + vcpu->arch.hdbss.next_index = 0;
>> + isb();
>> +}
>> +
>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>> +{
>> + u64 prod;
>> + u64 fsc;
>> +
>> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>> +
>> + switch (fsc) {
>> + case HDBSSPROD_EL2_FSC_OK:
>> + /* Buffer full, which is reported as permission fault. */
>> + kvm_flush_hdbss_buffer(vcpu);
>> + return 1;
>> + case HDBSSPROD_EL2_FSC_ExternalAbort:
>> + case HDBSSPROD_EL2_FSC_GPF:
>> + return -EFAULT;
>> + default:
>> + /* Unknown fault. */
>> + WARN_ONCE(1,
>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
>> + fsc, prod, vcpu->vcpu_id);
>> + return -EFAULT;
>> + }
>> +}
>> +
>> /**
>> * kvm_handle_guest_abort - handles all 2nd stage aborts
>> * @vcpu: the VCPU pointer
>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>
>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>
>> + if (esr_iss2_is_hdbssf(esr))
>> + return kvm_handle_hdbss_fault(vcpu);
>> +
>> if (esr_fsc_is_translation_fault(esr)) {
>> /* Beyond sanitised PARange (which is the IPA limit) */
>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index 959532422d3a..c03a4b310b53 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>> kfree(vcpu->arch.vncr_tlb);
>> kfree(vcpu->arch.ccsidr);
>> +
>> + if (vcpu->kvm->arch.enable_hdbss)
>> + kvm_arm_vcpu_free_hdbss(vcpu);
>> }
>>
>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>> --
>> 2.33.0
>>
* Re: [PATCH] arm64/kvm: Enable eager hugepage splitting if HDBSS is available
2026-03-25 18:20 ` [PATCH] arm64/kvm: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
@ 2026-03-27 7:40 ` Tian Zheng
0 siblings, 0 replies; 24+ messages in thread
From: Tian Zheng @ 2026-03-27 7:40 UTC (permalink / raw)
To: Leonardo Bras
Cc: maz, oupton, catalin.marinas, corbet, pbonzini, will, yuzenghui,
wangzhou1, liuyonglong, Jonathan.Cameron, yezhenyu2, linuxarm,
joey.gouly, kvmarm, kvm, linux-arm-kernel, linux-doc,
linux-kernel, skhan, suzuki.poulose
On 3/26/2026 2:20 AM, Leonardo Bras wrote:
> FEAT_HDBSS speeds up guest memory dirty tracking by avoiding a page fault
> and saving the entry in a tracking structure.
>
> That may be a problem when we have guest memory backed by hugepages or
> transparent huge pages, as it's not possible to do on-demand hugepage
> splitting, relying only on eager hugepage splitting.
>
> So, at stage2 initialization, enable eager hugepage splitting with
> chunk = PAGE_SIZE if the system supports HDBSS.
>
> Signed-off-by: Leonardo Bras <leo.bras@arm.com>
> ---
> arch/arm64/kvm/mmu.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 070a01e53fcb..bdfa72b7c073 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -993,22 +993,26 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>
> mmu->last_vcpu_ran = alloc_percpu(typeof(*mmu->last_vcpu_ran));
> if (!mmu->last_vcpu_ran) {
> err = -ENOMEM;
> goto out_destroy_pgtable;
> }
>
> for_each_possible_cpu(cpu)
> *per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
>
> - /* The eager page splitting is disabled by default */
> - mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> + /* The eager page splitting is disabled by default if system has no HDBSS */
> + if (system_supports_hacdbs())
> + mmu->split_page_chunk_size = PAGE_SIZE;
> + else
> + mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> +
> mmu->split_page_cache.gfp_zero = __GFP_ZERO;
>
> mmu->pgd_phys = __pa(pgt->pgd);
>
> if (kvm_is_nested_s2_mmu(kvm, mmu))
> kvm_init_nested_s2_mmu(mmu);
>
> return 0;
>
> out_destroy_pgtable:
Thanks again for sending this patch. I'll integrate it into the next
version and run some tests.
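The decision Leo's patch makes in kvm_init_stage2_mmu() reduces to a
one-line policy: with HDBSS there is no fault on which to split lazily, so
eager splitting must go all the way down to PAGE_SIZE. A minimal model of
that choice, with illustrative values (4K pages, 0 meaning splitting
disabled) rather than the kernel's actual constants:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values; the kernel's defaults may differ. */
#define MODEL_PAGE_SIZE 4096ULL
#define MODEL_EAGER_SPLIT_CHUNK_SIZE_DEFAULT 0ULL /* 0 = disabled */

/*
 * Mirror of the chunk-size selection in the patch above: if the
 * system supports HDBSS, split eagerly to base pages so every
 * tracked entry is PAGE_SIZE; otherwise keep the default.
 */
static uint64_t stage2_split_chunk_size(bool system_supports_hdbss)
{
	if (system_supports_hdbss)
		return MODEL_PAGE_SIZE;
	return MODEL_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
}
```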
Thread overview: 24+ messages
2026-02-25 4:04 [PATCH v3 0/5] Support the FEAT_HDBSS introduced in Armv9.5 Tian Zheng
2026-02-25 4:04 ` [PATCH v3 1/5] arm64/sysreg: Add HDBSS related register information Tian Zheng
2026-02-25 4:04 ` [PATCH v3 2/5] KVM: arm64: Add support to set the DBM attr during memory abort Tian Zheng
2026-02-25 4:04 ` [PATCH v3 3/5] KVM: arm64: Add support for FEAT_HDBSS Tian Zheng
2026-02-25 4:04 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Tian Zheng
2026-02-25 17:46 ` Leonardo Bras
2026-02-27 10:47 ` Tian Zheng
2026-02-27 14:10 ` Leonardo Bras
2026-03-04 3:06 ` Tian Zheng
2026-03-04 12:08 ` Leonardo Bras
2026-03-05 7:37 ` Tian Zheng
2026-03-04 15:40 ` Leonardo Bras
2026-03-06 9:27 ` Tian Zheng
2026-03-06 15:01 ` Leonardo Bras
2026-03-12 6:17 ` Tian Zheng
2026-03-12 12:06 ` Leonardo Bras
2026-03-12 13:13 ` Tian Zheng
2026-03-12 14:58 ` Leonardo Bras
2026-03-25 18:05 ` Leonardo Bras
2026-03-25 18:20 ` [PATCH] arm64/kvm: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
2026-03-27 7:40 ` Tian Zheng
2026-03-26 14:31 ` [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events Leonardo Bras
2026-03-27 7:35 ` Tian Zheng
2026-02-25 4:04 ` [PATCH v3 5/5] KVM: arm64: Document HDBSS ioctl Tian Zheng