From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id 86204105A596
	for <linux-arm-kernel@archiver.kernel.org>; Thu, 12 Mar 2026 13:14:51 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help
	:List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding:
	Content-Type:In-Reply-To:From:References:CC:To:Subject:MIME-Version:Date:
	Message-ID:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner;
	bh=ROHbddhlfnSyZntOoKBZcuS4wBpqWmyBoJb90KwaOAg=; b=FsMFv5dZ9Ny4jEMBmxU0LhaMC5
	zwkenQk0h6otGtZAZ/Gtfv/UK1AaaXIRK8oBrd57XVwHZC/YwXEIbKhbTGBKxTd99j0/QZR/GPhus
	MMQLPxfm9eTonTWJuG4GVvQ5/k/BYzTSxxQJ/3WfSbiFtmpCi2Z6ftVYORmMEubNpL8fuTbzWya64
	6MWIHfbc9FDEFibwRpul4kinCstXJhPHpqA0m0/p91S9olCJSOGae2KJhJ/bGbEJHCDhDGEQk4tde
	+vsmZtAP9rJt2daZI/5Woo00lU8V6HSDAxqizdyh3OeGd/KlzCNbGKZNgHJ1LJ6s206nmGQ0NaUCw
	4MRbCYJA==;
Received: from localhost ([::1] helo=bombadil.infradead.org)
	by bombadil.infradead.org with esmtp (Exim 4.98.2 #2 (Red Hat Linux))
	id 1w0fsB-0000000E7Jy-30Ni;
	Thu, 12 Mar 2026 13:14:43 +0000
Received: from casper.infradead.org ([2001:8b0:10b:1236::1])
	by bombadil.infradead.org with esmtps (Exim 4.98.2 #2 (Red Hat Linux))
	id 1w0fs9-0000000E7Jq-2DPx
	for linux-arm-kernel@bombadil.infradead.org;
	Thu, 12 Mar 2026 13:14:42 +0000
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:Content-Type:
	In-Reply-To:From:References:CC:To:Subject:MIME-Version:Date:Message-ID:Sender
	:Reply-To:Content-ID:Content-Description;
	bh=ROHbddhlfnSyZntOoKBZcuS4wBpqWmyBoJb90KwaOAg=; b=aH8OjQ8m/7+e2qPZVsaeUFNxOn
	4RJ2uLtMjDaZnYx7v8p1jKstZTvELEzVTWLUKIK9GoXLMqF2JLkrWouaHj3FqoXN76+8Nv7fN4Ukm
	SZCvSNu3yLXIWukQEETIG0zT6BobCqCTWFyBJAmt25xDO+Bd/bz/uFZmR/LPRZUHQru6IW5jyc9Ye
	5gzTa3+CmRtVUNjVCEzwjWASFXFx4Yl2/4xLmiDxTJIp00czTfoEhLMjP6JBjgdGOn0mTjOEowBwE
	YUaBXtKoQZCBX+ut54x8Gr9xH1vtfbOJtaQPF56tvrkrLDdc32iGg/y57H5Jnqo6/F4xi8kKI7hgX
	yKbxraVA==;
Received: from canpmsgout05.his.huawei.com ([113.46.200.220])
	by casper.infradead.org with esmtps (Exim 4.98.2 #2 (Red Hat Linux))
	id 1w0fs5-0000000BMrM-2rPC
	for linux-arm-kernel@lists.infradead.org;
	Thu, 12 Mar 2026 13:14:40 +0000
dkim-signature: v=1; a=rsa-sha256; d=huawei.com; s=dkim;
	c=relaxed/relaxed; q=dns/txt;
	h=From;
	bh=ROHbddhlfnSyZntOoKBZcuS4wBpqWmyBoJb90KwaOAg=;
	b=X4BJVo2xYh3l6GLGIH2t1If3cH/0Jh06DnJy23c3J5e2mH/22Yps3M/+Z8DC1lfu8ECK/2sj6
	QfUMy8nxG11psB3yWqk0eVLkA4BKqV0i4IcgVpMWksH34fQ+rXM1bftXlugTQ5u2JzTQeXsZNhC
	97BTgXSVQM2Q1G2FiORigE0=
Received: from mail.maildlp.com (unknown [172.19.163.104])
	by canpmsgout05.his.huawei.com (SkyGuard) with ESMTPS id 4fWnwc0xXLz12LD4;
	Thu, 12 Mar 2026 21:08:00 +0800 (CST)
Received: from kwepemr100010.china.huawei.com (unknown [7.202.195.125])
	by mail.maildlp.com (Postfix) with ESMTPS id DA2574056A;
	Thu, 12 Mar 2026 21:13:34 +0800 (CST)
Received: from [10.67.120.103] (10.67.120.103) by
 kwepemr100010.china.huawei.com (7.202.195.125) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.2.1544.36; Thu, 12 Mar 2026 21:13:33 +0800
Message-ID: <46657927-5608-46f0-9b4c-159407f08880@huawei.com>
Date: Thu, 12 Mar 2026 21:13:33 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird Beta
Subject: Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF
 events
To: Leonardo Bras <leo.bras@arm.com>
CC: <maz@kernel.org>, <oupton@kernel.org>, <catalin.marinas@arm.com>,
	<corbet@lwn.net>, <pbonzini@redhat.com>, <will@kernel.org>,
	<yuzenghui@huawei.com>, <wangzhou1@hisilicon.com>, <liuyonglong@huawei.com>,
	<Jonathan.Cameron@huawei.com>, <yezhenyu2@huawei.com>, <linuxarm@huawei.com>,
	<joey.gouly@arm.com>, <kvmarm@lists.linux.dev>, <kvm@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>, <linux-doc@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <skhan@linuxfoundation.org>,
	<suzuki.poulose@arm.com>
References: <20260225040421.2683931-1-zhengtian10@huawei.com>
 <20260225040421.2683931-5-zhengtian10@huawei.com>
 <aahSaJTVeMBoRbUE@devkitleo>
 <ee584a49-ce69-443b-97c0-37f24f78bdbb@huawei.com>
 <aarsQwKrNB7HQgbj@devkitleo>
 <31973faa-746d-41b5-9b9b-6e459564308d@huawei.com>
 <abKsK4otxunAXrAF@devkitleo>
From: Tian Zheng <zhengtian10@huawei.com>
In-Reply-To: <abKsK4otxunAXrAF@devkitleo>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
X-Originating-IP: [10.67.120.103]
X-ClientProxiedBy: kwepems100002.china.huawei.com (7.221.188.206) To
 kwepemr100010.china.huawei.com (7.202.195.125)
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 
X-CRM114-CacheID: sfid-20260312_131438_343008_D4AE8F0F 
X-CRM114-Status: GOOD (  26.79  )
X-BeenThere: linux-arm-kernel@lists.infradead.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: <linux-arm-kernel.lists.infradead.org>
List-Unsubscribe: <http://lists.infradead.org/mailman/options/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
List-Subscribe: <http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org


On 3/12/2026 8:06 PM, Leonardo Bras wrote:
> On Thu, Mar 12, 2026 at 02:17:41PM +0800, Tian Zheng wrote:
>> On 3/6/2026 11:01 PM, Leonardo Bras wrote:
>>> On Fri, Mar 06, 2026 at 05:27:58PM +0800, Tian Zheng wrote:
>>>> Hi Leo,
>>>>
>>>> On 3/4/2026 11:40 PM, Leonardo Bras wrote:
>>>>> Hi Tian,
>>>>>
>>>>> Few extra notes/questions below
>>>>>
>>>>> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>>>>>> From: eillon<yezhenyu2@huawei.com>
>>>>>>
>>>>>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
>>>>>> migration. This feature is only supported in VHE mode.
>>>>>>
>>>>>> Initially, S2 PTEs doesn't contain the DBM attribute. During migration,
>>>>>> write faults are handled by user_mem_abort, which relaxes permissions
>>>>>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
>>>>>> writes no longer trap, as the hardware automatically transitions the page
>>>>>> from writable-clean to writable-dirty.
>>>>>>
>>>>>> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
>>>>>> enabled, the hardware observes the clean->dirty transition and records
>>>>>> the corresponding page into the HDBSS buffer.
>>>>>>
>>>>>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
>>>>>> that check_vcpu_requests flushes the HDBSS buffer and propagates the
>>>>>> accumulated dirty information into the userspace-visible dirty bitmap.
>>>>>>
>>>>>> Add fault handling for HDBSS including buffer full, external abort, and
>>>>>> general protection fault (GPF).
>>>>>>
>>>>>> Signed-off-by: eillon<yezhenyu2@huawei.com>
>>>>>> Signed-off-by: Tian Zheng<zhengtian10@huawei.com>
>>>>>> ---
>>>>>>     arch/arm64/include/asm/esr.h      |   5 ++
>>>>>>     arch/arm64/include/asm/kvm_host.h |  17 +++++
>>>>>>     arch/arm64/include/asm/kvm_mmu.h  |   1 +
>>>>>>     arch/arm64/include/asm/sysreg.h   |  11 ++++
>>>>>>     arch/arm64/kvm/arm.c              | 102 ++++++++++++++++++++++++++++++
>>>>>>     arch/arm64/kvm/hyp/vhe/switch.c   |  19 ++++++
>>>>>>     arch/arm64/kvm/mmu.c              |  70 ++++++++++++++++++++
>>>>>>     arch/arm64/kvm/reset.c            |   3 +
>>>>>>     8 files changed, 228 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>>>>>> index 81c17320a588..2e6b679b5908 100644
>>>>>> --- a/arch/arm64/include/asm/esr.h
>>>>>> +++ b/arch/arm64/include/asm/esr.h
>>>>>> @@ -437,6 +437,11 @@
>>>>>>     #ifndef __ASSEMBLER__
>>>>>>     #include <asm/types.h>
>>>>>>
>>>>>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
>>>>>> +{
>>>>>> +	return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
>>>>>> +}
>>>>>> +
>>>>>>     static inline unsigned long esr_brk_comment(unsigned long esr)
>>>>>>     {
>>>>>>     	return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
>>>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>>>> index 5d5a3bbdb95e..57ee6b53e061 100644
>>>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>>>> @@ -55,12 +55,17 @@
>>>>>>     #define KVM_REQ_GUEST_HYP_IRQ_PENDING	KVM_ARCH_REQ(9)
>>>>>>     #define KVM_REQ_MAP_L1_VNCR_EL2		KVM_ARCH_REQ(10)
>>>>>>     #define KVM_REQ_VGIC_PROCESS_UPDATE	KVM_ARCH_REQ(11)
>>>>>> +#define KVM_REQ_FLUSH_HDBSS			KVM_ARCH_REQ(12)
>>>>>>
>>>>>>     #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>>>>>>     				     KVM_DIRTY_LOG_INITIALLY_SET)
>>>>>>
>>>>>>     #define KVM_HAVE_MMU_RWLOCK
>>>>>>
>>>>>> +/* HDBSS entry field definitions */
>>>>>> +#define HDBSS_ENTRY_VALID BIT(0)
>>>>>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
>>>>>> +
>>>>>>     /*
>>>>>>      * Mode of operation configurable with kvm-arm.mode early param.
>>>>>>      * See Documentation/admin-guide/kernel-parameters.txt for more information.
>>>>>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
>>>>>>     u32 __attribute_const__ kvm_target_cpu(void);
>>>>>>     void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>>>>>>     void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>>>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
>>>>>>
>>>>>>     struct kvm_hyp_memcache {
>>>>>>     	phys_addr_t head;
>>>>>> @@ -405,6 +411,8 @@ struct kvm_arch {
>>>>>>     	 * the associated pKVM instance in the hypervisor.
>>>>>>     	 */
>>>>>>     	struct kvm_protected_vm pkvm;
>>>>>> +
>>>>>> +	bool enable_hdbss;
>>>>>>     };
>>>>>>
>>>>>>     struct kvm_vcpu_fault_info {
>>>>>> @@ -816,6 +824,12 @@ struct vcpu_reset_state {
>>>>>>     	bool		reset;
>>>>>>     };
>>>>>>
>>>>>> +struct vcpu_hdbss_state {
>>>>>> +	phys_addr_t base_phys;
>>>>>> +	u32 size;
>>>>>> +	u32 next_index;
>>>>>> +};
>>>>>> +
>>>>> IIUC this is used once both on enable/disable and massively on
>>>>> vcpu_put/get.
>>>>>
>>>>> What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead?
>>>>> That way we avoid having masking operations in put/get as well as any
>>>>> possible error we may have formatting those.
>>>>>
>>>>> The cost is doing those operations once for enable and once for disable,
>>>>> which should be fine.
>>> Hi Tian,
>>>
>>>> Thanks for the suggestion. I actually started with storing the raw system
>>>> register
>>>>
>>>> values, as you proposed.
>>>>
>>>>
>>>> However, after discussing it with Oliver Upton in v1, we felt that keeping
>>>> the base address,
>>>>
>>>> size, and index as separate fields makes the state easier to understand.
>>>>
>>>>
>>>> Discussion
>>>> link:https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/
>>>> <https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/>
>>>>
>>>>
>>>> That's why I ended up changing the storage approach in the end.
>>>>
>>>>
>>> Humm, FWIW I disagree with the above argument.
>>> I would argue that vcpu_put should save the registers, and not
>>> actually know what they are about or how are they formatted at this point.
>>>
>>> The responsibility of understanding it's fields and usage value should be
>>> in the code that actually uses it.
>>>
>>> IIUC on kvm_vcpu_put_vhe and kvm_vcpu_load_vhe there are calls to other
>>> functions than only save the register as it is.
>>
>> ok, thx! I'll update the struct to store only the raw register values.
> Awesome :)
>
>>
>>>>>>     struct vncr_tlb;
>>>>>>
>>>>>>     struct kvm_vcpu_arch {
>>>>>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
>>>>>>
>>>>>>     	/* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
>>>>>>     	struct vncr_tlb	*vncr_tlb;
>>>>>> +
>>>>>> +	/* HDBSS registers info */
>>>>>> +	struct vcpu_hdbss_state hdbss;
>>>>>>     };
>>>>>>
>>>>>>     /*
>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>>>>> index d968aca0461a..3fea8cfe8869 100644
>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>>>>>
>>>>>>     int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>>>>>>     int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>>>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
>>>>>>
>>>>>>     phys_addr_t kvm_mmu_get_httbr(void);
>>>>>>     phys_addr_t kvm_get_idmap_vector(void);
>>>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>>>> index f4436ecc630c..d11f4d0dd4e7 100644
>>>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>>>> @@ -1039,6 +1039,17 @@
>>>>>>
>>>>>>     #define GCS_CAP(x)	((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
>>>>>>     					       GCS_CAP_VALID_TOKEN)
>>>>>> +
>>>>>> +/*
>>>>>> + * Definitions for the HDBSS feature
>>>>>> + */
>>>>>> +#define HDBSS_MAX_SIZE		HDBSSBR_EL2_SZ_2MB
>>>>>> +
>>>>>> +#define HDBSSBR_EL2(baddr, sz)	(((baddr) & GENMASK(55, 12 + sz)) | \
>>>>>> +				 FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
>>>>>> +
>>>>>> +#define HDBSSPROD_IDX(prod)	FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
>>>>>> +
>>>>>>     /*
>>>>>>      * Definitions for GICv5 instructions
>>>>>>      */
>>>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>>>> index 29f0326f7e00..d64da05e25c4 100644
>>>>>> --- a/arch/arm64/kvm/arm.c
>>>>>> +++ b/arch/arm64/kvm/arm.c
>>>>>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>>>>>     	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>>>>>>     }
>>>>>>
>>>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> +	struct page *hdbss_pg;
>>>>>> +
>>>>>> +	hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
>>>>>> +	if (hdbss_pg)
>>>>>> +		__free_pages(hdbss_pg, vcpu->arch.hdbss.size);
>>>>>> +
>>>>>> +	vcpu->arch.hdbss.size = 0;
>>>>>> +}
>>>>>> +
>>>>>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
>>>>>> +				    struct kvm_enable_cap *cap)
>>>>>> +{
>>>>>> +	unsigned long i;
>>>>>> +	struct kvm_vcpu *vcpu;
>>>>>> +	struct page *hdbss_pg = NULL;
>>>>>> +	__u64 size = cap->args[0];
>>>>>> +	bool enable = cap->args[1] ? true : false;
>>>>>> +
>>>>>> +	if (!system_supports_hdbss())
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	if (size > HDBSS_MAX_SIZE)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
>>>>>> +		return 0;
>>>>>> +
>>>>>> +	if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	if (!enable) { /* Turn it off */
>>>>>> +		kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
>>>>>> +
>>>>>> +		kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> +			/* Kick vcpus to flush hdbss buffer. */
>>>>>> +			kvm_vcpu_kick(vcpu);
>>>>>> +
>>>>>> +			kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>> +		}
>>>>>> +
>>>>>> +		kvm->arch.enable_hdbss = false;
>>>>>> +
>>>>>> +		return 0;
>>>>>> +	}
>>>>>> +
>>>>>> +	/* Turn it on */
>>>>>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> +		hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
>>>>>> +		if (!hdbss_pg)
>>>>>> +			goto error_alloc;
>>>>>> +
>>>>>> +		vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
>>>>>> +			.base_phys = page_to_phys(hdbss_pg),
>>>>>> +			.size = size,
>>>>>> +			.next_index = 0,
>>>>>> +		};
>>>>>> +	}
>>>>>> +
>>>>>> +	kvm->arch.enable_hdbss = true;
>>>>>> +	kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * We should kick vcpus out of guest mode here to load new
>>>>>> +	 * vtcr value to vtcr_el2 register when re-enter guest mode.
>>>>>> +	 */
>>>>>> +	kvm_for_each_vcpu(i, vcpu, kvm)
>>>>>> +		kvm_vcpu_kick(vcpu);
>>>>>> +
>>>>>> +	return 0;
>>>>>> +
>>>>>> +error_alloc:
>>>>>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> +		if (vcpu->arch.hdbss.base_phys)
>>>>>> +			kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>> +	}
>>>>>> +
>>>>>> +	return -ENOMEM;
>>>>>> +}
>>>>>> +
>>>>>>     int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>>>     			    struct kvm_enable_cap *cap)
>>>>>>     {
>>>>>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>>>     		r = 0;
>>>>>>     		set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>>>>>>     		break;
>>>>>> +	case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>>>> +		mutex_lock(&kvm->lock);
>>>>>> +		r = kvm_cap_arm_enable_hdbss(kvm, cap);
>>>>>> +		mutex_unlock(&kvm->lock);
>>>>>> +		break;
>>>>>>     	default:
>>>>>>     		break;
>>>>>>     	}
>>>>>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>>>     			r = kvm_supports_cacheable_pfnmap();
>>>>>>     		break;
>>>>>>
>>>>>> +	case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>>>> +		r = system_supports_hdbss();
>>>>>> +		break;
>>>>>>     	default:
>>>>>>     		r = 0;
>>>>>>     	}
>>>>>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>>>>     		if (kvm_dirty_ring_check_request(vcpu))
>>>>>>     			return 0;
>>>>>>
>>>>>> +		if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>>>>>> +			kvm_flush_hdbss_buffer(vcpu);
>>>>> I am curious on why we need a flush-hdbss request,
>>>>> Don't we have the flush function happening every time we run vcpu_put?
>>>>>
>>>>> Oh, I see, you want to check if there is anything needed inside the inner
>>>>> loop of vcpu_run, without having to vcpu_put. I think it makes sense.
>>>>>
>>>>> But instead of having this on guest entry, does not it make more sense to
>>>>> have it in guest exit? This way we flush every time (if needed) we exit the
>>>>> guest, and instead of having a vcpu request, we just require a vcpu kick
>>>>> and it should flush if needed.
>>>>>
>>>>> Maybe have vcpu_put just save the registers, and add a the flush before
>>>>> handle_exit.
>>>>>
>>>>> What do you think?
>>>> Thank you for the feedback.
>>>>
>>>>
>>>> Indeed, in the initial version (v1), I placed the flush operation inside
>>>> handle_exit and
>>>>
>>>> used a vcpu_kick in kvm_arch_sync_dirty_log to trigger the flush of the
>>>> HDBSS buffer.
>>>>
>>>>
>>>> However, during the review, Marc pointed out that calling this function on
>>>> every exit
>>>>
>>>> event is too frequent if it's not always needed.
>>>>
>>>>
>>>> Discussion link:
>>>> _https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/_
>>>>
>>>>
>>>> I agreed with his assessment. Therefore, in the current version, I've
>>>> separated the flush
>>>>
>>>> operation into more specific and less frequent points:
>>>>
>>>>
>>>> 1. In vcpu_put
>>>>
>>>> 2. During dirty log synchronization, by kicking the vCPU to trigger a
>>>> request that flushes
>>>>
>>>> on its next exit.
>>>>
>>>>
>>>> 3. When handling a specific HDBSSF event.
>>>>
>>>>
>>>> This ensures the flush happens only when necessary, avoiding the overhead of
>>>> doing it
>>>>
>>>> on every guest exit.
>>>>
>>> Fair enough, calling it every time you go in the inner loop may be too
>>> much, even with a check to make sure it needs to run.
>>>
>>> Having it as a request means you may do that sometimes without
>>> leaving the inner loop. That could be useful if you want to use it with the
>>> IRQ handler to deal with full buffer, or any error, as well as dealing with
>>> a regular request in the 2nd case.
>>>
>>> While I agree it's needed to run before leaving guest context (i.e leaving
>>> the inner loop), I am not really sure vcpu_put is the best place to put the
>>> flushing. I may be wrong, but for me it looks more like of a place to save
>>> registers and context, as well as dropping refcounts or something like
>>> that. I would not expect a flush happening in vcpu_put, if I was reading
>>> the code.
>>>
>>> Would it be too bad if we had it into a call before vcpu_put, at
>>> kvm_arch_vcpu_ioctl_run()?
>>>
>> Thanks for the clarification. After looking again at the code paths, I agree
>> that
>>
>> kvm_vcpu_put_vhe() and kvm_arch_vcpu_put() are really meant to be pure
>> save/restore
>>
>> paths, so embedding HDBSS flushing there isn't ideal.
>>
>>
>> My remaining concern is that kvm_arch_vcpu_ioctl_run() doesn't cover the
>> case where
>>
>> the vCPU is scheduled out. In that case we still leave guest context, but we
>> don't return
>>
>> through the run loop, so a flush placed only in ioctl_run() wouldn't run.
>>
>>
>> Any suggestions on where this should hook in?
> You mention that on vcpu_put it works, right? maybe it's worth to track
> down which vcpu_put users would make sense to flush before it's calling.
>
> I found that vcpu_put is called only in the vcpu_run, but it's
> arch-specific version is called in:
>
> kvm_debug_handle_oslar : put and load in sequence, flush shouldnt be needed
> kvm_emulate_nested_eret : same as above, not sure if nested will be supported
> kvm_inject_nested : same as above
> kvm_reset_vcpu : put and load in sequence, not needed [1]
> kvm_sched_out : that's ran on sched-out, that makes sense for us
> vcpu_put : called only by vcpu_run, where we already planned to use
>
> Which brings us other benefit of not having that in vcpu_put: the flush
> could be happening on functions that should originally ran fast, and having
> it outside vcpu_put allows us to decide if we need it.
>
> So, having the flush happening in kvm_arch_vcpu_ioctl_run() and
> kvm_sched_out() should be enough, on top of the per-request you
> mentioned before.
>
> [1]: the vcpu_reset: not sure how often this does happen, and if it would
> be interesting flushing here as well. It seems to be called on init, so not
> the case where would be something to flush, and from a case where it's
> restoring the registers. So I think it should be safe to not flush here,
> so it should be a question of 'maybe being interesting', which I am not
> sure.


Got it, makes sense.


HDBSS doesn't apply to nested or other complex cases, so we only need to 
flush in

kvm_arch_vcpu_ioctl_run() and kvm_sched_out().


I'll update the code accordingly and test to make sure we haven't missed 
any edge cases.


Thanks!

Tian

>> Would introducing a small
>> arch‑specific
>>
>> "guest exit" helper, invoked fromm kvm_arch_vcpu_put(), be acceptable?
>>
> IIUC that would be the same as the previous one: we would have the flush
> happening inside a function that is supposed to save registers.
>
> Thanks!
> Leo
>
>
>> Thanks!
>>
>> Tian
>>
>>
>>>>>> +
>>>>>>     		check_nested_vcpu_requests(vcpu);
>>>>>>     	}
>>>>>>
>>>>>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>>>>>
>>>>>>     void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>>>>>>     {
>>>>>> +	/*
>>>>>> +	 * Flush all CPUs' dirty log buffers to the dirty_bitmap.  Called
>>>>>> +	 * before reporting dirty_bitmap to userspace. Send a request with
>>>>>> +	 * KVM_REQUEST_WAIT to flush buffer synchronously.
>>>>>> +	 */
>>>>>> +	struct kvm_vcpu *vcpu;
>>>>>> +
>>>>>> +	if (!kvm->arch.enable_hdbss)
>>>>>> +		return;
>>>>>>
>>>>>> +	kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
>>>>>>     }
>>>>>>
>>>>>>     static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>>>>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> index 9db3f11a4754..600cbc4f8ae9 100644
>>>>>> --- a/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
>>>>>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
>>>>>>     	local_irq_restore(flags);
>>>>>>     }
>>>>>>
>>>>>> +static void __load_hdbss(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> +	struct kvm *kvm = vcpu->kvm;
>>>>>> +	u64 br_el2, prod_el2;
>>>>>> +
>>>>>> +	if (!kvm->arch.enable_hdbss)
>>>>>> +		return;
>>>>>> +
>>>>>> +	br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>>>> +	prod_el2 = vcpu->arch.hdbss.next_index;
>>>>>> +
>>>>>> +	write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
>>>>>> +	write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
>>>>>> +
>>>>>> +	isb();
>>>>>> +}
>>>>>> +
>>>>>>     void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>>>>     {
>>>>>>     	host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
>>>>>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
>>>>>>     	__vcpu_load_switch_sysregs(vcpu);
>>>>>>     	__vcpu_load_activate_traps(vcpu);
>>>>>>     	__load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
>>>>>> +	__load_hdbss(vcpu);
>>>>>>     }
>>>>>>
>>>>>>     void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
>>>>>>     {
>>>>>> +	kvm_flush_hdbss_buffer(vcpu);
>>>>>>     	__vcpu_put_deactivate_traps(vcpu);
>>>>>>     	__vcpu_put_switch_sysregs(vcpu);
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>> index 070a01e53fcb..42b0710a16ce 100644
>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>>     	if (writable)
>>>>>>     		prot |= KVM_PGTABLE_PROT_W;
>>>>>>
>>>>>> +	if (writable && kvm->arch.enable_hdbss && logging_active)
>>>>>> +		prot |= KVM_PGTABLE_PROT_DBM;
>>>>>> +
>>>>>>     	if (exec_fault)
>>>>>>     		prot |= KVM_PGTABLE_PROT_X;
>>>>>>
>>>>>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>>>>>>     	return 0;
>>>>>>     }
>>>>>>
>>>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> +	int idx, curr_idx;
>>>>>> +	u64 br_el2;
>>>>>> +	u64 *hdbss_buf;
>>>>>> +	struct kvm *kvm = vcpu->kvm;
>>>>>> +
>>>>>> +	if (!kvm->arch.enable_hdbss)
>>>>>> +		return;
>>>>>> +
>>>>>> +	curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
>>>>>> +	br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
>>>>>> +
>>>>>> +	/* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
>>>>>> +	if (curr_idx == 0 || br_el2 == 0)
>>>>>> +		return;
>>>>>> +
>>>>>> +	hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
>>>>>> +	if (!hdbss_buf)
>>>>>> +		return;
>>>>>> +
>>>>>> +	guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
>>>>>> +	for (idx = 0; idx < curr_idx; idx++) {
>>>>>> +		u64 gpa;
>>>>>> +
>>>>>> +		gpa = hdbss_buf[idx];
>>>>>> +		if (!(gpa & HDBSS_ENTRY_VALID))
>>>>>> +			continue;
>>>>>> +
>>>>>> +		gpa &= HDBSS_ENTRY_IPA;
>>>>>> +		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>>>>>> +	}
>>>>> This will mark a page dirty for both dirty_bitmap or dirty_ring, depending
>>>>> on what is in use.
>>>>>
>>>>> Out of plain curiosity, have you planned / tested for the dirty-ring as
>>>>> well, or just for dirty-bitmap?
>>>> Currently, I have only tested this with dirty-bitmap mode.
>>>>
>>>>
>>>> I will test and ensure the HDBSS feature works correctly with dirty-ring in
>>>> the next version.
>>>>
>>>>
>>> Thanks!
>>>
>>>>>> +
>>>>>> +	/* reset HDBSS index */
>>>>>> +	write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>>>>>> +	vcpu->arch.hdbss.next_index = 0;
>>>>>> +	isb();
>>>>>> +}
>>>>>> +
>>>>>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> +	u64 prod;
>>>>>> +	u64 fsc;
>>>>>> +
>>>>>> +	prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>>>>>> +	fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>>>>>> +
>>>>>> +	switch (fsc) {
>>>>>> +	case HDBSSPROD_EL2_FSC_OK:
>>>>>> +		/* Buffer full, which is reported as permission fault. */
>>>>>> +		kvm_flush_hdbss_buffer(vcpu);
>>>>>> +		return 1;
>>>>> Humm, flushing in a fault handler means hanging there, in IRQ context, for
>>>>> a while.
>>>>>
>>>>> Since we already deal with this on guest_exit (vcpu_put IIUC), why not just
>>>>> return in a way the vcpu has to exit the inner loop and let it flush there
>>>>> instead?
>>>>>
>>>>> Thanks!
>>>>> Leo
>>>> Thanks for the feedback.
>>>>
>>>>
>>>> If we flush on every guest exit (by moving the flush before handle_exit,
>>>> then we can
>>>>
>>>> indeed drop the flush from the fault handler and from vcpu_put.
>>>>
>>>>
>>>> However, given Marc's earlier concern about not imposing this overhead on
>>>> all vCPUs,
>>>>
>>>> I'd rather avoid flushing on every exit.
>>>>
>>>>
>>>> My current plan is to set a request bit in kvm_handle_hdbss_fault (via
>>>> kvm_make_request),
>>>>
>>>> and move the actual flush to the normal exit path, where it can execute in a
>>>> safe context.
>>>>
>>>> This also allows us to remove the flush from the fault handler entirely.
>>>>
>>>>
>>>> Does that approach sound reasonable to you?
>>>>
>>>>
>>> Yes, I think it looks much better, as the fault will cause guest to exit,
>>> and it can run the flush on it's way back in.
>>>
>>> Thanks!
>>> Leo
>>>
>>>>>> +	case HDBSSPROD_EL2_FSC_ExternalAbort:
>>>>>> +	case HDBSSPROD_EL2_FSC_GPF:
>>>>>> +		return -EFAULT;
>>>>>> +	default:
>>>>>> +		/* Unknown fault. */
>>>>>> +		WARN_ONCE(1,
>>>>>> +				"Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
>>>>>> +				fsc, prod, vcpu->vcpu_id);
>>>>>> +		return -EFAULT;
>>>>>> +	}
>>>>>> +}
>>>>>> +
>>>>>>     /**
>>>>>>      * kvm_handle_guest_abort - handles all 2nd stage aborts
>>>>>>      * @vcpu:	the VCPU pointer
>>>>>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>>>>
>>>>>>     	is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>>>>>
>>>>>> +	if (esr_iss2_is_hdbssf(esr))
>>>>>> +		return kvm_handle_hdbss_fault(vcpu);
>>>>>> +
>>>>>>     	if (esr_fsc_is_translation_fault(esr)) {
>>>>>>     		/* Beyond sanitised PARange (which is the IPA limit) */
>>>>>>     		if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
>>>>>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>>>>>> index 959532422d3a..c03a4b310b53 100644
>>>>>> --- a/arch/arm64/kvm/reset.c
>>>>>> +++ b/arch/arm64/kvm/reset.c
>>>>>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>>>>>>     	free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>>>>>>     	kfree(vcpu->arch.vncr_tlb);
>>>>>>     	kfree(vcpu->arch.ccsidr);
>>>>>> +
>>>>>> +	if (vcpu->kvm->arch.enable_hdbss)
>>>>>> +		kvm_arm_vcpu_free_hdbss(vcpu);
>>>>>>     }
>>>>>>
>>>>>>     static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>>>>>> --
>>>>>> 2.33.0
>>>> Thanks!
>>>>
>>>> Tian
>>>>