From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <31973faa-746d-41b5-9b9b-6e459564308d@huawei.com>
Date: Thu, 12 Mar 2026 14:17:41 +0800
X-Mailing-List: linux-doc@vger.kernel.org
Subject: Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
From: Tian Zheng
To: Leonardo Bras
References: <20260225040421.2683931-1-zhengtian10@huawei.com>
 <20260225040421.2683931-5-zhengtian10@huawei.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed

On 3/6/2026 11:01 PM, Leonardo Bras wrote:
> On Fri, Mar 06, 2026 at 05:27:58PM +0800, Tian Zheng wrote:
>> Hi Leo,
>>
>> On 3/4/2026 11:40 PM, Leonardo Bras wrote:
>>> Hi Tian,
>>>
>>> A few extra notes/questions below.
>>>
>>> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
>>>> From: eillon
>>>>
>>>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the
>>>> start of migration. This feature is only supported in VHE mode.
>>>>
>>>> Initially, S2 PTEs don't contain the DBM attribute.
During migration, >>>> write faults are handled by user_mem_abort, which relaxes permissions >>>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent >>>> writes no longer trap, as the hardware automatically transitions the page >>>> from writable-clean to writable-dirty. >>>> >>>> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is >>>> enabled, the hardware observes the clean->dirty transition and records >>>> the corresponding page into the HDBSS buffer. >>>> >>>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring >>>> that check_vcpu_requests flushes the HDBSS buffer and propagates the >>>> accumulated dirty information into the userspace-visible dirty bitmap. >>>> >>>> Add fault handling for HDBSS including buffer full, external abort, and >>>> general protection fault (GPF). >>>> >>>> Signed-off-by: eillon >>>> Signed-off-by: Tian Zheng >>>> --- >>>> arch/arm64/include/asm/esr.h | 5 ++ >>>> arch/arm64/include/asm/kvm_host.h | 17 +++++ >>>> arch/arm64/include/asm/kvm_mmu.h | 1 + >>>> arch/arm64/include/asm/sysreg.h | 11 ++++ >>>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++ >>>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++ >>>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++ >>>> arch/arm64/kvm/reset.c | 3 + >>>> 8 files changed, 228 insertions(+) >>>> >>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h >>>> index 81c17320a588..2e6b679b5908 100644 >>>> --- a/arch/arm64/include/asm/esr.h >>>> +++ b/arch/arm64/include/asm/esr.h >>>> @@ -437,6 +437,11 @@ >>>> #ifndef __ASSEMBLER__ >>>> #include >>>> >>>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr) >>>> +{ >>>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF; >>>> +} >>>> + >>>> static inline unsigned long esr_brk_comment(unsigned long esr) >>>> { >>>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK; >>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h >>>> index 
5d5a3bbdb95e..57ee6b53e061 100644 >>>> --- a/arch/arm64/include/asm/kvm_host.h >>>> +++ b/arch/arm64/include/asm/kvm_host.h >>>> @@ -55,12 +55,17 @@ >>>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9) >>>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10) >>>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11) >>>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12) >>>> >>>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ >>>> KVM_DIRTY_LOG_INITIALLY_SET) >>>> >>>> #define KVM_HAVE_MMU_RWLOCK >>>> >>>> +/* HDBSS entry field definitions */ >>>> +#define HDBSS_ENTRY_VALID BIT(0) >>>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12) >>>> + >>>> /* >>>> * Mode of operation configurable with kvm-arm.mode early param. >>>> * See Documentation/admin-guide/kernel-parameters.txt for more information. >>>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void); >>>> u32 __attribute_const__ kvm_target_cpu(void); >>>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu); >>>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu); >>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu); >>>> >>>> struct kvm_hyp_memcache { >>>> phys_addr_t head; >>>> @@ -405,6 +411,8 @@ struct kvm_arch { >>>> * the associated pKVM instance in the hypervisor. >>>> */ >>>> struct kvm_protected_vm pkvm; >>>> + >>>> + bool enable_hdbss; >>>> }; >>>> >>>> struct kvm_vcpu_fault_info { >>>> @@ -816,6 +824,12 @@ struct vcpu_reset_state { >>>> bool reset; >>>> }; >>>> >>>> +struct vcpu_hdbss_state { >>>> + phys_addr_t base_phys; >>>> + u32 size; >>>> + u32 next_index; >>>> +}; >>>> + >>> IIUC this is used once both on enable/disable and massively on >>> vcpu_put/get. >>> >>> What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead? >>> That way we avoid having masking operations in put/get as well as any >>> possible error we may have formatting those. >>> >>> The cost is doing those operations once for enable and once for disable, >>> which should be fine. 
> Hi Tian,
>
>> Thanks for the suggestion. I actually started with storing the raw
>> system register values, as you proposed.
>>
>> However, after discussing it with Oliver Upton in v1, we felt that
>> keeping the base address, size, and index as separate fields makes the
>> state easier to understand.
>>
>> Discussion link: https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/
>>
>> That's why I ended up changing the storage approach in the end.
>>
> Hmm, FWIW I disagree with the above argument.
> I would argue that vcpu_put should save the registers, and not
> actually know what they are about or how they are formatted at this
> point.
>
> The responsibility of understanding its fields and usage should be in
> the code that actually uses it.
>
> IIUC, kvm_vcpu_put_vhe and kvm_vcpu_load_vhe already call other
> functions rather than only saving the register as it is.

OK, thanks! I'll update the struct to store only the raw register values.
>>>> struct vncr_tlb; >>>> >>>> struct kvm_vcpu_arch { >>>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch { >>>> >>>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */ >>>> struct vncr_tlb *vncr_tlb; >>>> + >>>> + /* HDBSS registers info */ >>>> + struct vcpu_hdbss_state hdbss; >>>> }; >>>> >>>> /* >>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h >>>> index d968aca0461a..3fea8cfe8869 100644 >>>> --- a/arch/arm64/include/asm/kvm_mmu.h >>>> +++ b/arch/arm64/include/asm/kvm_mmu.h >>>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, >>>> >>>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu); >>>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu); >>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu); >>>> >>>> phys_addr_t kvm_mmu_get_httbr(void); >>>> phys_addr_t kvm_get_idmap_vector(void); >>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h >>>> index f4436ecc630c..d11f4d0dd4e7 100644 >>>> --- a/arch/arm64/include/asm/sysreg.h >>>> +++ b/arch/arm64/include/asm/sysreg.h >>>> @@ -1039,6 +1039,17 @@ >>>> >>>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \ >>>> GCS_CAP_VALID_TOKEN) >>>> + >>>> +/* >>>> + * Definitions for the HDBSS feature >>>> + */ >>>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB >>>> + >>>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \ >>>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz)) >>>> + >>>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod) >>>> + >>>> /* >>>> * Definitions for GICv5 instructions >>>> */ >>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c >>>> index 29f0326f7e00..d64da05e25c4 100644 >>>> --- a/arch/arm64/kvm/arm.c >>>> +++ b/arch/arm64/kvm/arm.c >>>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu) >>>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE; >>>> } >>>> >>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu) >>>> +{ 
>>>> + struct page *hdbss_pg; >>>> + >>>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys); >>>> + if (hdbss_pg) >>>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size); >>>> + >>>> + vcpu->arch.hdbss.size = 0; >>>> +} >>>> + >>>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm, >>>> + struct kvm_enable_cap *cap) >>>> +{ >>>> + unsigned long i; >>>> + struct kvm_vcpu *vcpu; >>>> + struct page *hdbss_pg = NULL; >>>> + __u64 size = cap->args[0]; >>>> + bool enable = cap->args[1] ? true : false; >>>> + >>>> + if (!system_supports_hdbss()) >>>> + return -EINVAL; >>>> + >>>> + if (size > HDBSS_MAX_SIZE) >>>> + return -EINVAL; >>>> + >>>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */ >>>> + return 0; >>>> + >>>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */ >>>> + return -EINVAL; >>>> + >>>> + if (!enable) { /* Turn it off */ >>>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA); >>>> + >>>> + kvm_for_each_vcpu(i, vcpu, kvm) { >>>> + /* Kick vcpus to flush hdbss buffer. */ >>>> + kvm_vcpu_kick(vcpu); >>>> + >>>> + kvm_arm_vcpu_free_hdbss(vcpu); >>>> + } >>>> + >>>> + kvm->arch.enable_hdbss = false; >>>> + >>>> + return 0; >>>> + } >>>> + >>>> + /* Turn it on */ >>>> + kvm_for_each_vcpu(i, vcpu, kvm) { >>>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size); >>>> + if (!hdbss_pg) >>>> + goto error_alloc; >>>> + >>>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) { >>>> + .base_phys = page_to_phys(hdbss_pg), >>>> + .size = size, >>>> + .next_index = 0, >>>> + }; >>>> + } >>>> + >>>> + kvm->arch.enable_hdbss = true; >>>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA; >>>> + >>>> + /* >>>> + * We should kick vcpus out of guest mode here to load new >>>> + * vtcr value to vtcr_el2 register when re-enter guest mode. 
>>>> +	 */
>>>> +	kvm_for_each_vcpu(i, vcpu, kvm)
>>>> +		kvm_vcpu_kick(vcpu);
>>>> +
>>>> +	return 0;
>>>> +
>>>> +error_alloc:
>>>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> +		if (vcpu->arch.hdbss.base_phys)
>>>> +			kvm_arm_vcpu_free_hdbss(vcpu);
>>>> +	}
>>>> +
>>>> +	return -ENOMEM;
>>>> +}
>>>> +
>>>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>> 			    struct kvm_enable_cap *cap)
>>>> {
>>>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>> 		r = 0;
>>>> 		set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
>>>> 		break;
>>>> +	case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>> +		mutex_lock(&kvm->lock);
>>>> +		r = kvm_cap_arm_enable_hdbss(kvm, cap);
>>>> +		mutex_unlock(&kvm->lock);
>>>> +		break;
>>>> 	default:
>>>> 		break;
>>>> 	}
>>>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>> 		r = kvm_supports_cacheable_pfnmap();
>>>> 		break;
>>>>
>>>> +	case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
>>>> +		r = system_supports_hdbss();
>>>> +		break;
>>>> 	default:
>>>> 		r = 0;
>>>> 	}
>>>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>> 		if (kvm_dirty_ring_check_request(vcpu))
>>>> 			return 0;
>>>>
>>>> +		if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
>>>> +			kvm_flush_hdbss_buffer(vcpu);
>>> I am curious why we need a flush-hdbss request.
>>> Don't we have the flush function happening every time we run vcpu_put?
>>>
>>> Oh, I see, you want to check if there is anything needed inside the
>>> inner loop of vcpu_run, without having to vcpu_put. I think it makes
>>> sense.
>>>
>>> But instead of having this on guest entry, doesn't it make more sense
>>> to have it on guest exit? That way we flush every time (if needed) we
>>> exit the guest, and instead of having a vcpu request, we just require
>>> a vcpu kick and it should flush if needed.
>>>
>>> Maybe have vcpu_put just save the registers, and add the flush before
>>> handle_exit.
>>>
>>> What do you think?
>>
>> Thank you for the feedback.
>>
>> Indeed, in the initial version (v1), I placed the flush operation
>> inside handle_exit and used a vcpu_kick in kvm_arch_sync_dirty_log to
>> trigger the flush of the HDBSS buffer.
>>
>> However, during the review, Marc pointed out that calling this
>> function on every exit is too frequent if it's not always needed.
>>
>> Discussion link: https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/
>>
>> I agreed with his assessment. Therefore, in the current version, I've
>> separated the flush operation into more specific and less frequent
>> points:
>>
>> 1. In vcpu_put.
>> 2. During dirty log synchronization, by kicking the vCPU to trigger a
>>    request that flushes on its next exit.
>> 3. When handling a specific HDBSSF event.
>>
>> This ensures the flush happens only when necessary, avoiding the
>> overhead of doing it on every guest exit.
>>
> Fair enough, calling it every time you go in the inner loop may be too
> much, even with a check to make sure it needs to run.
>
> Having it as a request means you may sometimes do the flush without
> leaving the inner loop. That could be useful if you want to use it with
> the IRQ handler to deal with a full buffer or any error, as well as
> dealing with a regular request in the second case.
>
> While I agree it needs to run before leaving guest context (i.e.
> leaving the inner loop), I am not really sure vcpu_put is the best
> place to put the flushing. I may be wrong, but to me it looks more like
> a place to save registers and context, as well as dropping refcounts or
> something like that. I would not expect a flush to happen in vcpu_put
> if I was reading the code.
>
> Would it be too bad if we had it as a call before vcpu_put, in
> kvm_arch_vcpu_ioctl_run()?
>
Thanks for the clarification.
After looking again at the code paths, I agree that kvm_vcpu_put_vhe()
and kvm_arch_vcpu_put() are really meant to be pure save/restore paths,
so embedding HDBSS flushing there isn't ideal.

My remaining concern is that kvm_arch_vcpu_ioctl_run() doesn't cover the
case where the vCPU is scheduled out. In that case we still leave guest
context, but we don't return through the run loop, so a flush placed
only in ioctl_run() wouldn't run.

Any suggestions on where this should hook in? Would introducing a small
arch-specific "guest exit" helper, invoked from kvm_arch_vcpu_put(), be
acceptable?

Thanks!
Tian

>
>>>> +
>>>> 	check_nested_vcpu_requests(vcpu);
>>>> }
>>>>
>>>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
>>>>
>>>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>>>> {
>>>> +	/*
>>>> +	 * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
>>>> +	 * before reporting dirty_bitmap to userspace. Send a request with
>>>> +	 * KVM_REQUEST_WAIT to flush buffer synchronously.
>>>> + */ >>>> + struct kvm_vcpu *vcpu; >>>> + >>>> + if (!kvm->arch.enable_hdbss) >>>> + return; >>>> >>>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS); >>>> } >>>> >>>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm, >>>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c >>>> index 9db3f11a4754..600cbc4f8ae9 100644 >>>> --- a/arch/arm64/kvm/hyp/vhe/switch.c >>>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c >>>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu) >>>> local_irq_restore(flags); >>>> } >>>> >>>> +static void __load_hdbss(struct kvm_vcpu *vcpu) >>>> +{ >>>> + struct kvm *kvm = vcpu->kvm; >>>> + u64 br_el2, prod_el2; >>>> + >>>> + if (!kvm->arch.enable_hdbss) >>>> + return; >>>> + >>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size); >>>> + prod_el2 = vcpu->arch.hdbss.next_index; >>>> + >>>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2); >>>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2); >>>> + >>>> + isb(); >>>> +} >>>> + >>>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu) >>>> { >>>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu; >>>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu) >>>> __vcpu_load_switch_sysregs(vcpu); >>>> __vcpu_load_activate_traps(vcpu); >>>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch); >>>> + __load_hdbss(vcpu); >>>> } >>>> >>>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu) >>>> { >>>> + kvm_flush_hdbss_buffer(vcpu); >>>> __vcpu_put_deactivate_traps(vcpu); >>>> __vcpu_put_switch_sysregs(vcpu); >>>> >>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>> index 070a01e53fcb..42b0710a16ce 100644 >>>> --- a/arch/arm64/kvm/mmu.c >>>> +++ b/arch/arm64/kvm/mmu.c >>>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> if (writable) >>>> prot |= KVM_PGTABLE_PROT_W; >>>> >>>> + if (writable && kvm->arch.enable_hdbss && logging_active) >>>> + prot |= 
KVM_PGTABLE_PROT_DBM; >>>> + >>>> if (exec_fault) >>>> prot |= KVM_PGTABLE_PROT_X; >>>> >>>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu) >>>> return 0; >>>> } >>>> >>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu) >>>> +{ >>>> + int idx, curr_idx; >>>> + u64 br_el2; >>>> + u64 *hdbss_buf; >>>> + struct kvm *kvm = vcpu->kvm; >>>> + >>>> + if (!kvm->arch.enable_hdbss) >>>> + return; >>>> + >>>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2)); >>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size); >>>> + >>>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */ >>>> + if (curr_idx == 0 || br_el2 == 0) >>>> + return; >>>> + >>>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys)); >>>> + if (!hdbss_buf) >>>> + return; >>>> + >>>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock); >>>> + for (idx = 0; idx < curr_idx; idx++) { >>>> + u64 gpa; >>>> + >>>> + gpa = hdbss_buf[idx]; >>>> + if (!(gpa & HDBSS_ENTRY_VALID)) >>>> + continue; >>>> + >>>> + gpa &= HDBSS_ENTRY_IPA; >>>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT); >>>> + } >>> This will mark a page dirty for both dirty_bitmap or dirty_ring, depending >>> on what is in use. >>> >>> Out of plain curiosity, have you planned / tested for the dirty-ring as >>> well, or just for dirty-bitmap? >> >> Currently, I have only tested this with dirty-bitmap mode. >> >> >> I will test and ensure the HDBSS feature works correctly with dirty-ring in >> the next version. >> >> > Thanks! 
>
>>>> +
>>>> +	/* reset HDBSS index */
>>>> +	write_sysreg_s(0, SYS_HDBSSPROD_EL2);
>>>> +	vcpu->arch.hdbss.next_index = 0;
>>>> +	isb();
>>>> +}
>>>> +
>>>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	u64 prod;
>>>> +	u64 fsc;
>>>> +
>>>> +	prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
>>>> +	fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
>>>> +
>>>> +	switch (fsc) {
>>>> +	case HDBSSPROD_EL2_FSC_OK:
>>>> +		/* Buffer full, which is reported as permission fault. */
>>>> +		kvm_flush_hdbss_buffer(vcpu);
>>>> +		return 1;
>>> Hmm, flushing in a fault handler means hanging there, in IRQ context,
>>> for a while.
>>>
>>> Since we already deal with this on guest_exit (vcpu_put IIUC), why not
>>> just return in a way the vcpu has to exit the inner loop and let it
>>> flush there instead?
>>>
>>> Thanks!
>>> Leo
>>
>> Thanks for the feedback.
>>
>> If we flush on every guest exit (by moving the flush before
>> handle_exit), then we can indeed drop the flush from the fault handler
>> and from vcpu_put.
>>
>> However, given Marc's earlier concern about not imposing this overhead
>> on all vCPUs, I'd rather avoid flushing on every exit.
>>
>> My current plan is to set a request bit in kvm_handle_hdbss_fault (via
>> kvm_make_request), and move the actual flush to the normal exit path,
>> where it can execute in a safe context. This also allows us to remove
>> the flush from the fault handler entirely.
>>
>> Does that approach sound reasonable to you?
>>
> Yes, I think it looks much better, as the fault will cause the guest to
> exit, and it can run the flush on its way back in.
>
> Thanks!
> Leo
>
>>>> +	case HDBSSPROD_EL2_FSC_ExternalAbort:
>>>> +	case HDBSSPROD_EL2_FSC_GPF:
>>>> +		return -EFAULT;
>>>> +	default:
>>>> +		/* Unknown fault.
*/ >>>> + WARN_ONCE(1, >>>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n", >>>> + fsc, prod, vcpu->vcpu_id); >>>> + return -EFAULT; >>>> + } >>>> +} >>>> + >>>> /** >>>> * kvm_handle_guest_abort - handles all 2nd stage aborts >>>> * @vcpu: the VCPU pointer >>>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) >>>> >>>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu); >>>> >>>> + if (esr_iss2_is_hdbssf(esr)) >>>> + return kvm_handle_hdbss_fault(vcpu); >>>> + >>>> if (esr_fsc_is_translation_fault(esr)) { >>>> /* Beyond sanitised PARange (which is the IPA limit) */ >>>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) { >>>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c >>>> index 959532422d3a..c03a4b310b53 100644 >>>> --- a/arch/arm64/kvm/reset.c >>>> +++ b/arch/arm64/kvm/reset.c >>>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu) >>>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array); >>>> kfree(vcpu->arch.vncr_tlb); >>>> kfree(vcpu->arch.ccsidr); >>>> + >>>> + if (vcpu->kvm->arch.enable_hdbss) >>>> + kvm_arm_vcpu_free_hdbss(vcpu); >>>> } >>>> >>>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu) >>>> -- >>>> 2.33.0 >> Thanks! >> >> Tian >>