From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 86204105A596 for ; Thu, 12 Mar 2026 13:14:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding: Content-Type:In-Reply-To:From:References:CC:To:Subject:MIME-Version:Date: Message-ID:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=ROHbddhlfnSyZntOoKBZcuS4wBpqWmyBoJb90KwaOAg=; b=FsMFv5dZ9Ny4jEMBmxU0LhaMC5 zwkenQk0h6otGtZAZ/Gtfv/UK1AaaXIRK8oBrd57XVwHZC/YwXEIbKhbTGBKxTd99j0/QZR/GPhus MMQLPxfm9eTonTWJuG4GVvQ5/k/BYzTSxxQJ/3WfSbiFtmpCi2Z6ftVYORmMEubNpL8fuTbzWya64 6MWIHfbc9FDEFibwRpul4kinCstXJhPHpqA0m0/p91S9olCJSOGae2KJhJ/bGbEJHCDhDGEQk4tde +vsmZtAP9rJt2daZI/5Woo00lU8V6HSDAxqizdyh3OeGd/KlzCNbGKZNgHJ1LJ6s206nmGQ0NaUCw 4MRbCYJA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.98.2 #2 (Red Hat Linux)) id 1w0fsB-0000000E7Jy-30Ni; Thu, 12 Mar 2026 13:14:43 +0000 Received: from casper.infradead.org ([2001:8b0:10b:1236::1]) by bombadil.infradead.org with esmtps (Exim 4.98.2 #2 (Red Hat Linux)) id 1w0fs9-0000000E7Jq-2DPx for linux-arm-kernel@bombadil.infradead.org; Thu, 12 Mar 2026 13:14:42 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:Content-Type: In-Reply-To:From:References:CC:To:Subject:MIME-Version:Date:Message-ID:Sender :Reply-To:Content-ID:Content-Description; bh=ROHbddhlfnSyZntOoKBZcuS4wBpqWmyBoJb90KwaOAg=; b=aH8OjQ8m/7+e2qPZVsaeUFNxOn 4RJ2uLtMjDaZnYx7v8p1jKstZTvELEzVTWLUKIK9GoXLMqF2JLkrWouaHj3FqoXN76+8Nv7fN4Ukm SZCvSNu3yLXIWukQEETIG0zT6BobCqCTWFyBJAmt25xDO+Bd/bz/uFZmR/LPRZUHQru6IW5jyc9Ye 5gzTa3+CmRtVUNjVCEzwjWASFXFx4Yl2/4xLmiDxTJIp00czTfoEhLMjP6JBjgdGOn0mTjOEowBwE YUaBXtKoQZCBX+ut54x8Gr9xH1vtfbOJtaQPF56tvrkrLDdc32iGg/y57H5Jnqo6/F4xi8kKI7hgX yKbxraVA==; Received: from canpmsgout05.his.huawei.com ([113.46.200.220]) by casper.infradead.org with esmtps (Exim 4.98.2 #2 (Red Hat Linux)) id 1w0fs5-0000000BMrM-2rPC for linux-arm-kernel@lists.infradead.org; Thu, 12 Mar 2026 13:14:40 +0000 dkim-signature: v=1; a=rsa-sha256; d=huawei.com; s=dkim; c=relaxed/relaxed; q=dns/txt; h=From; bh=ROHbddhlfnSyZntOoKBZcuS4wBpqWmyBoJb90KwaOAg=; b=X4BJVo2xYh3l6GLGIH2t1If3cH/0Jh06DnJy23c3J5e2mH/22Yps3M/+Z8DC1lfu8ECK/2sj6 QfUMy8nxG11psB3yWqk0eVLkA4BKqV0i4IcgVpMWksH34fQ+rXM1bftXlugTQ5u2JzTQeXsZNhC 97BTgXSVQM2Q1G2FiORigE0= Received: from mail.maildlp.com (unknown [172.19.163.104]) by canpmsgout05.his.huawei.com (SkyGuard) with ESMTPS id 4fWnwc0xXLz12LD4; Thu, 12 Mar 2026 21:08:00 +0800 (CST) Received: from kwepemr100010.china.huawei.com (unknown [7.202.195.125]) by mail.maildlp.com (Postfix) with ESMTPS id DA2574056A; Thu, 12 Mar 2026 21:13:34 +0800 (CST) Received: from [10.67.120.103] (10.67.120.103) by kwepemr100010.china.huawei.com (7.202.195.125) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.36; Thu, 12 Mar 2026 21:13:33 +0800 Message-ID: <46657927-5608-46f0-9b4c-159407f08880@huawei.com> Date: Thu, 12 Mar 2026 21:13:33 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Beta Subject: Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events To: Leonardo Bras CC: , , , , , , , , , , , , , , , , , , , References: <20260225040421.2683931-1-zhengtian10@huawei.com> <20260225040421.2683931-5-zhengtian10@huawei.com> <31973faa-746d-41b5-9b9b-6e459564308d@huawei.com> From: Tian Zheng In-Reply-To: Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit X-Originating-IP: [10.67.120.103] X-ClientProxiedBy: kwepems100002.china.huawei.com (7.221.188.206) To kwepemr100010.china.huawei.com (7.202.195.125) X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20260312_131438_343008_D4AE8F0F X-CRM114-Status: GOOD ( 26.79 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On 3/12/2026 8:06 PM, Leonardo Bras wrote: > On Thu, Mar 12, 2026 at 02:17:41PM +0800, Tian Zheng wrote: >> On 3/6/2026 11:01 PM, Leonardo Bras wrote: >>> On Fri, Mar 06, 2026 at 05:27:58PM +0800, Tian Zheng wrote: >>>> Hi Leo, >>>> >>>> On 3/4/2026 11:40 PM, Leonardo Bras wrote: >>>>> Hi Tian, >>>>> >>>>> Few extra notes/questions below >>>>> >>>>> On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote: >>>>>> From: eillon >>>>>> >>>>>> HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of >>>>>> migration. This feature is only supported in VHE mode. >>>>>> >>>>>> Initially, S2 PTEs doesn't contain the DBM attribute. During migration, >>>>>> write faults are handled by user_mem_abort, which relaxes permissions >>>>>> and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent >>>>>> writes no longer trap, as the hardware automatically transitions the page >>>>>> from writable-clean to writable-dirty. >>>>>> >>>>>> KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is >>>>>> enabled, the hardware observes the clean->dirty transition and records >>>>>> the corresponding page into the HDBSS buffer. >>>>>> >>>>>> During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring >>>>>> that check_vcpu_requests flushes the HDBSS buffer and propagates the >>>>>> accumulated dirty information into the userspace-visible dirty bitmap. >>>>>> >>>>>> Add fault handling for HDBSS including buffer full, external abort, and >>>>>> general protection fault (GPF). >>>>>> >>>>>> Signed-off-by: eillon >>>>>> Signed-off-by: Tian Zheng >>>>>> --- >>>>>> arch/arm64/include/asm/esr.h | 5 ++ >>>>>> arch/arm64/include/asm/kvm_host.h | 17 +++++ >>>>>> arch/arm64/include/asm/kvm_mmu.h | 1 + >>>>>> arch/arm64/include/asm/sysreg.h | 11 ++++ >>>>>> arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++ >>>>>> arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++ >>>>>> arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++ >>>>>> arch/arm64/kvm/reset.c | 3 + >>>>>> 8 files changed, 228 insertions(+) >>>>>> >>>>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h >>>>>> index 81c17320a588..2e6b679b5908 100644 >>>>>> --- a/arch/arm64/include/asm/esr.h >>>>>> +++ b/arch/arm64/include/asm/esr.h >>>>>> @@ -437,6 +437,11 @@ >>>>>> #ifndef __ASSEMBLER__ >>>>>> #include >>>>>> >>>>>> +static inline bool esr_iss2_is_hdbssf(unsigned long esr) >>>>>> +{ >>>>>> + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF; >>>>>> +} >>>>>> + >>>>>> static inline unsigned long esr_brk_comment(unsigned long esr) >>>>>> { >>>>>> return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK; >>>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h >>>>>> index 5d5a3bbdb95e..57ee6b53e061 100644 >>>>>> --- a/arch/arm64/include/asm/kvm_host.h >>>>>> +++ b/arch/arm64/include/asm/kvm_host.h >>>>>> @@ -55,12 +55,17 @@ >>>>>> #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9) >>>>>> #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10) >>>>>> #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11) >>>>>> +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12) >>>>>> >>>>>> #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ >>>>>> KVM_DIRTY_LOG_INITIALLY_SET) >>>>>> >>>>>> #define KVM_HAVE_MMU_RWLOCK >>>>>> >>>>>> +/* HDBSS entry field definitions */ >>>>>> +#define HDBSS_ENTRY_VALID BIT(0) >>>>>> +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12) >>>>>> + >>>>>> /* >>>>>> * Mode of operation configurable with kvm-arm.mode early param. >>>>>> * See Documentation/admin-guide/kernel-parameters.txt for more information. >>>>>> @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void); >>>>>> u32 __attribute_const__ kvm_target_cpu(void); >>>>>> void kvm_reset_vcpu(struct kvm_vcpu *vcpu); >>>>>> void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu); >>>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu); >>>>>> >>>>>> struct kvm_hyp_memcache { >>>>>> phys_addr_t head; >>>>>> @@ -405,6 +411,8 @@ struct kvm_arch { >>>>>> * the associated pKVM instance in the hypervisor. >>>>>> */ >>>>>> struct kvm_protected_vm pkvm; >>>>>> + >>>>>> + bool enable_hdbss; >>>>>> }; >>>>>> >>>>>> struct kvm_vcpu_fault_info { >>>>>> @@ -816,6 +824,12 @@ struct vcpu_reset_state { >>>>>> bool reset; >>>>>> }; >>>>>> >>>>>> +struct vcpu_hdbss_state { >>>>>> + phys_addr_t base_phys; >>>>>> + u32 size; >>>>>> + u32 next_index; >>>>>> +}; >>>>>> + >>>>> IIUC this is used once both on enable/disable and massively on >>>>> vcpu_put/get. >>>>> >>>>> What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead? >>>>> That way we avoid having masking operations in put/get as well as any >>>>> possible error we may have formatting those. >>>>> >>>>> The cost is doing those operations once for enable and once for disable, >>>>> which should be fine. >>> Hi Tian, >>> >>>> Thanks for the suggestion. I actually started with storing the raw system >>>> register >>>> >>>> values, as you proposed. >>>> >>>> >>>> However, after discussing it with Oliver Upton in v1, we felt that keeping >>>> the base address, >>>> >>>> size, and index as separate fields makes the state easier to understand. >>>> >>>> >>>> Discussion >>>> link:https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@linux.dev/ >>>> >>>> >>>> >>>> That's why I ended up changing the storage approach in the end. >>>> >>>> >>> Humm, FWIW I disagree with the above argument. >>> I would argue that vcpu_put should save the registers, and not >>> actually know what they are about or how are they formatted at this point. >>> >>> The responsibility of understanding it's fields and usage value should be >>> in the code that actually uses it. >>> >>> IIUC on kvm_vcpu_put_vhe and kvm_vcpu_load_vhe there are calls to other >>> functions than only save the register as it is. >> >> ok, thx! I'll update the struct to store only the raw register values. > Awesome :) > >> >>>>>> struct vncr_tlb; >>>>>> >>>>>> struct kvm_vcpu_arch { >>>>>> @@ -920,6 +934,9 @@ struct kvm_vcpu_arch { >>>>>> >>>>>> /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */ >>>>>> struct vncr_tlb *vncr_tlb; >>>>>> + >>>>>> + /* HDBSS registers info */ >>>>>> + struct vcpu_hdbss_state hdbss; >>>>>> }; >>>>>> >>>>>> /* >>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h >>>>>> index d968aca0461a..3fea8cfe8869 100644 >>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h >>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h >>>>>> @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, >>>>>> >>>>>> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu); >>>>>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu); >>>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu); >>>>>> >>>>>> phys_addr_t kvm_mmu_get_httbr(void); >>>>>> phys_addr_t kvm_get_idmap_vector(void); >>>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h >>>>>> index f4436ecc630c..d11f4d0dd4e7 100644 >>>>>> --- a/arch/arm64/include/asm/sysreg.h >>>>>> +++ b/arch/arm64/include/asm/sysreg.h >>>>>> @@ -1039,6 +1039,17 @@ >>>>>> >>>>>> #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \ >>>>>> GCS_CAP_VALID_TOKEN) >>>>>> + >>>>>> +/* >>>>>> + * Definitions for the HDBSS feature >>>>>> + */ >>>>>> +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB >>>>>> + >>>>>> +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \ >>>>>> + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz)) >>>>>> + >>>>>> +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod) >>>>>> + >>>>>> /* >>>>>> * Definitions for GICv5 instructions >>>>>> */ >>>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c >>>>>> index 29f0326f7e00..d64da05e25c4 100644 >>>>>> --- a/arch/arm64/kvm/arm.c >>>>>> +++ b/arch/arm64/kvm/arm.c >>>>>> @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu) >>>>>> return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE; >>>>>> } >>>>>> >>>>>> +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu) >>>>>> +{ >>>>>> + struct page *hdbss_pg; >>>>>> + >>>>>> + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys); >>>>>> + if (hdbss_pg) >>>>>> + __free_pages(hdbss_pg, vcpu->arch.hdbss.size); >>>>>> + >>>>>> + vcpu->arch.hdbss.size = 0; >>>>>> +} >>>>>> + >>>>>> +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm, >>>>>> + struct kvm_enable_cap *cap) >>>>>> +{ >>>>>> + unsigned long i; >>>>>> + struct kvm_vcpu *vcpu; >>>>>> + struct page *hdbss_pg = NULL; >>>>>> + __u64 size = cap->args[0]; >>>>>> + bool enable = cap->args[1] ? true : false; >>>>>> + >>>>>> + if (!system_supports_hdbss()) >>>>>> + return -EINVAL; >>>>>> + >>>>>> + if (size > HDBSS_MAX_SIZE) >>>>>> + return -EINVAL; >>>>>> + >>>>>> + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */ >>>>>> + return 0; >>>>>> + >>>>>> + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */ >>>>>> + return -EINVAL; >>>>>> + >>>>>> + if (!enable) { /* Turn it off */ >>>>>> + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA); >>>>>> + >>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) { >>>>>> + /* Kick vcpus to flush hdbss buffer. */ >>>>>> + kvm_vcpu_kick(vcpu); >>>>>> + >>>>>> + kvm_arm_vcpu_free_hdbss(vcpu); >>>>>> + } >>>>>> + >>>>>> + kvm->arch.enable_hdbss = false; >>>>>> + >>>>>> + return 0; >>>>>> + } >>>>>> + >>>>>> + /* Turn it on */ >>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) { >>>>>> + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size); >>>>>> + if (!hdbss_pg) >>>>>> + goto error_alloc; >>>>>> + >>>>>> + vcpu->arch.hdbss = (struct vcpu_hdbss_state) { >>>>>> + .base_phys = page_to_phys(hdbss_pg), >>>>>> + .size = size, >>>>>> + .next_index = 0, >>>>>> + }; >>>>>> + } >>>>>> + >>>>>> + kvm->arch.enable_hdbss = true; >>>>>> + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA; >>>>>> + >>>>>> + /* >>>>>> + * We should kick vcpus out of guest mode here to load new >>>>>> + * vtcr value to vtcr_el2 register when re-enter guest mode. >>>>>> + */ >>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) >>>>>> + kvm_vcpu_kick(vcpu); >>>>>> + >>>>>> + return 0; >>>>>> + >>>>>> +error_alloc: >>>>>> + kvm_for_each_vcpu(i, vcpu, kvm) { >>>>>> + if (vcpu->arch.hdbss.base_phys) >>>>>> + kvm_arm_vcpu_free_hdbss(vcpu); >>>>>> + } >>>>>> + >>>>>> + return -ENOMEM; >>>>>> +} >>>>>> + >>>>>> int kvm_vm_ioctl_enable_cap(struct kvm *kvm, >>>>>> struct kvm_enable_cap *cap) >>>>>> { >>>>>> @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, >>>>>> r = 0; >>>>>> set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags); >>>>>> break; >>>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK: >>>>>> + mutex_lock(&kvm->lock); >>>>>> + r = kvm_cap_arm_enable_hdbss(kvm, cap); >>>>>> + mutex_unlock(&kvm->lock); >>>>>> + break; >>>>>> default: >>>>>> break; >>>>>> } >>>>>> @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) >>>>>> r = kvm_supports_cacheable_pfnmap(); >>>>>> break; >>>>>> >>>>>> + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK: >>>>>> + r = system_supports_hdbss(); >>>>>> + break; >>>>>> default: >>>>>> r = 0; >>>>>> } >>>>>> @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) >>>>>> if (kvm_dirty_ring_check_request(vcpu)) >>>>>> return 0; >>>>>> >>>>>> + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu)) >>>>>> + kvm_flush_hdbss_buffer(vcpu); >>>>> I am curious on why we need a flush-hdbss request, >>>>> Don't we have the flush function happening every time we run vcpu_put? >>>>> >>>>> Oh, I see, you want to check if there is anything needed inside the inner >>>>> loop of vcpu_run, without having to vcpu_put. I think it makes sense. >>>>> >>>>> But instead of having this on guest entry, does not it make more sense to >>>>> have it in guest exit? This way we flush every time (if needed) we exit the >>>>> guest, and instead of having a vcpu request, we just require a vcpu kick >>>>> and it should flush if needed. >>>>> >>>>> Maybe have vcpu_put just save the registers, and add a the flush before >>>>> handle_exit. >>>>> >>>>> What do you think? >>>> Thank you for the feedback. >>>> >>>> >>>> Indeed, in the initial version (v1), I placed the flush operation inside >>>> handle_exit and >>>> >>>> used a vcpu_kick in kvm_arch_sync_dirty_log to trigger the flush of the >>>> HDBSS buffer. >>>> >>>> >>>> However, during the review, Marc pointed out that calling this function on >>>> every exit >>>> >>>> event is too frequent if it's not always needed. >>>> >>>> >>>> Discussion link: >>>> _https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@kernel.org/_ >>>> >>>> >>>> I agreed with his assessment. Therefore, in the current version, I've >>>> separated the flush >>>> >>>> operation into more specific and less frequent points: >>>> >>>> >>>> 1. In vcpu_put >>>> >>>> 2. During dirty log synchronization, by kicking the vCPU to trigger a >>>> request that flushes >>>> >>>> on its next exit. >>>> >>>> >>>> 3. When handling a specific HDBSSF event. >>>> >>>> >>>> This ensures the flush happens only when necessary, avoiding the overhead of >>>> doing it >>>> >>>> on every guest exit. >>>> >>> Fair enough, calling it every time you go in the inner loop may be too >>> much, even with a check to make sure it needs to run. >>> >>> Having it as a request means you may do that sometimes without >>> leaving the inner loop. That could be useful if you want to use it with the >>> IRQ handler to deal with full buffer, or any error, as well as dealing with >>> a regular request in the 2nd case. >>> >>> While I agree it's needed to run before leaving guest context (i.e leaving >>> the inner loop), I am not really sure vcpu_put is the best place to put the >>> flushing. I may be wrong, but for me it looks more like of a place to save >>> registers and context, as well as dropping refcounts or something like >>> that. I would not expect a flush happening in vcpu_put, if I was reading >>> the code. >>> >>> Would it be too bad if we had it into a call before vcpu_put, at >>> kvm_arch_vcpu_ioctl_run()? >>> >> Thanks for the clarification. After looking again at the code paths, I agree >> that >> >> kvm_vcpu_put_vhe() and kvm_arch_vcpu_put() are really meant to be pure >> save/restore >> >> paths, so embedding HDBSS flushing there isn't ideal. >> >> >> My remaining concern is that kvm_arch_vcpu_ioctl_run() doesn't cover the >> case where >> >> the vCPU is scheduled out. In that case we still leave guest context, but we >> don't return >> >> through the run loop, so a flush placed only in ioctl_run() wouldn't run. >> >> >> Any suggestions on where this should hook in? > You mention that on vcpu_put it works, right? maybe it's worth to track > down which vcpu_put users would make sense to flush before it's calling. > > I found that vcpu_put is called only in the vcpu_run, but it's > arch-specific version is called in: > > kvm_debug_handle_oslar : put and load in sequence, flush shouldnt be needed > kvm_emulate_nested_eret : same as above, not sure if nested will be supported > kvm_inject_nested : same as above > kvm_reset_vcpu : put and load in sequence, not needed [1] > kvm_sched_out : that's ran on sched-out, that makes sense for us > vcpu_put : called only by vcpu_run, where we already planned to use > > Which brings us other benefit of not having that in vcpu_put: the flush > could be happening on functions that should originally ran fast, and having > it outside vcpu_put allows us to decide if we need it. > > So, having the flush happening in kvm_arch_vcpu_ioctl_run() and > kvm_sched_out() should be enough, on top of the per-request you > mentioned before. > > [1]: the vcpu_reset: not sure how often this does happen, and if it would > be interesting flushing here as well. It seems to be called on init, so not > the case where would be something to flush, and from a case where it's > restoring the registers. So I think it should be safe to not flush here, > so it should be a question of 'maybe being interesting', which I am not > sure. Got it, makes sense. HDBSS doesn't apply to nested or other complex cases, so we only need to flush in kvm_arch_vcpu_ioctl_run() and kvm_sched_out(). I'll update the code accordingly and test to make sure we haven't missed any edge cases. Thanks! Tian >> Would introducing a small >> arch‑specific >> >> "guest exit" helper, invoked fromm kvm_arch_vcpu_put(), be acceptable? >> > IIUC that would be the same as the previous one: we would have the flush > happening inside a function that is supposed to save registers. > > Thanks! > Leo > > >> Thanks! >> >> Tian >> >> >>>>>> + >>>>>> check_nested_vcpu_requests(vcpu); >>>>>> } >>>>>> >>>>>> @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl, >>>>>> >>>>>> void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot) >>>>>> { >>>>>> + /* >>>>>> + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called >>>>>> + * before reporting dirty_bitmap to userspace. Send a request with >>>>>> + * KVM_REQUEST_WAIT to flush buffer synchronously. >>>>>> + */ >>>>>> + struct kvm_vcpu *vcpu; >>>>>> + >>>>>> + if (!kvm->arch.enable_hdbss) >>>>>> + return; >>>>>> >>>>>> + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS); >>>>>> } >>>>>> >>>>>> static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm, >>>>>> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c >>>>>> index 9db3f11a4754..600cbc4f8ae9 100644 >>>>>> --- a/arch/arm64/kvm/hyp/vhe/switch.c >>>>>> +++ b/arch/arm64/kvm/hyp/vhe/switch.c >>>>>> @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu) >>>>>> local_irq_restore(flags); >>>>>> } >>>>>> >>>>>> +static void __load_hdbss(struct kvm_vcpu *vcpu) >>>>>> +{ >>>>>> + struct kvm *kvm = vcpu->kvm; >>>>>> + u64 br_el2, prod_el2; >>>>>> + >>>>>> + if (!kvm->arch.enable_hdbss) >>>>>> + return; >>>>>> + >>>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size); >>>>>> + prod_el2 = vcpu->arch.hdbss.next_index; >>>>>> + >>>>>> + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2); >>>>>> + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2); >>>>>> + >>>>>> + isb(); >>>>>> +} >>>>>> + >>>>>> void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu) >>>>>> { >>>>>> host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu; >>>>>> @@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu) >>>>>> __vcpu_load_switch_sysregs(vcpu); >>>>>> __vcpu_load_activate_traps(vcpu); >>>>>> __load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch); >>>>>> + __load_hdbss(vcpu); >>>>>> } >>>>>> >>>>>> void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu) >>>>>> { >>>>>> + kvm_flush_hdbss_buffer(vcpu); >>>>>> __vcpu_put_deactivate_traps(vcpu); >>>>>> __vcpu_put_switch_sysregs(vcpu); >>>>>> >>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>>> index 070a01e53fcb..42b0710a16ce 100644 >>>>>> --- a/arch/arm64/kvm/mmu.c >>>>>> +++ b/arch/arm64/kvm/mmu.c >>>>>> @@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>>>> if (writable) >>>>>> prot |= KVM_PGTABLE_PROT_W; >>>>>> >>>>>> + if (writable && kvm->arch.enable_hdbss && logging_active) >>>>>> + prot |= KVM_PGTABLE_PROT_DBM; >>>>>> + >>>>>> if (exec_fault) >>>>>> prot |= KVM_PGTABLE_PROT_X; >>>>>> >>>>>> @@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu) >>>>>> return 0; >>>>>> } >>>>>> >>>>>> +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu) >>>>>> +{ >>>>>> + int idx, curr_idx; >>>>>> + u64 br_el2; >>>>>> + u64 *hdbss_buf; >>>>>> + struct kvm *kvm = vcpu->kvm; >>>>>> + >>>>>> + if (!kvm->arch.enable_hdbss) >>>>>> + return; >>>>>> + >>>>>> + curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2)); >>>>>> + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size); >>>>>> + >>>>>> + /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */ >>>>>> + if (curr_idx == 0 || br_el2 == 0) >>>>>> + return; >>>>>> + >>>>>> + hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys)); >>>>>> + if (!hdbss_buf) >>>>>> + return; >>>>>> + >>>>>> + guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock); >>>>>> + for (idx = 0; idx < curr_idx; idx++) { >>>>>> + u64 gpa; >>>>>> + >>>>>> + gpa = hdbss_buf[idx]; >>>>>> + if (!(gpa & HDBSS_ENTRY_VALID)) >>>>>> + continue; >>>>>> + >>>>>> + gpa &= HDBSS_ENTRY_IPA; >>>>>> + kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT); >>>>>> + } >>>>> This will mark a page dirty for both dirty_bitmap or dirty_ring, depending >>>>> on what is in use. >>>>> >>>>> Out of plain curiosity, have you planned / tested for the dirty-ring as >>>>> well, or just for dirty-bitmap? >>>> Currently, I have only tested this with dirty-bitmap mode. >>>> >>>> >>>> I will test and ensure the HDBSS feature works correctly with dirty-ring in >>>> the next version. >>>> >>>> >>> Thanks! >>> >>>>>> + >>>>>> + /* reset HDBSS index */ >>>>>> + write_sysreg_s(0, SYS_HDBSSPROD_EL2); >>>>>> + vcpu->arch.hdbss.next_index = 0; >>>>>> + isb(); >>>>>> +} >>>>>> + >>>>>> +static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu) >>>>>> +{ >>>>>> + u64 prod; >>>>>> + u64 fsc; >>>>>> + >>>>>> + prod = read_sysreg_s(SYS_HDBSSPROD_EL2); >>>>>> + fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod); >>>>>> + >>>>>> + switch (fsc) { >>>>>> + case HDBSSPROD_EL2_FSC_OK: >>>>>> + /* Buffer full, which is reported as permission fault. */ >>>>>> + kvm_flush_hdbss_buffer(vcpu); >>>>>> + return 1; >>>>> Humm, flushing in a fault handler means hanging there, in IRQ context, for >>>>> a while. >>>>> >>>>> Since we already deal with this on guest_exit (vcpu_put IIUC), why not just >>>>> return in a way the vcpu has to exit the inner loop and let it flush there >>>>> instead? >>>>> >>>>> Thanks! >>>>> Leo >>>> Thanks for the feedback. >>>> >>>> >>>> If we flush on every guest exit (by moving the flush before handle_exit, >>>> then we can >>>> >>>> indeed drop the flush from the fault handler and from vcpu_put. >>>> >>>> >>>> However, given Marc's earlier concern about not imposing this overhead on >>>> all vCPUs, >>>> >>>> I'd rather avoid flushing on every exit. >>>> >>>> >>>> My current plan is to set a request bit in kvm_handle_hdbss_fault (via >>>> kvm_make_request), >>>> >>>> and move the actual flush to the normal exit path, where it can execute in a >>>> safe context. >>>> >>>> This also allows us to remove the flush from the fault handler entirely. >>>> >>>> >>>> Does that approach sound reasonable to you? >>>> >>>> >>> Yes, I think it looks much better, as the fault will cause guest to exit, >>> and it can run the flush on it's way back in. >>> >>> Thanks! >>> Leo >>> >>>>>> + case HDBSSPROD_EL2_FSC_ExternalAbort: >>>>>> + case HDBSSPROD_EL2_FSC_GPF: >>>>>> + return -EFAULT; >>>>>> + default: >>>>>> + /* Unknown fault. */ >>>>>> + WARN_ONCE(1, >>>>>> + "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n", >>>>>> + fsc, prod, vcpu->vcpu_id); >>>>>> + return -EFAULT; >>>>>> + } >>>>>> +} >>>>>> + >>>>>> /** >>>>>> * kvm_handle_guest_abort - handles all 2nd stage aborts >>>>>> * @vcpu: the VCPU pointer >>>>>> @@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) >>>>>> >>>>>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu); >>>>>> >>>>>> + if (esr_iss2_is_hdbssf(esr)) >>>>>> + return kvm_handle_hdbss_fault(vcpu); >>>>>> + >>>>>> if (esr_fsc_is_translation_fault(esr)) { >>>>>> /* Beyond sanitised PARange (which is the IPA limit) */ >>>>>> if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) { >>>>>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c >>>>>> index 959532422d3a..c03a4b310b53 100644 >>>>>> --- a/arch/arm64/kvm/reset.c >>>>>> +++ b/arch/arm64/kvm/reset.c >>>>>> @@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu) >>>>>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array); >>>>>> kfree(vcpu->arch.vncr_tlb); >>>>>> kfree(vcpu->arch.ccsidr); >>>>>> + >>>>>> + if (vcpu->kvm->arch.enable_hdbss) >>>>>> + kvm_arm_vcpu_free_hdbss(vcpu); >>>>>> } >>>>>> >>>>>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu) >>>>>> -- >>>>>> 2.33.0 >>>> Thanks! >>>> >>>> Tian >>>>