Message-ID: <8b5f128c-14e4-4b15-8aba-cf8dfcb51c54@linux.dev>
Date: Mon, 9 Feb 2026 02:34:53 +0800
Subject: Re: [PATCH v3 02/16] KVM: arm64: nv: Implement nested Stage-2 page table walk logic
To: Marc Zyngier
Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, James Morse, Suzuki K Poulose,
 Oliver Upton, Zenghui Yu, Joey Gouly, Alexandru Elisei,
 Christoffer Dall, Ganapatrao Kulkarni, wanghaibin.wang@huawei.com
References: <20240614144552.2773592-1-maz@kernel.org>
 <20240614144552.2773592-3-maz@kernel.org>
 <3f88cd49-68f1-4276-a067-b7c6beadb27c@linux.dev>
 <86ms1m9lp3.wl-maz@kernel.org>
From: Zenghui Yu
In-Reply-To: <86ms1m9lp3.wl-maz@kernel.org>

On 2/6/26 7:05 PM, Marc Zyngier wrote:
> On Wed, 04 Feb 2026 08:28:57 +0000,
> Zenghui Yu wrote:
> >
> > Hi Marc,
> >
> > [ chewing through the NV code.. ;-) ]
>
> You fool!
> :)
>
> >
> > On 6/14/24 10:45 PM, Marc Zyngier wrote:
> > > From: Christoffer Dall
> > >
> > > Based on the pseudo-code in the ARM ARM, implement a stage 2 software
> > > page table walker.
> >
> > [...]
> >
> > > +static u32 compute_fsc(int level, u32 fsc)
> > > +{
> > > +	return fsc | (level & 0x3);
> > > +}
> > > +
> > > +static int get_ia_size(struct s2_walk_info *wi)
> > > +{
> > > +	return 64 - wi->t0sz;
> > > +}
> > > +
> > > +static int check_base_s2_limits(struct s2_walk_info *wi,
> > > +				int level, int input_size, int stride)
> > > +{
> > > +	int start_size, ia_size;
> > > +
> > > +	ia_size = get_ia_size(wi);
> > > +
> > > +	/* Check translation limits */
> > > +	switch (BIT(wi->pgshift)) {
> > > +	case SZ_64K:
> > > +		if (level == 0 || (level == 1 && ia_size <= 42))
> >
> > It looks broken as the pseudocode checks the limits based on
> > *implemented PA size*, rather than on ia_size, which is essentially the
> > input address size (64 - T0SZ) programmed by L1 hypervisor. They're
> > different.
> >
> > We can probably get the implemented PA size by:
> >
> > AArch64.PAMax()
> > {
> > 	parange = get_idreg_field_enum(kvm, ID_AA64MMFR0_EL1, PARANGE);
> > 	return id_aa64mmfr0_parange_to_phys_shift(parange);
> > }
> >
> > Not sure if I've read the spec correctly.
>
> I think that's also the way I read AArch64_S2InvalidSL(), which more
> or less mirrors the above.
>
> The question is what should we limit it to? Is it PARange, as you
> suggest? Or the IPA range defined by userspace at VM creation (the
> type argument, which ends up in kvm->arch.mmu.pgt->ia_bits)?
>
> I think this is the former, but we probably also need to handle the
> latter on actual access (when reading the descriptor). Failure to read
> the descriptor (because it is outside of a memslot) should result in a
> SEA being injected in the guest.

Yup.
Do you suggest something like

diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index cdeeb8f09e72..cdd900305047 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -293,8 +293,10 @@ static int walk_nested_s2_pgd(struct kvm_vcpu *vcpu, phys_addr_t ipa,

 		paddr = base_addr | index;
 		ret = read_guest_s2_desc(vcpu, paddr, &desc, wi);
-		if (ret < 0)
+		if (ret < 0) {
+			out->esr = ESR_ELx_FSC_SEA_TTW(level);
 			return ret;
+		}

 		new_desc = desc;

The current code doesn't specify the FSC when we fail to read the
descriptor.

> >
> > > +			return -EFAULT;
> > > +		break;
> > > +	case SZ_16K:
> > > +		if (level == 0 || (level == 1 && ia_size <= 40))
> > > +			return -EFAULT;
> > > +		break;
> > > +	case SZ_4K:
> > > +		if (level < 0 || (level == 0 && ia_size <= 42))
> > > +			return -EFAULT;
> > > +		break;
> > > +	}
> > > +
> > > +	/* Check input size limits */
> > > +	if (input_size > ia_size)
> >
> > This is always false for the current code. ;-)
>
> Yup. At least that doesn't introduce any extra bug! :)
>
> >
> > > +		return -EFAULT;
> > > +
> > > +	/* Check number of entries in starting level table */
> > > +	start_size = input_size - ((3 - level) * stride + wi->pgshift);
> > > +	if (start_size < 1 || start_size > stride + 4)
> > > +		return -EFAULT;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/* Check if output is within boundaries */
> > > +static int check_output_size(struct s2_walk_info *wi, phys_addr_t output)
> > > +{
> > > +	unsigned int output_size = wi->max_oa_bits;
> > > +
> > > +	if (output_size != 48 && (output & GENMASK_ULL(47, output_size)))
> > > +		return -1;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * This is essentially a C-version of the pseudo code from the ARM ARM
> > > + * AArch64.TranslationTableWalk function. I strongly recommend looking at
> > > + * that pseudocode in trying to understand this.
> > > + *
> > > + * Must be called with the kvm->srcu read lock held
> > > + */
> > > +static int walk_nested_s2_pgd(phys_addr_t ipa,
> > > +			      struct s2_walk_info *wi, struct kvm_s2_trans *out)
> > > +{
> > > +	int first_block_level, level, stride, input_size, base_lower_bound;
> > > +	phys_addr_t base_addr;
> > > +	unsigned int addr_top, addr_bottom;
> > > +	u64 desc; /* page table entry */
> > > +	int ret;
> > > +	phys_addr_t paddr;
> > > +
> > > +	switch (BIT(wi->pgshift)) {
> > > +	default:
> > > +	case SZ_64K:
> > > +	case SZ_16K:
> > > +		level = 3 - wi->sl;
> > > +		first_block_level = 2;
> > > +		break;
> > > +	case SZ_4K:
> > > +		level = 2 - wi->sl;
> > > +		first_block_level = 1;
> > > +		break;
> > > +	}
> > > +
> > > +	stride = wi->pgshift - 3;
> > > +	input_size = get_ia_size(wi);
> > > +	if (input_size > 48 || input_size < 25)
> > > +		return -EFAULT;
> > > +
> > > +	ret = check_base_s2_limits(wi, level, input_size, stride);
> > > +	if (WARN_ON(ret))
> > > +		return ret;
> > > +
> > > +	base_lower_bound = 3 + input_size - ((3 - level) * stride +
> > > +			wi->pgshift);
> > > +	base_addr = wi->baddr & GENMASK_ULL(47, base_lower_bound);
> > > +
> > > +	if (check_output_size(wi, base_addr)) {
> > > +		out->esr = compute_fsc(level, ESR_ELx_FSC_ADDRSZ);
> >
> > With a wrongly programmed base address, we should report the ADDRSZ
> > fault at level 0 (as per R_BFHQH and the pseudocode). It's easy to fix.
> >
>
> Yup. Although this rule describes S1 rather than S2 (we don't seem to
> have anything saying the same thing for S2), but I expect the
> behaviour to be exactly the same.

The rule does describe S2. :)

R_BFHQH: "
When an Address size fault is generated, the reported fault code
indicates one of the following:

If the fault was generated due to the TTBR_ELx used in the translation
having nonzero address bits above the OA size, then a fault at level 0.
"

where TTBR_ELx is the general name which also applies to VTTBR_EL2.
AArch64.S2Walk also clearly describes this behaviour (I read DDI0487G.b
J1-8122).

> > > +static void vtcr_to_walk_info(u64 vtcr, struct s2_walk_info *wi)
> > > +{
> > > +	wi->t0sz = vtcr & TCR_EL2_T0SZ_MASK;
> > > +
> > > +	switch (vtcr & VTCR_EL2_TG0_MASK) {
> > > +	case VTCR_EL2_TG0_4K:
> > > +		wi->pgshift = 12; break;
> > > +	case VTCR_EL2_TG0_16K:
> > > +		wi->pgshift = 14; break;
> > > +	case VTCR_EL2_TG0_64K:
> > > +	default: /* IMPDEF: treat any other value as 64k */
> > > +		wi->pgshift = 16; break;
> > > +	}
> > > +
> > > +	wi->sl = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
> > > +	/* Global limit for now, should eventually be per-VM */
> > > +	wi->max_oa_bits = min(get_kvm_ipa_limit(),
> >                           ^^^
> >
> > Should we use AArch64.PAMax() instead? As the output address size is
> > never larger than the implemented PA size.
> >
> > Now I'm wondering if we can let kvm_get_pa_bits() just return PAMax for
> > (based on the exposed (to-L1) AA64MMFR0.PARange value) and use it when
> > possible.
>
> Yes, that was the plan all along, but I got sidetracked.

OK. The concern is that PARange is writable from userspace, so
init_nested_s2_mmu() cannot get a fixed PAMax value (kvm_get_pa_bits())
until AA64MMFR0_EL1 has become immutable (i.e., the VM has started). I
need to think it over.

Thanks,
Zenghui
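
For illustration only, the PAMax-based starting-level check discussed in
this thread could look roughly like the standalone C fragment below. It is
a sketch under assumptions, not the kernel implementation: the names
pa_max_from_parange(), s2_start_level_ok() and s2_start_level_examples()
are hypothetical, the PARange table only covers encodings 0 to 6, and the
thresholds are copied from the quoted check_base_s2_limits() with ia_size
replaced by the implemented PA size.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for AArch64.PAMax(): decode an
 * ID_AA64MMFR0_EL1.PARange field value into a PA width in bits.
 * Returns -1 for encodings this sketch does not handle.
 */
int pa_max_from_parange(unsigned int parange)
{
	static const int bits[] = { 32, 36, 40, 42, 44, 48, 52 };

	if (parange >= sizeof(bits) / sizeof(bits[0]))
		return -1;
	return bits[parange];
}

/*
 * Starting-level check keyed on the implemented PA size (pa_max)
 * instead of 64 - T0SZ. Thresholds match the quoted patch; only the
 * compared quantity differs. Unknown granules are rejected here,
 * which is a choice made for this sketch.
 */
bool s2_start_level_ok(unsigned int pgshift, int level, int pa_max)
{
	switch (pgshift) {
	case 16:	/* 64K granule */
		return !(level == 0 || (level == 1 && pa_max <= 42));
	case 14:	/* 16K granule */
		return !(level == 0 || (level == 1 && pa_max <= 40));
	case 12:	/* 4K granule */
		return !(level < 0 || (level == 0 && pa_max <= 42));
	default:
		return false;
	}
}

/* Usage: with a 40-bit PA (PARange == 2), a 4K walk cannot start at level 0. */
void s2_start_level_examples(void)
{
	assert(pa_max_from_parange(2) == 40);
	assert(!s2_start_level_ok(12, 0, 40));
	assert(s2_start_level_ok(12, 0, 48));
}
```

Passing pa_max as a parameter keeps the check independent of where the
value comes from, which leaves open the question raised above of when the
guest-visible PARange becomes fixed.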