From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-186.mta1.migadu.com (out-186.mta1.migadu.com [95.215.58.186]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A1FA61990A8 for ; Tue, 1 Oct 2024 20:41:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.186 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727815304; cv=none; b=nWYieIVOp8hBMJOj1kvCJTvJ0lzW00KLFBO6bCfWfxgmcXjFb9S6ULViLBulRv/Cr++9Vf3jKblIXBHQBmMZYjEOgJwxxG6JJMmKJyfMkfmZ5nh8nL/5h2AQdq8Tv3yIORsmN2XEGnR/6r/Imrh1XfIRG0m+YJqbN2yJCJ8Qlvo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727815304; c=relaxed/simple; bh=VarOVGYDD9N0J/h3yDhdjjHfqXzF1DB4XZXWrEbJihU=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=m0AoR+M7iH1fO+m6Q51VRPCju8UDYiL2bmzDde6URR+sSehSNAs/FQcoSUWxDunri5X0gJQpNVg1o6RPekfRUsfgrHrh2my7l5J0PIt7vg2Ew1EvSMYiBA0o1iCjw8Q9TGJ4q6mIcEAh98H5GZC4CCZZL1mCouDGlP793J/CcJM= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=QIr4F7gP; arc=none smtp.client-ip=95.215.58.186 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="QIr4F7gP" Date: Tue, 1 Oct 2024 13:41:29 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1727815299; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=75I+P7eof1YL1Fgyo6OlTobTwQwJKciYu6JUYyV9SFs=; b=QIr4F7gPiSrsYdbGpgVBm3DAqehpPddIEtNQNQkW4rQfmf1aZc1qTpX5fS8y/u2ywViHcQ I9XImgApdbeYqyMkWX1Nwjo2PIRVwm9nUuifCO3cvzq2J+WGky7fMVfVbijVT4bjjEVse9 e5/EsEiHBhOZL6HqhO2xccMnm5ikcBY= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Oliver Upton To: Sean Christopherson Cc: kvmarm@lists.linux.dev, Marc Zyngier , Joey Gouly , Suzuki K Poulose , Zenghui Yu Subject: Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Message-ID: References: <20241001001709.1303668-1-oliver.upton@linux.dev> <20241001001709.1303668-4-oliver.upton@linux.dev> Precedence: bulk X-Mailing-List: kvmarm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Migadu-Flow: FLOW_OUT Hey, sidebar: I was a bit confused by the diff for a second, since it looks like your email client lowercased some stuff :) On Tue, Oct 01, 2024 at 12:05:01PM -0700, Sean Christopherson wrote: > On Tue, Oct 01, 2024, Oliver Upton wrote: > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > index 367f0d3d3194..ab7d387379a0 100644 > > --- a/arch/arm64/kvm/arm.c > > +++ b/arch/arm64/kvm/arm.c > > @@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > > > if (kvm_dirty_ring_check_request(vcpu)) > > return 0; > > + > > + check_nested_vcpu_requests(vcpu); > > } > > > > return 1; > > diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c > > index 1ba211541290..749babdbda81 100644 > > --- a/arch/arm64/kvm/nested.c > > +++ b/arch/arm64/kvm/nested.c > > @@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) > > /* set the scene for the next search */ > > kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size; > > > > - /* clear the old state */ > > + /* make sure we don't forget to do the laundry */ > > if (kvm_s2_mmu_valid(s2_mmu)) > > - kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false); > > + s2_mmu->pending_unmap = true; > > Question time. The code that's just out of sight modifies the nested MMU's tlb_vttbr: > > s2_mmu->tlb_vttbr = vcpu_read_sys_reg(vcpu, vttbr_el2) & ~vttbr_cnp_bit; > s2_mmu->tlb_vtcr = vcpu_read_sys_reg(vcpu, vtcr_el2); > s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, hcr_el2) & hcr_vm; > > which is consumed by kvm_s2_mmu_iterate_by_vmid(). IIUC, kvm_s2_mmu_iterate_by_vmid() > is used only when emulating some form of tlbi that's executed by the guest? I.e. > is guaranteed to do the right thing because KVM ensures the old state is purged > prior to entering the guest? Heh, yeah I meant to add some comments on why iterators that walk the nested_mmus array are safe. But it's basically what you've said, if the current MMU gets reclaimed when we're in the middle of unmapping it for TLBI, then the intent of the TLBI is still upheld since all the old crap gets kicked out before reuse. The architecture requires the guest put its own stage-2 into a consistent state before issuing a TLBI, so even if we wind up doing more shadow stage-2 fills after relinquishing the lock it is still legal. And until the TLBI instruction retires (i.e. ERET back to the guest), other vCPUs can use stale 'TLB entries' all they want. I'll happily add some comments to clarify all this. > > /* > > * the virtual vmid (modulo cnp) will be used as a key when matching > > @@ -649,6 +649,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) > > s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, hcr_el2) & hcr_vm; > > > > out: > > + if (s2_mmu->pending_unmap) > > + kvm_make_request(kvm_req_nested_s2_unmap, vcpu); > > If I followed everything correctly, I don't think a request is needed. the > request will never be cross-vCPU, and each vCPU holds a reference to the MMU, so > the MMU can't be recycled, i.e. pending_unmap is guaranteed to be relevant to the > vCPU's usage of the MMU. More thoughts below in check_nested_vcpu_requests(). I'm (ab)using the request to prevent the vCPU thread from actually entering the VM without first having done the laundry. We have other examples of strictly per-vCPU tasks that are tracked with a request so this doesn't stick out that much. Otherwise we'd need an open-coded check in kvm_vcpu_exit_request() to catch a 'dirty' MMU or take a pin on it from the point we check the dirtiness to the point we disable preemption. > Somewhat of a side topic, the handling of xfer_to_guest_mode_handle_work() in > kvm_arch_vcpu_ioctl_run() is confusing. If a signal is pending (ret == -eintr), > KVM will skip checking requests, e.g. not unmap the nested stage-2, but then continue > with other pieces of the entry flow. I'm guessing that's intentional as it looks > like KVM needs to flush hardware state in case that info is exposed to userspace. > > But it really warps the mind when trying to reason about what will happen if > there is a relevant request, e.g. if the below code needs to load a new stage-2 > for a new vmid, it could do so with the to-be-unmapped contents. What you describe sounds worrying, but is still completely fine. Even without VMID rollover, we're loading a potentially-stale MMU at the point of vcpu_load(). What saves us is the fact that the EL1&0 translation regime is out-of-context at EL2, meaning that hardware is disallowed from speculatively using the translation regime. Ignoring the ARM64_WORKAROUND_SPECULATIVE_AT turd :) The point at which the MMU and translation tables *must* be consistent is the ERET into EL1/EL0. > > +void check_nested_vcpu_requests(struct kvm_vcpu *vcpu) > > +{ > > + if (kvm_check_request(kvm_req_nested_s2_unmap, vcpu)) { > > + CLASS(vcpu_hw_mmu, mmu)(vcpu); > > + > > + write_lock(&vcpu->kvm->mmu_lock); > > + if (mmu->pending_unmap) { > > I doubt it's a functional issue, but if multiple vCPUs are purging the same nested > stage-2 MMU, and the first vCPU yields while unmapping, then KVM will end up with > multiple vCPUs doing the unmap. And irrespective of whether or not the first vCPU > (or vCPUs) yields, they'll all contend for mmu_lock, which is especially problematic > since pending writers will block future readers. > > Would it instead make sense to do something like the below? Checking for the > need to unmap outside of mmu_lock would also more or less eliminate the need for > a request. Maybe? The intent of this change is to restore functional correctness without _too_ much interest in the performance of it. The general aim of the nested work on arm64 is getting something functional upstreamed first, then fixing performance after. You'll notice we have no concept of an rmap yet either... I don't really see anything wrong with your approach, it's just that we aren't necessarily committed to this allocation scheme and we may throw the whole damn thing out in favor of something else in the future. -- Thanks, Oliver