From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 70B605D471
	for <kvmarm@lists.linux.dev>; Mon, 15 Apr 2024 11:36:49 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1713181009; cv=none; b=KGu3QcuDLTd37fnvg7Twc6+Lc6uA1XZoJbz6cPmPF6cD9vLkgiUvk8p1iDg5N28FZ64MUMuYZv3P+dK47/V4xyu1c7OuG8Q8q5ilrTGF9rJNfp70pqId3HfC1KZa/7j8mHBTh0qpaqP37ruyxsepAXTYMA92GTdpQ9roGCGadXg=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1713181009; c=relaxed/simple;
	bh=0cdbyCSwrsqtUoF+KktwxZCPvkEAKMuSFqB2mfADYl0=;
	h=Date:Message-ID:From:To:Cc:Subject:In-Reply-To:References:
	 MIME-Version:Content-Type; b=njC79xo9vVqX7HhnlMX4S2CM3YLhhRzy3u/2efS4dGCqtGV+RmIzq1S1GYkV7MefK1B2cDslsNUsNPAe8ryIRpZRBpljo2HRuavUWEa/e1uyw+y033I4w6jVwCciz4/Q1aI4aQAIAQ8hK6JqMTvEWoS31pAr66FqRPdRSR6985M=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=D8LApa04; arc=none smtp.client-ip=10.30.226.201
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="D8LApa04"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id EB392C113CC;
	Mon, 15 Apr 2024 11:36:48 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1713181009;
	bh=0cdbyCSwrsqtUoF+KktwxZCPvkEAKMuSFqB2mfADYl0=;
	h=Date:From:To:Cc:Subject:In-Reply-To:References:From;
	b=D8LApa04tVpwXzAUwH6hkwIYLeq9CZLsvzKEVDgiTi8vfw1PunvWxjqqCG9VtA+Nc
	 eOUXDFsgF3feFz/0VHOHoft7Q7coKINCfNgqvLlcNRmw7pIEIYrprhGZ0Sc55vpDx+
	 k0rldq0A6O4YC2A0D2cuZbSibfANhD7KXEwYYNK4qoGUqkAXaao4okwUOclyqY1kYr
	 hANsCHj0H8ljSNrwnxWEpld1YKUTSzruSvB+SUaHPIU13Hj59E2r/nj1MOQCOYNGQA
	 8WNYQTiaG4JiRUWg01Hc6MBcf1TpYRtcYFgTBTCK3I68PfvvyFxDT7GbJXITnBLymD
	 lj5ZSNfWTbUMQ==
Received: from sofa.misterjones.org ([185.219.108.64] helo=goblin-girl.misterjones.org)
	by disco-boy.misterjones.org with esmtpsa  (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.95)
	(envelope-from <maz@kernel.org>)
	id 1rwKdj-004bdB-0m;
	Mon, 15 Apr 2024 12:36:47 +0100
Date: Mon, 15 Apr 2024 12:36:45 +0100
Message-ID: <867cgysvqq.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Fuad Tabba <tabba@google.com>
Cc: kvmarm@lists.linux.dev,
	will@kernel.org,
	qperret@google.com,
	seanjc@google.com,
	alexandru.elisei@arm.com,
	catalin.marinas@arm.com,
	philmd@linaro.org,
	james.morse@arm.com,
	suzuki.poulose@arm.com,
	oliver.upton@linux.dev,
	mark.rutland@arm.com,
	broonie@kernel.org,
	joey.gouly@arm.com,
	rananta@google.com
Subject: Re: [PATCH v1 07/44] KVM: arm64: Support TLB invalidation in guest context
In-Reply-To: <20240327173531.1379685-8-tabba@google.com>
References: <20240327173531.1379685-1-tabba@google.com>
	<20240327173531.1379685-8-tabba@google.com>
User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue)
 FLIM-LB/1.14.9 (=?UTF-8?B?R29qxY0=?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/29.2
 (aarch64-unknown-linux-gnu) MULE/6.0 (HANACHIRUSATO)
Precedence: bulk
X-Mailing-List: kvmarm@lists.linux.dev
List-Id: <kvmarm.lists.linux.dev>
List-Subscribe: <mailto:kvmarm+subscribe@lists.linux.dev>
List-Unsubscribe: <mailto:kvmarm+unsubscribe@lists.linux.dev>
MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue")
Content-Type: text/plain; charset=US-ASCII
X-SA-Exim-Connect-IP: 185.219.108.64
X-SA-Exim-Rcpt-To: tabba@google.com, kvmarm@lists.linux.dev, will@kernel.org, qperret@google.com, seanjc@google.com, alexandru.elisei@arm.com, catalin.marinas@arm.com, philmd@linaro.org, james.morse@arm.com, suzuki.poulose@arm.com, oliver.upton@linux.dev, mark.rutland@arm.com, broonie@kernel.org, joey.gouly@arm.com, rananta@google.com
X-SA-Exim-Mail-From: maz@kernel.org
X-SA-Exim-Scanned: No (on disco-boy.misterjones.org); SAEximRunCond expanded to false

On Wed, 27 Mar 2024 17:34:54 +0000,
Fuad Tabba <tabba@google.com> wrote:
> 
> From: Will Deacon <will@kernel.org>
> 
> Typically, TLB invalidation of guest stage-2 mappings using nVHE is
> performed by a hypercall originating from the host. For the invalidation
> instruction to be effective, therefore, __tlb_switch_to_{guest,host}()
> swizzle the active stage-2 context around the TLBI instruction.
> 
> With guest-to-host memory sharing and unsharing hypercalls
> originating from the guest under pKVM, there is need to support
> both guest and host VMID invalidations issued from guest context.
> 
> Replace the __tlb_switch_to_{guest,host}() functions with a more general
> {enter,exit}_vmid_context() implementation which supports being invoked
> from guest context and acts as a no-op if the target context matches the
> running context.
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  arch/arm64/kvm/hyp/nvhe/tlb.c | 114 +++++++++++++++++++++++++++-------
>  1 file changed, 90 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> index a60fb13e2192..05a66b2ed76d 100644
> --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> @@ -11,13 +11,23 @@
>  #include <nvhe/mem_protect.h>
>  
>  struct tlb_inv_context {
> -	u64		tcr;
> +	struct kvm_s2_mmu	*mmu;
> +	u64			tcr;
> +	u64			sctlr;
>  };
>  
> -static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,
> -				  struct tlb_inv_context *cxt,
> -				  bool nsh)
> +static void enter_vmid_context(struct kvm_s2_mmu *mmu,
> +			       struct tlb_inv_context *cxt,
> +			       bool nsh)
>  {
> +	struct kvm_s2_mmu *host_s2_mmu = &host_mmu.arch.mmu;
> +	struct kvm_cpu_context *host_ctxt;
> +	struct kvm_vcpu *vcpu;
> +
> +	host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
> +	vcpu = host_ctxt->__hyp_running_vcpu;
> +	cxt->mmu = NULL;
> +
>  	/*
>  	 * We have two requirements:
>  	 *
> @@ -40,20 +50,52 @@ static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,
>  	else
>  		dsb(ish);
>  
> +	/*
> +	 * If we're already in the desired context, then there's nothing
> +	 * to do.
> +	 */
> +	if (vcpu) {
> +		/* We're in guest context */
> +		if (mmu == vcpu->arch.hw_mmu || WARN_ON(mmu != host_s2_mmu))
> +			return;

I'm a bit concerned about this one, not so much for what it does, but
because it outlines an inconsistency we have.

Under memory pressure, we can end-up unmapping a page via the MMU
notifiers, and will provide a s2_mmu context for the TLBI. This can
happen while *another* context is loaded (a vcpu from a different VM)
and that vcpu faults.

You'd end up with a scenario very similar to the one I debugged here:

https://lore.kernel.org/kvmarm/86y1gfn67v.wl-maz@kernel.org

Now, this doesn't break here because __hyp_running_vcpu is set to NULL
on each exit from the HYP code. But that only happens on nVHE, and not
on VHE, which bizarrely only sets this on entry and leaves a dangling
pointer...

I think we need to clarify how and when this pointer is considered
valid.

> +
> +		cxt->mmu = vcpu->arch.hw_mmu;
> +	} else {
> +		/* We're in host context */
> +		if (mmu == host_s2_mmu)
> +			return;
> +
> +		cxt->mmu = host_s2_mmu;
> +	}
> +
>  	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
>  		u64 val;
>  
>  		/*
>  		 * For CPUs that are affected by ARM 1319367, we need to
> -		 * avoid a host Stage-1 walk while we have the guest's
> -		 * VMID set in the VTTBR in order to invalidate TLBs.
> -		 * We're guaranteed that the S1 MMU is enabled, so we can
> -		 * simply set the EPD bits to avoid any further TLB fill.
> +		 * avoid a Stage-1 walk with the old VMID while we have
> +		 * the new VMID set in the VTTBR in order to invalidate TLBs.
> +		 * We're guaranteed that the host S1 MMU is enabled, so
> +		 * we can simply set the EPD bits to avoid any further
> +		 * TLB fill. For guests, we ensure that the S1 MMU is
> +		 * temporarily enabled in the next context.
>  		 */
>  		val = cxt->tcr = read_sysreg_el1(SYS_TCR);
>  		val |= TCR_EPD1_MASK | TCR_EPD0_MASK;
>  		write_sysreg_el1(val, SYS_TCR);
>  		isb();
> +
> +		if (vcpu) {
> +			val = cxt->sctlr = read_sysreg_el1(SYS_SCTLR);
> +			if (!(val & SCTLR_ELx_M)) {
> +				val |= SCTLR_ELx_M;
> +				write_sysreg_el1(val, SYS_SCTLR);
> +				isb();
> +			}
> +		} else {
> +			/* The host S1 MMU is always enabled. */
> +			cxt->sctlr = SCTLR_ELx_M;
> +		}
>  	}
>  
>  	/*
> @@ -62,20 +104,44 @@ static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,
>  	 * ensuring that we always have an ISB, but not two ISBs back
>  	 * to back.
>  	 */
> -	__load_stage2(mmu, kern_hyp_va(mmu->arch));
> +	if (vcpu)
> +		__load_host_stage2();
> +	else
> +		__load_stage2(mmu, kern_hyp_va(mmu->arch));
> +
>  	asm(ALTERNATIVE("isb", "nop", ARM64_WORKAROUND_SPECULATIVE_AT));
>  }
>  
> -static void __tlb_switch_to_host(struct tlb_inv_context *cxt)
> +static void exit_vmid_context(struct tlb_inv_context *cxt)
>  {
> -	__load_host_stage2();
> +	struct kvm_s2_mmu *mmu = cxt->mmu;
> +	struct kvm_cpu_context *host_ctxt;
> +	struct kvm_vcpu *vcpu;
> +
> +	host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
> +	vcpu = host_ctxt->__hyp_running_vcpu;
> +
> +	if (!mmu)
> +		return;
> +
> +	if (vcpu)
> +		__load_stage2(mmu, kern_hyp_va(mmu->arch));
> +	else
> +		__load_host_stage2();
>  
>  	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
> -		/* Ensure write of the host VMID */
> +		/* Ensure write of the old VMID */
>  		isb();
> -		/* Restore the host's TCR_EL1 */
> +
> +		if (!(cxt->sctlr & SCTLR_ELx_M)) {
> +			write_sysreg_el1(cxt->sctlr, SYS_SCTLR);
> +			isb();
> +		}
> +
>  		write_sysreg_el1(cxt->tcr, SYS_TCR);
>  	}
> +
> +	cxt->mmu = NULL;

nit: do we actually need this last line?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.