Re: [PATCH v1 2/3] KVM: x86: Handle the case of 5-level shadow page table

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Wei Huang <wei.huang2@amd.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, vkuznets@redhat.com, wanpengli@tencent.com,
	jmattson@google.com, joro@8bytes.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com
Subject: Re: [PATCH v1 2/3] KVM: x86: Handle the case of 5-level shadow page table
Date: Fri, 6 Aug 2021 17:58:30 +0000	[thread overview]
Message-ID: <YQ14RmuYxlAydmOu@google.com> (raw)
In-Reply-To: <20210805205504.2647362-3-wei.huang2@amd.com>

On Thu, Aug 05, 2021, Wei Huang wrote:
> When the 5-level page table CPU flag is exposed, KVM code needs to handle
> this case by pointing mmu->root_hpa to a properly-constructed 5-level page
> table.
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Wei Huang <wei.huang2@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu/mmu.c          | 46 +++++++++++++++++++++++----------
>  2 files changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 20ddfbac966e..8586ffdf4de8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -447,6 +447,7 @@ struct kvm_mmu {
>  
>  	u64 *pae_root;
>  	u64 *pml4_root;
> +	u64 *pml5_root;
>  
>  	/*
>  	 * check zero bits on shadow page table entries, these
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 44e4561e41f5..b162c3e530aa 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3428,7 +3428,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	 * the shadow page table may be a PAE or a long mode page table.
>  	 */
>  	pm_mask = PT_PRESENT_MASK | shadow_me_mask;
> -	if (mmu->shadow_root_level == PT64_ROOT_4LEVEL) {
> +	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
>  		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
>  
>  		if (WARN_ON_ONCE(!mmu->pml4_root)) {
> @@ -3454,11 +3454,17 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  				      PT32_ROOT_LEVEL, false);
>  		mmu->pae_root[i] = root | pm_mask;
>  	}
> +	mmu->root_hpa = __pa(mmu->pae_root);
>  
> -	if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
> +	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
> +		mmu->pml4_root[0] = mmu->root_hpa | pm_mask;
>  		mmu->root_hpa = __pa(mmu->pml4_root);
> -	else
> -		mmu->root_hpa = __pa(mmu->pae_root);
> +	}
> +
> +	if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) {
> +		mmu->pml5_root[0] = mmu->root_hpa | pm_mask;
> +		mmu->root_hpa = __pa(mmu->pml5_root);
> +	}

Ouch, the root_hpa chaining is subtle.  That's my fault :-)  I think it would be
better to explicitly chain pae->pml4->pml5?  E.g.

	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
		mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;

		if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) {
			mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
			mmu->root_hpa = __pa(mmu->pml5_root);
		} else {
			mmu->root_hpa = __pa(mmu->pml4_root);
		}
	} else {
		mmu->root_hpa = __pa(mmu->pae_root);
	}

It'd require more churn if we get to 6-level paging, but that's a risk I'm willing
to take ;-)

>  
>  set_root_pgd:
>  	mmu->root_pgd = root_pgd;
> @@ -3471,7 +3477,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
> -	u64 *pml4_root, *pae_root;
> +	u64 *pml5_root, *pml4_root, *pae_root;
>  
>  	/*
>  	 * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
> @@ -3487,17 +3493,18 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
>  	 * This mess only works with 4-level paging and needs to be updated to
>  	 * work with 5-level paging.
>  	 */
> -	if (WARN_ON_ONCE(mmu->shadow_root_level != PT64_ROOT_4LEVEL))
> +	if (WARN_ON_ONCE(mmu->shadow_root_level < PT64_ROOT_4LEVEL)) {

This is amusingly wrong.  The check above this is:

	if (mmu->direct_map || mmu->root_level >= PT64_ROOT_4LEVEL ||
	    mmu->shadow_root_level < PT64_ROOT_4LEVEL)  <--------
		return 0;

meaning this is dead code.  It should simply deleted.  If we reaaaaaly wanted to
future proof the code, we could do:

	if (WARN_ON_ONCE(mmu->shadow_root_level > PT64_ROOT_5LEVEL)
		return -EIO;

but at that point we're looking at a completely different architecture, so I don't
think we need to be that paranoid :-)

>  		return -EIO;
> +	}
>  
> -	if (mmu->pae_root && mmu->pml4_root)
> +	if (mmu->pae_root && mmu->pml4_root && mmu->pml5_root)
>  		return 0;
>  
>  	/*
>  	 * The special roots should always be allocated in concert.  Yell and
>  	 * bail if KVM ends up in a state where only one of the roots is valid.
>  	 */
> -	if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root))
> +	if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root || mmu->pml5_root))
>  		return -EIO;
>  
>  	/*
> @@ -3506,18 +3513,30 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
>  	 */
>  	pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
>  	if (!pae_root)
> -		return -ENOMEM;
> +		goto err_out;

Branching to the error handling here is silly, it's the first allocation.

>  
>  	pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> -	if (!pml4_root) {
> -		free_page((unsigned long)pae_root);
> -		return -ENOMEM;
> -	}
> +	if (!pml4_root)
> +		goto err_out;
> +
> +	pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);

This should be guarded by "mmu->shadow_root_level > PT64_ROOT_4LEVEL", there's no
need to waste a page on PML5 if it can't exist.

> +	if (!pml5_root)
> +		goto err_out;
>  
>  	mmu->pae_root = pae_root;
>  	mmu->pml4_root = pml4_root;
> +	mmu->pml5_root = pml5_root;
>  
>  	return 0;
> +err_out:
> +	if (pae_root)
> +		free_page((unsigned long)pae_root);
> +	if (pml4_root)
> +		free_page((unsigned long)pml4_root);
> +	if (pml5_root)
> +		free_page((unsigned long)pml5_root);

This is flawed as failure to allocate pml4_root will consume an uninitialized
pml5_root.  There's also no need to check for non-NULL values as free_page plays
nice with NULL pointers.

If you drop the unnecessary goto for pae_root allocation failure, than this can
become:

err_out:
	free_page((unsigned long)pml4_root);
	free_page((unsigned long)pae_root);

since pml4_root will be NULL if pml4_root allocation failures.  IMO that's
unnecessarily clever though, and a more standard:

err_pml5:
	free_page((unsigned long)pml4_root);
err_pml4:
	free_page((unsigned long)pae_root);
	return -ENOMEM;

would be far easier to read/maintain.

> +
> +	return -ENOMEM;
>  }
>  
>  void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
> @@ -5320,6 +5339,7 @@ static void free_mmu_pages(struct kvm_mmu *mmu)
>  		set_memory_encrypted((unsigned long)mmu->pae_root, 1);
>  	free_page((unsigned long)mmu->pae_root);
>  	free_page((unsigned long)mmu->pml4_root);
> +	free_page((unsigned long)mmu->pml5_root);
>  }
>  
>  static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> -- 
> 2.31.1
>

next prev parent reply	other threads:[~2021-08-06 17:58 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-05 20:55 [PATCH v1 0/3] SVM 5-level page table support Wei Huang
2021-08-05 20:55 ` [PATCH v1 1/3] KVM: x86: Convert TDP level calculation to vendor's specific code Wei Huang
2021-08-05 21:51   ` Sean Christopherson
2021-08-05 22:26     ` Wei Huang
2021-08-08 19:30     ` Wei Huang
2021-08-05 22:35   ` Jim Mattson
2021-08-05 22:44     ` Sean Christopherson
2021-08-05 20:55 ` [PATCH v1 2/3] KVM: x86: Handle the case of 5-level shadow page table Wei Huang
2021-08-06 17:58   ` Sean Christopherson [this message]
2021-08-08 17:49     ` Wei Huang
2021-08-05 20:55 ` [PATCH v1 3/3] KVM: SVM: Add 5-level page table support for SVM Wei Huang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YQ14RmuYxlAydmOu@google.com \
    --to=seanjc@google.com \
    --cc=bp@alien8.de \
    --cc=hpa@zytor.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=wei.huang2@amd.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.