public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	vkuznets@redhat.com, mlevitsk@redhat.com, dmatlack@google.com
Subject: Re: [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes
Date: Fri, 11 Feb 2022 18:48:16 +0000	[thread overview]
Message-ID: <YgavcP/jb5njjKKn@google.com> (raw)
In-Reply-To: <20220209170020.1775368-13-pbonzini@redhat.com>

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> kvm_mmu_reset_context is called on all role changes and right now it
> calls kvm_mmu_unload.  With the legacy MMU this is a relatively cheap
> operation; the previous PGDs remains in the hash table and is picked
> up immediately on the next page fault.  With the TDP MMU, however, the
> roots are thrown away for good and a full rebuild of the page tables is
> necessary, which is many times more expensive.
> 
> Fortunately, throwing away the roots is not necessary except when
> the manual says a TLB flush is required:
> 
> - changing CR0.PG from 1 to 0 (because it flushes the TLB according to
>   the x86 architecture specification)
> 
> - changing CPUID (which changes the interpretation of page tables in
>   ways not reflected by the role).
> 
> - changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
> 
> Except for these cases, once the MMU has updated the CPU/MMU roles
> and metadata it is enough to force-reload the current value of CR3.
> KVM will look up the cached roots for an entry with the right role and
> PGD, and only if the cache misses a new root will be created.
> 
> Measuring with vmexit.flat from kvm-unit-tests shows the following
> improvement:
> 
>              TDP         legacy       shadow
>    before    46754       5096         5150
>    after     4879        4875         5006
> 
> which is for very small page tables.  The impact is however much larger
> when running as an L1 hypervisor, because the new page tables cause
> extra work for L0 to shadow them.
> 
> Reported-by: Brad Spengler <spender@grsecurity.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c |  7 ++++---
>  arch/x86/kvm/x86.c     | 27 ++++++++++++++++++---------
>  2 files changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 38b40ddcaad7..dbd4e98ba426 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  {
>  	/*
> -	 * Invalidate all MMU roles to force them to reinitialize as CPUID
> -	 * information is factored into reserved bit calculations.
> +	 * Invalidate all MMU roles and roots to force them to reinitialize,
> +	 * as CPUID information is factored into reserved bit calculations.
>  	 *
>  	 * Correctly handling multiple vCPU models with respect to paging and
>  	 * physical address properties) in a single VM would require tracking
> @@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
>  	vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
>  	vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
> +	kvm_mmu_unload(vcpu);
>  	kvm_mmu_reset_context(vcpu);
>  
>  	/*
> @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>  {
> -	kvm_mmu_unload(vcpu);
>  	kvm_init_mmu(vcpu);
> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);

This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
affected, e.g. SMM transitions, KVM_SET_SREG, etc...

Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
The call to kvm_mmu_new_pgd() is also 

To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}().  In
the future we can/should work on avoiding unload in all paths, but again, future
problem.

>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0d3646535cc5..97c4f5fc291f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  		kvm_async_pf_hash_reset(vcpu);
>  	}
>  
> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> +			kvm_mmu_unload(vcpu);

Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
respect to the changelog.  Please elaborate :-)

>  		kvm_mmu_reset_context(vcpu);
> +	}
>  
>  	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
>  	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> @@ -1067,15 +1071,18 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
>  	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
>  	 * is slow, but changing CR4.PCIDE is a rare case.
>  	 *
> -	 * If CR4.PGE is changed, the guest TLB must be flushed.
> +	 * Setting SMEP also needs to flush the TLB, in addition to resetting
> +	 * the MMU.
>  	 *
> -	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
> -	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
> -	 * the usage of "else if".
> +	 * If CR4.PGE is changed, the guest TLB must be flushed.  Because
> +	 * the shadow MMU ignores global pages, this bit is not part of
> +	 * KVM_MMU_CR4_ROLE_BITS.
>  	 */
> -	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> +	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS) {
>  		kvm_mmu_reset_context(vcpu);
> -	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> +		if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
> +			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);

This mishandles CR4.PGE.  Per the comment above, the if-elif-elif sequence relies
on kvm_mmu_reset_context being a superset of KVM_REQ_TLB_FLUSH_GUEST.

For both CR0 and CR4, I think we should disassociate the TLB flush logic from the
MMU role logic, e.g. CR4.PGE _could_ be part of the role, but it's not because KVM
doesn't emulate global pages.

This is what I'm thinking, assuming CR0.PG 1=>0 really only needs a flush.


---
 arch/x86/kvm/mmu/mmu.c |  4 ++--
 arch/x86/kvm/x86.c     | 42 +++++++++++++++++++++++++++++-------------
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e41834748d52..c477c519c784 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5041,8 +5041,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	/*
-	 * Invalidate all MMU roles to force them to reinitialize as CPUID
-	 * information is factored into reserved bit calculations.
+	 * Invalidate all MMU roles and roots to force them to reinitialize,
+	 * as CPUID information is factored into reserved bit calculations.
 	 *
 	 * Correctly handling multiple vCPU models with respect to paging and
 	 * physical address properties) in a single VM would require tracking
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 782dc9cd31d8..b8dad04301ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -863,15 +863,28 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
 }
 EXPORT_SYMBOL_GPL(load_pdptrs);

+static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
+{
+	kvm_mmu_init(vcpu);
+	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
+}
+
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
+
+		/*
+		 * Clearing CR0.PG is architecturally defined as flushing the
+		 * TLB from the guest's perspective.
+		 */
+		if (!(cr0 & X86_CR0_PG))
+			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 	}

 	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+		kvm_post_set_cr_reinit_mmu(vcpu);

 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
@@ -1055,26 +1068,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
 void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
 	/*
-	 * If any role bit is changed, the MMU needs to be reset.
-	 *
 	 * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
 	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
 	 * according to the SDM; however, stale prev_roots could be reused
 	 * incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
 	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
 	 * is slow, but changing CR4.PCIDE is a rare case.
-	 *
-	 * If CR4.PGE is changed, the guest TLB must be flushed.
-	 *
-	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
-	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
-	 * the usage of "else if".
 	 */
-	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
-	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+	if ((cr4 ^ old_cr4) & X86_CR4_PCIDE) {
 		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
-	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
+		return;
+	}
+
+	/* If any role bit is changed, the MMU needs to be reinitialized. */
+	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
+		kvm_post_set_cr_reinit_mmu(vcpu);
+
+	/*
+	 * Setting SMEP or toggling PGE is architecturally defined as flushing
+	 * the TLB from the guest's perspective.  Note, because the shadow MMU
+	 * ignores global pages, CR4.PGE is not part of KVM_MMU_CR4_ROLE_BITS.
+	 */
+	if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
+	    ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr4);

base-commit: a8c36d04d70d0b15e696561e1a2134fcbdd3a3bd
--


  parent reply	other threads:[~2022-02-11 18:48 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
2022-02-09 17:00 ` [PATCH 01/12] KVM: x86: host-initiated EFER.LME write affects the MMU Paolo Bonzini
2022-02-10 22:49   ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 02/12] KVM: MMU: move MMU role accessors to header Paolo Bonzini
2022-02-10 23:00   ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0 Paolo Bonzini
2022-02-10 23:10   ` Sean Christopherson
2022-02-10 23:14     ` Sean Christopherson
2022-02-10 23:16       ` Sean Christopherson
2022-02-11 11:16         ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 04/12] KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload Paolo Bonzini
2022-02-10 23:20   ` Sean Christopherson
2022-02-11 11:18     ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 05/12] KVM: MMU: avoid NULL-pointer dereference on page freeing bugs Paolo Bonzini
2022-02-11  0:24   ` Sean Christopherson
2022-02-11 11:21     ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload Paolo Bonzini
2022-02-11  0:27   ` Sean Christopherson
2022-02-11 10:07     ` Paolo Bonzini
2022-02-11 16:16       ` Sean Christopherson
2022-02-11 16:52         ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 07/12] KVM: x86: use struct kvm_mmu_root_info for mmu->root Paolo Bonzini
2022-02-11 17:39   ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots Paolo Bonzini
2022-02-11  0:41   ` Sean Christopherson
2022-02-11  0:54     ` Sean Christopherson
2022-02-11  1:07       ` Paolo Bonzini
2022-02-11  1:35         ` Sean Christopherson
2022-02-11  1:44           ` Sean Christopherson
2022-02-11  2:20             ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit Paolo Bonzini
2022-02-11  1:32   ` Sean Christopherson
2022-02-11  1:37     ` Sean Christopherson
2022-02-11 10:09       ` Paolo Bonzini
2022-02-11 11:45     ` Paolo Bonzini
2022-02-11 17:38       ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 10/12] KVM: MMU: load new PGD after the shadow MMU is initialized Paolo Bonzini
2022-02-11 17:45   ` Sean Christopherson
2022-02-11 17:47     ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 11/12] KVM: MMU: remove kvm_mmu_calc_root_page_role Paolo Bonzini
2022-02-11 17:53   ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes Paolo Bonzini
2022-02-11  9:08   ` Nikunj A. Dadhania
2022-02-11 18:48   ` Sean Christopherson [this message]
2022-02-14 16:34     ` Paolo Bonzini
2022-02-14 19:24       ` Sean Christopherson
2022-02-15  8:17         ` Paolo Bonzini
2022-02-09 17:07 ` [PATCH 00/12] KVM: MMU: " Sean Christopherson
2022-02-09 17:11   ` Paolo Bonzini
2022-02-09 17:16     ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YgavcP/jb5njjKKn@google.com \
    --to=seanjc@google.com \
    --cc=dmatlack@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mlevitsk@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=vkuznets@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox