Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Maxim Levitsky <mlevitsk@redhat.com>
To: Sean Christopherson <seanjc@google.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
	Lai Jiangshan <laijs@linux.alibaba.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Avi Kivity <avi@redhat.com>,
	kvm@vger.kernel.org
Subject: Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
Date: Mon, 13 Sep 2021 14:30:51 +0300	[thread overview]
Message-ID: <0103c8b2cccea601bd3474f47d982b37e9536921.camel@redhat.com> (raw)
In-Reply-To: <YTFhCt87vzo4xDrc@google.com>

On Thu, 2021-09-02 at 23:40 +0000, Sean Christopherson wrote:
> On Tue, Aug 24, 2021, Lai Jiangshan wrote:
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > 
> > If gpte is changed from non-present to present, the guest doesn't need
> > to flush tlb per SDM.  So the host must synchronze sp before
> > link it.  Otherwise the guest might use a wrong mapping.
> > 
> > For example: the guest first changes a level-1 pagetable, and then
> > links its parent to a new place where the original gpte is non-present.
> > Finally the guest can access the remapped area without flushing
> > the tlb.  The guest's behavior should be allowed per SDM, but the host
> > kvm mmu makes it wrong.
> 
> Ah, are you saying, given:
> 
> VA_x = PML4_A -> PDP_B -> PD_C -> PT_D
> 
> the guest can modify PT_D, then link it with
> 
> VA_y = PML4_A -> PDP_B -> PD_E -> PT_D
> 
> and access it via VA_y without flushing, and so KVM must sync PT_D.  Is that
> correct?

Looks like this. However 


> 
> > Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
> > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > ---
> 
> ...
> 
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 50ade6450ace..48c7fe1b2d50 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -664,7 +664,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
> >   * emulate this operation, return 1 to indicate this case.
> >   */
> >  static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > -			 struct guest_walker *gw)
> > +			 struct guest_walker *gw, unsigned long mmu_seq)
> >  {
> >  	struct kvm_mmu_page *sp = NULL;
> >  	struct kvm_shadow_walk_iterator it;
> > @@ -678,6 +678,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	top_level = vcpu->arch.mmu->root_level;
> >  	if (top_level == PT32E_ROOT_LEVEL)
> >  		top_level = PT32_ROOT_LEVEL;
> > +
> > +again:
> >  	/*
> >  	 * Verify that the top-level gpte is still there.  Since the page
> >  	 * is a root page, it is either write protected (and cannot be
> > @@ -713,8 +715,28 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
> >  			goto out_gpte_changed;
> >  
> > -		if (sp)
> > +		if (sp) {
> > +			/*
> > +			 * We must synchronize the pagetable before link it
> > +			 * because the guest doens't need to flush tlb when
> > +			 * gpte is changed from non-present to present.
> > +			 * Otherwise, the guest may use the wrong mapping.
> > +			 *
> > +			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
> > +			 * synchronized it transiently via kvm_sync_page().
> > +			 *
> > +			 * For higher level pagetable, we synchronize it
> > +			 * via slower mmu_sync_children().  If it once
> > +			 * released the mmu_lock, we need to restart from
> > +			 * the root since we don't have reference to @sp.
> > +			 */
> > +			if (sp->unsync_children && !mmu_sync_children(vcpu, sp, false)) {
> 
> I don't like dropping mmu_lock in the page fault path.  I agree that it's not
> all that different than grabbing various things in kvm_mmu_do_page_fault() long
> before acquiring mmu_lock, but I'm not 100% convinced we don't have a latent
> bug hiding somehwere in there :-), and (b) there's a possibility, however small,
> that something in FNAME(fetch) that we're missing.  Case in point, this technically
> needs to do make_mmu_pages_available().
> 
> And I believe kvm_mmu_get_page() already tries to handle this case by requesting
> KVM_REQ_MMU_SYNC if it uses a sp with unsync_children, it just doesn't handle SMP
> interaction, e.g. can link a sp that's immediately available to other vCPUs before
> the sync.
> 
> Rather than force the sync here, what about kicking all vCPUs and retrying the
> page fault?  The only gross part is that kvm_mmu_get_page() can now fail :-(
> 
> ---
>  arch/x86/include/asm/kvm_host.h | 3 ++-
>  arch/x86/kvm/mmu/mmu.c          | 9 +++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  | 4 ++++
>  3 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 09b256db394a..332b9fb3454c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -57,7 +57,8 @@
>  #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
>  #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
>  #define KVM_REQ_TRIPLE_FAULT		KVM_ARCH_REQ(2)
> -#define KVM_REQ_MMU_SYNC		KVM_ARCH_REQ(3)
> +#define KVM_REQ_MMU_SYNC \
> +	KVM_ARCH_REQ_FLAGS(3, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_CLOCK_UPDATE		KVM_ARCH_REQ(4)
>  #define KVM_REQ_LOAD_MMU_PGD		KVM_ARCH_REQ(5)
>  #define KVM_REQ_EVENT			KVM_ARCH_REQ(6)
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4853c033e6ce..03293cd3c7ae 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2143,8 +2143,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>  		}
> 
> -		if (sp->unsync_children)
> -			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> +		if (sp->unsync_children) {
> +			kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);

I don't know the KVM mmu well so I miss something here most likely,
but why to switch to kvm_make_all_cpus_request?

MMU is shared by all VCPUs, and the process of its syncing should also do remote TLB flushes
when needed?


Another thing I don't fully understand is why this patch is needed. If we link an SP which has unsync
children, we raise KVM_REQ_MMU_SYNC, which I think means that *this* vCPU will sync the whole MMU on next
guest entry, including these unsync child SPs. Could you explain this?

I am just curious, and I would like to understand the KVM's mmu better.

Best regards,
	Maxim Levitsky


> +			return NULL;
> +		}
> 
>  		__clear_sp_write_flooding_count(sp);
> 
> @@ -2999,6 +3001,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> 
>  		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
>  				      it.level - 1, true, ACC_ALL);
> +		BUG_ON(!sp);
> 
>  		link_shadow_page(vcpu, it.sptep, sp);
>  		if (fault->is_tdp && fault->huge_page_disallowed &&
> @@ -3383,6 +3386,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
>  	struct kvm_mmu_page *sp;
> 
>  	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
> +	BUG_ON(!sp);
> +
>  	++sp->root_count;
> 
>  	return __pa(sp->spt);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 50ade6450ace..f573d45e2c6f 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -704,6 +704,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  			access = gw->pt_access[it.level - 2];
>  			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
>  					      it.level-1, false, access);
> +			if (!sp)
> +				return RET_PF_RETRY;
>  		}
> 
>  		/*
> @@ -742,6 +744,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  		if (!is_shadow_present_pte(*it.sptep)) {
>  			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
>  					      it.level - 1, true, direct_access);
> +			BUG_ON(!sp);
> +
>  			link_shadow_page(vcpu, it.sptep, sp);
>  			if (fault->huge_page_disallowed &&
>  			    fault->req_level >= it.level)
> --
>

next prev parent reply	other threads:[~2021-09-13 11:31 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
2021-08-24  7:55 ` [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect() Lai Jiangshan
2021-09-02 21:38   ` Sean Christopherson
2021-09-13  9:57   ` Maxim Levitsky
2021-08-24  7:55 ` [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it Lai Jiangshan
2021-09-02 23:40   ` Sean Christopherson
2021-09-02 23:54     ` Sean Christopherson
2021-09-03  0:44       ` Lai Jiangshan
2021-09-03 16:06         ` Sean Christopherson
2021-09-03 16:25           ` Lai Jiangshan
2021-09-03 16:40             ` Sean Christopherson
2021-09-03 17:00               ` Lai Jiangshan
2021-09-03 16:33           ` Lai Jiangshan
2021-09-03  0:51     ` Lai Jiangshan
2021-09-13 11:30     ` Maxim Levitsky [this message]
2021-09-13 20:49       ` Sean Christopherson
2021-09-13 22:31         ` Maxim Levitsky
2021-08-24  7:55 ` [PATCH 3/7] KVM: X86: Zap the invalid list after remote tlb flushing Lai Jiangshan
2021-09-02 21:54   ` Sean Christopherson
2021-08-24  7:55 ` [PATCH 4/7] KVM: X86: Remove FNAME(update_pte) Lai Jiangshan
2021-09-13  9:49   ` Maxim Levitsky
2021-08-24  7:55 ` [PATCH 5/7] KVM: X86: Don't unsync pagetables when speculative Lai Jiangshan
2021-09-13 11:02   ` Maxim Levitsky
2021-09-18  3:06     ` Lai Jiangshan
2021-08-24  7:55 ` [PATCH 6/7] KVM: X86: Don't check unsync if the original spte is writible Lai Jiangshan
2021-08-24  7:55 ` [PATCH 7/7] KVM: X86: Also prefetch the last range in __direct_pte_prefetch() Lai Jiangshan
2021-08-25 15:18   ` Sean Christopherson
2021-08-25 22:58     ` Lai Jiangshan
2021-08-31 18:02 ` [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
2021-08-31 21:57   ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0103c8b2cccea601bd3474f47d982b37e9536921.camel@redhat.com \
    --to=mlevitsk@redhat.com \
    --cc=avi@redhat.com \
    --cc=bp@alien8.de \
    --cc=hpa@zytor.com \
    --cc=jiangshanlai@gmail.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=laijs@linux.alibaba.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).