From: Marcelo Tosatti <mtosatti@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Avi Kivity <avi@redhat.com>, kvm@vger.kernel.org
Subject: Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)
Date: Sun, 19 Apr 2009 14:54:28 -0300
Message-ID: <20090419175428.GB28197@amt.cnet>
In-Reply-To: <20090418153427.GB15228@random.random>
On Sat, Apr 18, 2009 at 05:34:27PM +0200, Andrea Arcangeli wrote:
> On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote:
> > mmu_sync_children needs to find any _reachable_ sptes that are unsync,
> > read the guest pagetable, and sync the sptes. Before it can inspect the
> > guest pagetable, it needs to write protect it, with rmap_write_protect.
>
> So far so good.
>
> > In theory it wouldn't be necessary to call
> > kvm_flush_remote_tlbs_cond(protected) here, but only
> > kvm_flush_remote_tlbs(), since the "kvm->remote_tlbs_dirty" information
> > is not pertinent to mmu_sync_children.
>
> Hmm I'm not sure I fully understand how it is not pertinent.
>
> When we have a cr3 write in a remote vcpu, mmu_sync_children runs and
> it resyncs all sptes reachable for that remote vcpu context. But to
> resync the sptes, it also has to get rid of any old writable tlb entry
> for its remote vcpu where cr3 write is running. Checking only sptes to
> find writable ones, updating the sptes that are mapped by the writable
> sptes, and marking them wrprotected, isn't enough, as old spte
> contents may still be cached in the tlb if remote_tlbs_dirty is true!
>
> Think if the invlpg'd spte that got nuked by rmap_remove in the invlpg
> handler running in the current vcpu has a writable tlb entry active in
> the vcpu that later does cr3 overwrite. mmu_sync_children running in
> the remote vcpu will find no writable spte in the rmap chain
> representing that spte (because that spte that is still cached in the
> remote tlb, has already been zapped by the current vcpu) but it is
> still cached and writable in the remote vcpu TLB cache, when cr3
> overwrite runs.
Right, so you have to cope with the fact that invlpg can skip a TLB
flush. OK.
> > But this is done here to "clear" remote_tlbs_dirty (after a
> > kvm_flush_remote_tlbs remote_tlbs_dirty is clean), ie: as an
> > optimization.
>
> The whole point of remote_tlbs_dirty is to defer any smp tlb flush to
> the latest possible time, so how can it be an optimization to run a
> kvm_flush_remote_tlbs earlier than necessary? The only way this can be
> an optimization, is to never run kvm_flush_remote_tlbs unless
> absolutely necessary, and to leave remote_tlbs_dirty true instead of
> calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond
> instead of kvm_flush_remote_tlbs cannot ever be an optimization.
Right.
> > > >> @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > > >> rmap_remove(vcpu->kvm, sptep);
> > > >> if (is_large_pte(*sptep))
> > > >> --vcpu->kvm->stat.lpages;
> > > >> - need_flush = 1;
> > > >> + vcpu->kvm->remote_tlbs_dirty = true;
> > > >> }
> > > >> set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> > > >> break;
> > > >> @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > > >> break;
> > > >> }
> > > >> - if (need_flush)
> > > >> - kvm_flush_remote_tlbs(vcpu->kvm);
> > > >> spin_unlock(&vcpu->kvm->mmu_lock);
> > >
> > > AFAIK, to be compliant with lowlevel archs (without ASN it doesn't
> > > matter I think as vmx always flush on exit), we have to flush the
> > > local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
> > > don't see why it's missing. Or am I wrong?
> >
> > Caller does it:
> >
> > void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
> > {
> > vcpu->arch.mmu.invlpg(vcpu, gva);
> > kvm_mmu_flush_tlb(vcpu);
> > ++vcpu->stat.invlpg;
> > }
>
> Ah great! Avi also mentioned it, I recall, but I didn't figure out it
> was after FNAME(invlpg) returns. But isn't it always more efficient to
> set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead, like I'm doing?
Sure that works too.
> See my version of kvm_flush_local_tlb, that's a bit different from
> kvm_mmu_flush_tlb, and I made the old no-mmu-notifier kernels safe
> too, which I think is worth it. If you like to keep my version of
> kvm_flush_local_tlb, then I simply have to remove the kvm_mmu_flush_tlb
> from kvm_mmu_invlpg and move it inside the arch.mmu.invlpg handler so
> each shadow implementation does it its own way. Comments?
I'm fine with your kvm_flush_local_tlb. Just one minor nit:
+ /* get new asid before returning to guest mode */
+ if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
+ set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);
What's the test_bit for?
> If you disagree, and you want to run kvm_mmu_flush_tlb in
> kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests),
> then I can expand the kvm_flush_local_tlb with the smp_wmb() in the
> FNAME(invlpg) code. Like:
>
> if (need_flush) {
> smp_wmb();
> remote_tlbs_dirty = true;
> }
> spin_unlock(mmu_lock);
>
> Then the local tlb flush will run when we return from FNAME(invlpg)
> and remote_tlbs_dirty is set _after_ set_shadow_pte and _before_
> releasing mmu_lock, making it still safe against
> mmu_notifier_invalidate_page/range.
>
> > What about protecting remote_tlbs_dirty with mmu_lock? Only caller of
> > kvm_flush_remote_tlbs that lacks mmu_notifier is kvm_mmu_zap_all, which
> > is not performance sensitive anyway.
>
> I thought of it too, I have to say. Tried a bit too, then I figured
> out why Avi wanted to do it out of order. The reason is that we want
> to allow kvm_flush_remote_tlbs to run outside the mmu_lock too. So
> this actually results in more self-containment of the
> remote_tlbs_dirty logic and simpler code. But I documented at least
> what the smp_wmb is serializing and how it works.
OK.
> > > - if (protected)
> > > + /*
> > > + * Avoid leaving stale tlb entries in this vcpu representing
> > > + * sptes rmap_removed by invlpg without vcpu smp tlb flush.
> > > + */
> > > + if (protected || vcpu->kvm->remote_tlbs_dirty)
> > > kvm_flush_remote_tlbs(vcpu->kvm);
> >
> > +void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
> > +{
> > + if (cond || kvm->remote_tlbs_dirty)
> > + kvm_flush_remote_tlbs(kvm);
> > +}
> >
> > Aren't they the same?
>
> I didn't particularly like kvm_flush_remote_tlbs_cond, and because I
> ended up editing the patch by hand in my code to fix it as I
> understood it, I also refactored some bits like this. But if you
> like, I can add it back; I don't really care.
It was nice to hide explicit knowledge about
vcpu->kvm->remote_tlbs_dirty behind the interface instead of exposing
it.
>
> > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > > index 09782a9..060b4a3 100644
> > > --- a/arch/x86/kvm/paging_tmpl.h
> > > +++ b/arch/x86/kvm/paging_tmpl.h
> > > @@ -483,7 +483,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> > > }
> > >
> > > if (need_flush)
> > > - kvm_flush_remote_tlbs(vcpu->kvm);
> > > + kvm_flush_local_tlb(vcpu);
> > > spin_unlock(&vcpu->kvm->mmu_lock);
> >
> > No need, caller does it for us.
>
> Right, but the caller does it differently, as commented above. Clearly
> one of the two has to go ;).
>
> > OOO ? Out Of Options?
>
> Eheeh, I meant out of order.
>
> > If remote_tlbs_dirty is protected by mmu_lock, most of this complexity
> > is unnecessary?
>
> Answered above.
>
> Thanks a lot for review!! My current thought is that we should just
> move the unnecessary ->tlb_flush from the common invlpg handler to the
> other lowlevel .invlpg handler and then I hope this optimization will
> be measurable ;).
Depends on how often the guest uses invlpg, right? I believe it's a
worthwhile optimization.
TIA
Thread overview:
2009-03-19 13:17 [PATCH] KVM: Defer remote tlb flushes on invlpg (v3) Avi Kivity
2009-03-19 13:46 ` Andrew Theurer
2009-03-19 14:03 ` Avi Kivity
2009-03-29 10:36 ` Avi Kivity
2009-04-11 16:48 ` [PATCH] KVM: Defer remote tlb flushes on invlpg (v4) Andrea Arcangeli
2009-04-12 22:31 ` Marcelo Tosatti
2009-04-18 15:34 ` Andrea Arcangeli
2009-04-19 17:54 ` Marcelo Tosatti [this message]
2009-04-20 13:01 ` Andrea Arcangeli