From: Marcelo Tosatti <mtosatti@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Avi Kivity <avi@redhat.com>, kvm@vger.kernel.org
Subject: Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)
Date: Sun, 12 Apr 2009 19:31:58 -0300 [thread overview]
Message-ID: <20090412223158.GA31408@amt.cnet> (raw)
In-Reply-To: <20090411164853.GH1329@random.random>
Hi Andrea,
On Sat, Apr 11, 2009 at 06:48:54PM +0200, Andrea Arcangeli wrote:
> On Sun, Mar 29, 2009 at 01:36:01PM +0300, Avi Kivity wrote:
> > Marcelo, Andrea?
>
> Had to read the code a bit more to understand the reason of the
> unsync_mmu flush in cr3 overwrite.
>
> > Avi Kivity wrote:
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 2a36f7f..f0ea56c 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -1184,8 +1184,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
> >> for_each_sp(pages, sp, parents, i)
> >> protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
> >> - if (protected)
> >> - kvm_flush_remote_tlbs(vcpu->kvm);
> >> + kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);
> >> for_each_sp(pages, sp, parents, i) {
> >> kvm_sync_page(vcpu, sp);
>
> Ok so because we didn't flush the tlb on the other vcpus when invlpg
> run, if cr3 overwrite needs to re-sync sptes wrprotecting them, we've
> to flush the tlb in all vcpus to be sure the possibly writable tlb
> entry reflecting the old writable spte instantiated before invlpg run,
> is removed from the physical cpus. We wouldn't find it in for_each_sp
> because it was rmap_removed, but we'll find something in
> mmu_unsync_walk (right? we definitely have to find something in
> mmu_unsync_walk for this to work, the parent sp have to leave
> child->unsync set even after rmap_remove run in invlpg without
> flushing the other vcpus tlbs).
mmu_sync_children needs to find any _reachable_ sptes that are unsync,
read the guest pagetable, and sync the sptes. Before it can inspect the
guest pagetable, it needs to write-protect it with rmap_write_protect.
In theory it wouldn't be necessary to call
kvm_flush_remote_tlbs_cond(protected) here, only
kvm_flush_remote_tlbs(), since the "kvm->remote_tlbs_dirty" information
is not pertinent to mmu_sync_children.
But it is done here to "clear" remote_tlbs_dirty (after
kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), i.e. as an
optimization.
> >> @@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu,
> >> gva_t gva)
> >> rmap_remove(vcpu->kvm, sptep);
> >> if (is_large_pte(*sptep))
> >> --vcpu->kvm->stat.lpages;
> >> - need_flush = 1;
> >> + vcpu->kvm->remote_tlbs_dirty = true;
> >> }
> >> set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
> >> break;
> >> @@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t
> >> gva)
> >> break;
> >> }
> >> - if (need_flush)
> >> - kvm_flush_remote_tlbs(vcpu->kvm);
> >> spin_unlock(&vcpu->kvm->mmu_lock);
>
> AFAIK, to be compliant with lowlevel archs (without ASN it doesn't
> matter I think as vmx always flush on exit), we have to flush the
> local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I
> don't see why it's missing. Or am I wrong?
Caller does it:
void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
{
	vcpu->arch.mmu.invlpg(vcpu, gva);
	kvm_mmu_flush_tlb(vcpu);
	++vcpu->stat.invlpg;
}
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 68b217e..12afa50 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -758,10 +758,18 @@ static bool make_all_cpus_request(struct kvm *kvm,
> >> unsigned int req)
> >> void kvm_flush_remote_tlbs(struct kvm *kvm)
> >> {
> >> + kvm->remote_tlbs_dirty = false;
> >> + smp_wmb();
>
> Still no lock prefix to the asm insn and here it runs outside the
> mmu_lock, but ok, I tend to agree smp_wmb should be enough to be sure
> the write is fully finished by the time smb_wmb returns. There's
> another problem though.
>
> CPU0 CPU1
> ----------- -------------
> remote_tlbs_dirty = false
> remote_tlbs_dirty = true
> smp_tlb_flush
> set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
>
>
> The flush for the sptep will be lost.
What about protecting remote_tlbs_dirty with mmu_lock? The only caller
of kvm_flush_remote_tlbs that lacks mmu_notifier context is
kvm_mmu_zap_all, which is not performance sensitive anyway.
> >> @@ -907,8 +913,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct
> >> mmu_notifier *mn,
> >> young = kvm_age_hva(kvm, address);
> >> spin_unlock(&kvm->mmu_lock);
> >> - if (young)
> >> - kvm_flush_remote_tlbs(kvm);
> >> + kvm_flush_remote_tlbs_cond(kvm, young);
> >> return young;
> >> }
>
> No need to flush for clear_flush_young method, pages can't be freed
> there.
>
> I mangled over the patch a bit, plus fixed the above smp race, let me
> know what you think.
>
> The the best workload to exercise this is running a VM with lots of
> VCPUs and 8G of ram with a 32bit guest kernel and then just malloc and
> touch a byte for each 4096 page allocated by malloc. That will run a
> flood of invlpg. Then push the system to swap. while :; do cp /dev/hda
> /dev/null; done, also works without O_DIRECT so the host cache make it
> fast at the second run (not so much faster with host swapping though).
>
> I only tested it so far with 12 VM on swap with 64bit kernels with
> heavy I/O so it's not good test as I doubt any invlpg runs, not even
> munmap(addr, 4k) uses invlpg.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index d5bdf3a..900bc31 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1185,7 +1185,11 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
> for_each_sp(pages, sp, parents, i)
> protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
>
> - if (protected)
> + /*
> + * Avoid leaving stale tlb entries in this vcpu representing
> + * sptes rmap_removed by invlpg without vcpu smp tlb flush.
> + */
> + if (protected || vcpu->kvm->remote_tlbs_dirty)
> kvm_flush_remote_tlbs(vcpu->kvm);
+void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
+{
+ if (cond || kvm->remote_tlbs_dirty)
+ kvm_flush_remote_tlbs(kvm);
+}
Aren't they the same?
>
> for_each_sp(pages, sp, parents, i) {
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 09782a9..060b4a3 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -483,7 +483,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> }
>
> if (need_flush)
> - kvm_flush_remote_tlbs(vcpu->kvm);
> + kvm_flush_local_tlb(vcpu);
> spin_unlock(&vcpu->kvm->mmu_lock);
No need, caller does it for us.
> if (pte_gpa == -1)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 095ebb6..731b846 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -125,6 +125,7 @@ struct kvm_kernel_irq_routing_entry {
> struct kvm {
> struct mutex lock; /* protects the vcpus array and APIC accesses */
> spinlock_t mmu_lock;
> + bool remote_tlbs_dirty; /* =true serialized by mmu_lock, =false OOO */
OOO ? Out Of Options?
> +void kvm_flush_local_tlb(struct kvm_vcpu *vcpu)
> +{
> + /*
> + * This will guarantee that our current sptep is flushed from
> + * the tlb too, even if there's a kvm_flush_remote_tlbs
> + * running from under us (so if it's not running under
> + * mmu_lock like in the mmu notifier invocation).
> + */
> + smp_wmb();
> + /*
> + * If the below assignment get lost because of lack of atomic ops
> + * and one or more concurrent writers in kvm_flush_remote_tlbs
> + * we know that any set_shadow_pte preceding kvm_flush_local_tlb
> + * was visible to the other CPU, before KVM_REQ_TLB_FLUSH was set
> + * in each vcpu->requests by kvm_flush_remote_tlbs.
> + */
> + vcpu->kvm->remote_tlbs_dirty = true;
> +
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> + /* get new asid before returning to guest mode */
> + if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
> + set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);
> +#else
> + /*
> + * Without CONFIG_MMU_NOTIFIER enabled this isn't enough but
> + * it will reduce the risk window at least.
> + */
> + kvm_flush_remote_tlbs(vcpu->kvm);
> +#endif
> +}
> +
> void kvm_flush_remote_tlbs(struct kvm *kvm)
> {
> + kvm->remote_tlbs_dirty = false;
> + /*
> + * Guarantee that remote_tlbs_dirty is committed to memory
> + * before setting KVM_REQ_TLB_FLUSH.
> + */
> + smp_wmb();
> if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
> ++kvm->stat.remote_tlb_flush;
> }
If remote_tlbs_dirty is protected by mmu_lock, most of this complexity
is unnecessary?
> @@ -810,6 +847,23 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> return container_of(mn, struct kvm, mmu_notifier);
> }
>
> +static void kvm_mmu_notifier_tlb_flush(int need_tlb_flush, struct kvm *kvm)
> +{
> + /*
> + * We've to flush the tlb before the pages can be freed.
> + *
> + * The important "remote_tlbs_dirty = true" that we can't miss
> + * always runs under mmu_lock before we taken it in the above
> + * spin_lock. Other than this we've to be sure not to set
> + * remote_tlbs_dirty to false inside kvm_flush_remote_tlbs
> + * unless we're going to flush the guest smp tlb for any
> + * relevant spte that has been invalidated just before setting
> + * "remote_tlbs_dirty = true".
> + */
> + if (need_tlb_flush || kvm->remote_tlbs_dirty)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> struct mm_struct *mm,
> unsigned long address)
> @@ -840,10 +894,7 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> need_tlb_flush = kvm_unmap_hva(kvm, address);
> spin_unlock(&kvm->mmu_lock);
>
> - /* we've to flush the tlb before the pages can be freed */
> - if (need_tlb_flush)
> - kvm_flush_remote_tlbs(kvm);
> -
> + kvm_mmu_notifier_tlb_flush(need_tlb_flush, kvm);
> }
>
> static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> @@ -865,9 +916,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> need_tlb_flush |= kvm_unmap_hva(kvm, start);
> spin_unlock(&kvm->mmu_lock);
>
> - /* we've to flush the tlb before the pages can be freed */
> - if (need_tlb_flush)
> - kvm_flush_remote_tlbs(kvm);
> + kvm_mmu_notifier_tlb_flush(need_tlb_flush, kvm);
> }
>
> static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> @@ -888,7 +937,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> * The above sequence increase must be visible before the
> * below count decrease but both values are read by the kvm
> * page fault under mmu_lock spinlock so we don't need to add
> - * a smb_wmb() here in between the two.
> + * a smp_wmb() here in between the two.
> */
> kvm->mmu_notifier_count--;
> spin_unlock(&kvm->mmu_lock);
Thread overview: 9+ messages
2009-03-19 13:17 [PATCH] KVM: Defer remote tlb flushes on invlpg (v3) Avi Kivity
2009-03-19 13:46 ` Andrew Theurer
2009-03-19 14:03 ` Avi Kivity
2009-03-29 10:36 ` Avi Kivity
2009-04-11 16:48 ` [PATCH] KVM: Defer remote tlb flushes on invlpg (v4) Andrea Arcangeli
2009-04-12 22:31 ` Marcelo Tosatti [this message]
2009-04-18 15:34 ` Andrea Arcangeli
2009-04-19 17:54 ` Marcelo Tosatti
2009-04-20 13:01 ` Andrea Arcangeli