From: Sean Christopherson <seanjc@google.com>
To: Mingwei Zhang <mizhang@google.com>
Cc: Jim Mattson <jmattson@google.com>,
Paolo Bonzini <pbonzini@redhat.com>,
"H. Peter Anvin" <hpa@zytor.com>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Ben Gardon <bgardon@google.com>
Subject: Re: [PATCH] KVM: x86/mmu: Remove KVM MMU write lock when accessing indirect_shadow_pages
Date: Mon, 5 Jun 2023 11:25:02 -0700
Message-ID: <ZH4ofuj0qvKNO9Bz@google.com>
In-Reply-To: <CAL715WJowYL=W40SWmtPoz1F9WVBFDG7TQwbsV2Bwf9-cS77=Q@mail.gmail.com>
On Mon, Jun 05, 2023, Mingwei Zhang wrote:
> On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote:
> >
> > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote:
> > >
> > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when
> > > page role is direct because this counter value is used as a coarse-grained
> > > heuristics to check if there is nested guest active. Racing with this
> > > heuristics without mmu lock will be harmless because the corresponding
> > > indirect shadow sptes for the GPA will either be zapped by this thread or
> > > some other thread who has previously zapped all indirect shadow pages and
> > > makes the value to 0.
> > >
> > > Because of that, remove the KVM MMU write lock pair to potentially reduce
> > > the lock contention and improve the performance of nested VM. In addition
> > > opportunistically change the comment of 'direct mmu' to make the
> > > description consistent with other places.
> > >
> > > Reported-by: Jim Mattson <jmattson@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > > arch/x86/kvm/x86.c | 10 ++--------
> > > 1 file changed, 2 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5ad55ef71433..97cfa5a00ff2 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >
> > > kvm_release_pfn_clean(pfn);
> > >
> > > - /* The instructions are well-emulated on direct mmu. */
> > > + /* The instructions are well-emulated on Direct MMUs. */
> > > if (vcpu->arch.mmu->root_role.direct) {
> > > - unsigned int indirect_shadow_pages;
> > > -
> > > - write_lock(&vcpu->kvm->mmu_lock);
> > > - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> > > - write_unlock(&vcpu->kvm->mmu_lock);
> > > -
> > > - if (indirect_shadow_pages)
> > > + if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
> >
> > I don't understand the need for READ_ONCE() here. That implies that
> > there is something tricky going on, and I don't think that's the case.
>
> READ_ONCE() is just telling the compiler not to remove the read. Since
> this is reading a global variable, the compiler might just read a
> previous copy if the value has already been read into a local
> variable. But that is not the case here...
>
> Note I see there is another READ_ONCE for
> kvm->arch.indirect_shadow_pages, so I am reusing the same thing.

I agree with Jim: using READ_ONCE() doesn't make any sense here. I suspect the
existing use may have been a misguided attempt to force the memory read to be as
close to the write_lock() as possible, e.g. to minimize the chance of a false
negative.

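For reference, a rough userspace approximation of what READ_ONCE() boils down to is sketched below. The names READ_ONCE_SKETCH and has_indirect_shadow_pages_once are hypothetical, the kernel's real macro does additional type checking, and __typeof__ assumes a GNU-compatible compiler. The point is only that the volatile access constrains the compiler and emits no CPU barrier.

```c
#include <stdbool.h>

/*
 * Simplified stand-in for the kernel's READ_ONCE(): a volatile access, so
 * the compiler must perform the load where it is written and may not reuse
 * an earlier copy of the value.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for kvm->arch.indirect_shadow_pages. */
static unsigned int indirect_shadow_pages;

/*
 * No fence instruction is generated for the read below, i.e. the CPU is
 * still free to reorder the load with respect to earlier stores.
 */
static bool has_indirect_shadow_pages_once(void)
{
	return READ_ONCE_SKETCH(indirect_shadow_pages) != 0;
}
```

In other words, the annotation says nothing about how the read is ordered against other memory accesses made by the same task.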
> I did check the reordering issue but it should be fine because when
> 'we' see indirect_shadow_pages as 0, the shadow pages must have
> already been zapped. Not only because of the locking, but also the
> program order in __kvm_mmu_prepare_zap_page() shows that it will zap
> shadow pages first before updating the stats.

I don't think zapping, i.e. the 1=>0 transition, is a concern. KVM is dropping
the SPTE, so racing with kvm_mmu_pte_write() is a non-issue because the guest
will either see the old value, or will fault after the SPTE is zapped, i.e. KVM
won't run with a stale SPTE even if kvm_mmu_pte_write() sees '0' before TLBs are
flushed.

I believe the 0=>1 transition on the other hand does have a *very* theoretical
bug. KVM needs to ensure that either kvm_mmu_pte_write() sees an elevated count,
or that a page fault task sees the updated guest PTE, i.e. the emulated write.
The READ_ONCE() likely serves this purpose in practice, though technically it's
insufficient.

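As a userspace illustration of that "at least one side must see the other's store" requirement, here is a minimal C11 sketch of the pattern. It is not KVM code: guest_pte, shadow_pages, and the function names are hypothetical stand-ins, the seq_cst fence plays the role of smp_mb(), and pthreads are assumed to be available.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins, not KVM's data structures. */
static atomic_int guest_pte;	/* models the guest PTE being written */
static atomic_int shadow_pages;	/* models indirect_shadow_pages */
static int writer_saw_count, faulter_saw_pte;

/* Models the emulated write: store the PTE, full barrier, read the count. */
static void *emulated_write(void *unused)
{
	(void)unused;
	atomic_store_explicit(&guest_pte, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	writer_saw_count = atomic_load_explicit(&shadow_pages,
						memory_order_relaxed);
	return NULL;
}

/* Models the page fault: bump the count, full barrier, re-read the PTE. */
static void *page_fault(void *unused)
{
	(void)unused;
	atomic_store_explicit(&shadow_pages, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	faulter_saw_pte = atomic_load_explicit(&guest_pte,
					       memory_order_relaxed);
	return NULL;
}

/* Returns the number of trials in which both sides missed the other's store. */
static int count_forbidden_outcomes(int trials)
{
	int forbidden = 0;

	for (int i = 0; i < trials; i++) {
		pthread_t a, b;

		atomic_store(&guest_pte, 0);
		atomic_store(&shadow_pages, 0);
		pthread_create(&a, NULL, emulated_write, NULL);
		pthread_create(&b, NULL, page_fault, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (!writer_saw_count && !faulter_saw_pte)
			forbidden++;
	}
	return forbidden;
}
```

With a full fence on both sides, count_forbidden_outcomes() should always return 0. Dropping either fence makes the "both read 0" outcome legal, which is the theoretical hole described above.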
So I think this?
---
arch/x86/kvm/mmu.h | 14 ++++++++++++++
arch/x86/kvm/mmu/mmu.c | 13 ++++++++++++-
arch/x86/kvm/x86.c | 8 +-------
3 files changed, 27 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..9cd105ccb1d4 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -264,6 +264,20 @@ static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 	return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
 }

+static inline bool kvm_mmu_has_indirect_shadow_pages(struct kvm *kvm)
+{
+	/*
+	 * When emulating guest writes, ensure the written value is visible to
+	 * any task that is handling page faults before checking whether or not
+	 * KVM is shadowing a guest PTE.  This ensures either KVM will create
+	 * the correct SPTE in the page fault handler, or this task will see
+	 * a non-zero indirect_shadow_pages.  Pairs with the smp_mb() in
+	 * account_shadowed() and unaccount_shadowed().
+	 */
+	smp_mb();
+	return kvm->arch.indirect_shadow_pages;
+}
+
 static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 {
 	/* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..1735bee3f653 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -830,6 +830,17 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	gfn_t gfn;

 	kvm->arch.indirect_shadow_pages++;
+
+	/*
+	 * Ensure indirect_shadow_pages is elevated prior to re-reading guest
+	 * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
+	 * emulated writes are visible before re-reading guest PTEs, or that
+	 * an emulated write will see the elevated count and acquire mmu_lock
+	 * to update SPTEs.  Pairs with the smp_mb() in
+	 * kvm_mmu_has_indirect_shadow_pages().
+	 */
+	smp_mb();
+
 	gfn = sp->gfn;
 	slots = kvm_memslots_for_spte_role(kvm, sp->role);
 	slot = __gfn_to_memslot(slots, gfn);
@@ -5692,7 +5703,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	 * If we don't have indirect shadow pages, it means no page is
 	 * write-protected, so we can exit simply.
 	 */
-	if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
+	if (!kvm_mmu_has_indirect_shadow_pages(vcpu->kvm))
 		return;

 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index abfba3cae0ba..22c226f5f4f8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8588,13 +8588,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,

 	/* The instructions are well-emulated on direct mmu. */
 	if (vcpu->arch.mmu->root_role.direct) {
-		unsigned int indirect_shadow_pages;
-
-		write_lock(&vcpu->kvm->mmu_lock);
-		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
-		write_unlock(&vcpu->kvm->mmu_lock);
-
-		if (indirect_shadow_pages)
+		if (kvm_mmu_has_indirect_shadow_pages(vcpu->kvm))
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));

 		return true;
base-commit: 69b4e5b82fec7195c79c939ce25789b16a133f3a
--