From: David Matlack <dmatlack@google.com>
To: Junaid Shahid <junaids@google.com>
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
Subject: Re: [PATCH] kvm: x86: mmu: Always flush TLBs when enabling dirty logging
Date: Mon, 25 Jul 2022 10:27:35 -0700 [thread overview]
Message-ID: <Yt7Sh/aN1dlXN21N@google.com> (raw)
In-Reply-To: <20220723024116.2724796-1-junaids@google.com>
On Fri, Jul 22, 2022 at 07:41:16PM -0700, Junaid Shahid wrote:
> When A/D bits are not available, KVM uses a software access tracking
> mechanism, which involves making the SPTEs inaccessible. However,
> the clear_young() MMU notifier does not flush TLBs. So it is possible
> that there may still be stale, potentially writable, TLB entries.
> This is usually fine, but can be problematic when enabling dirty
> logging, because it currently only does a TLB flush if any SPTEs were
> modified. But if all SPTEs are in access-tracked state, then there
> won't be a TLB flush, which means that the guest could still possibly
> write to memory and not have it reflected in the dirty bitmap.
>
> So just unconditionally flush the TLBs when enabling dirty logging.
> We could also do something more sophisticated if needed, but given
> that a flush almost always happens anyway, so just making it
> unconditional doesn't seem too bad.
>
> Signed-off-by: Junaid Shahid <junaids@google.com>
Wow, nice catch. I have some suggestions to word-smith the comments, but
otherwise:
Reviewed-by: David Matlack <dmatlack@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 28 ++++++++++------------------
> arch/x86/kvm/mmu/spte.h | 9 +++++++--
> arch/x86/kvm/x86.c | 7 +++++++
> 3 files changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 52664c3caaab..f0d7193db455 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6058,27 +6058,23 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> const struct kvm_memory_slot *memslot,
> int start_level)
> {
> - bool flush = false;
> -
> if (kvm_memslots_have_rmaps(kvm)) {
> write_lock(&kvm->mmu_lock);
> - flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
> - start_level, KVM_MAX_HUGEPAGE_LEVEL,
> - false);
> + slot_handle_level(kvm, memslot, slot_rmap_write_protect,
> + start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
> write_unlock(&kvm->mmu_lock);
> }
>
> if (is_tdp_mmu_enabled(kvm)) {
> read_lock(&kvm->mmu_lock);
> - flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
> + kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
> read_unlock(&kvm->mmu_lock);
> }
>
> /*
> - * Flush TLBs if any SPTEs had to be write-protected to ensure that
> - * guest writes are reflected in the dirty bitmap before the memslot
> - * update completes, i.e. before enabling dirty logging is visible to
> - * userspace.
> + * The caller will flush TLBs to ensure that guest writes are reflected
> + * in the dirty bitmap before the memslot update completes, i.e. before
> + * enabling dirty logging is visible to userspace.
> *
> * Perform the TLB flush outside the mmu_lock to reduce the amount of
> * time the lock is held. However, this does mean that another CPU can
> @@ -6097,8 +6093,6 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> *
> * See is_writable_pte() for more details.
> */
> - if (flush)
> - kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> }
>
> static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> @@ -6468,32 +6462,30 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
> void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> const struct kvm_memory_slot *memslot)
> {
> - bool flush = false;
> -
> if (kvm_memslots_have_rmaps(kvm)) {
> write_lock(&kvm->mmu_lock);
> /*
> * Clear dirty bits only on 4k SPTEs since the legacy MMU only
> * support dirty logging at a 4k granularity.
> */
> - flush = slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
> + slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
> write_unlock(&kvm->mmu_lock);
> }
>
> if (is_tdp_mmu_enabled(kvm)) {
> read_lock(&kvm->mmu_lock);
> - flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
> + kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
> read_unlock(&kvm->mmu_lock);
> }
>
> /*
> + * The caller will flush the TLBs after this function returns.
> + *
> * It's also safe to flush TLBs out of mmu lock here as currently this
> * function is only used for dirty logging, in which case flushing TLB
> * out of mmu lock also guarantees no dirty pages will be lost in
> * dirty_bitmap.
> */
> - if (flush)
> - kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> }
>
> void kvm_mmu_zap_all(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index ba3dccb202bc..ec3e79ac4449 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -330,7 +330,7 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
> }
>
> /*
> - * An shadow-present leaf SPTE may be non-writable for 3 possible reasons:
> + * A shadow-present leaf SPTE may be non-writable for 4 possible reasons:
> *
> * 1. To intercept writes for dirty logging. KVM write-protects huge pages
> * so that they can be split be split down into the dirty logging
> @@ -348,6 +348,8 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
> * read-only memslot or guest memory backed by a read-only VMA. Writes to
> * such pages are disallowed entirely.
> *
> + * 4. To track the Accessed status for SPTEs without A/D bits.
> + *
> * To keep track of why a given SPTE is write-protected, KVM uses 2
> * software-only bits in the SPTE:
Drop the "2" now that we're also covering access-tracking here.
> *
> @@ -358,6 +360,8 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
> * shadow_host_writable_mask, aka Host-writable -
> * Cleared on SPTEs that are not host-writable (case 3 above)
> *
> + * In addition, is_acc_track_spte() is true in the case 4 above.
The section describes the PTE bits so I think it'd be useful to
explicilty call out SHADOW_ACC_TRACK_SAVED_MASK here. e.g.
SHADOW_ACC_TRACK_SAVED_MASK, aka access-tracked -
Contains the R/X bits from the SPTE when it is being access-tracked,
otherwise 0. Note that the W-bit is also cleared when an SPTE is being
access-tracked, but it is not saved in the saved-mask. The W-bit is
restored based on the Host/MMU-writable bits when a write is attempted.
> + *
> * Note, not all possible combinations of PT_WRITABLE_MASK,
> * shadow_mmu_writable_mask, and shadow_host_writable_mask are valid. A given
> * SPTE can be in only one of the following states, which map to the
> @@ -378,7 +382,8 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
> * shadow page tables between vCPUs. Write-protecting an SPTE for dirty logging
> * (which does not clear the MMU-writable bit), does not flush TLBs before
> * dropping the lock, as it only needs to synchronize guest writes with the
> - * dirty bitmap.
> + * dirty bitmap. Similarly, the clear_young() MMU notifier also does not flush
> + * TLBs even though the SPTE can become non-writable because of case 4.
The reader here may not be familier with clear_young() and the rest of
this comment does not explain what clear_young() has to do with
access-tracking. So I would suggest tweaking the wording here to make
things a bit more clear:
Write-protecting an SPTE for access-tracking via the clear_young()
MMU notifier also does not flush the TLB under the MMU lock.
> *
> * So, there is the problem: clearing the MMU-writable bit can encounter a
> * write-protected SPTE while CPUs still have writable mappings for that SPTE
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f389691d8c04..8e33e35e4da4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12448,6 +12448,13 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> } else {
> kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
> }
> +
> + /*
> + * Always flush the TLB even if no PTEs were modified above,
> + * because it is possible that there may still be stale writable
> + * TLB entries for non-AD PTEs from a prior clear_young().
s/non-AD/access-tracked/ and s/PTE/SPTE/
> + */
> + kvm_arch_flush_remote_tlbs_memslot(kvm, new);
> }
> }
>
>
> base-commit: a4850b5590d01bf3fb19fda3fc5d433f7382a974
> --
> 2.37.1.359.gd136c6c3e2-goog
>
next prev parent reply other threads:[~2022-07-25 17:27 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-23 2:41 [PATCH] kvm: x86: mmu: Always flush TLBs when enabling dirty logging Junaid Shahid
2022-07-25 17:27 ` David Matlack [this message]
2022-07-25 22:32 ` Sean Christopherson
2022-07-26 17:23 ` Junaid Shahid
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yt7Sh/aN1dlXN21N@google.com \
--to=dmatlack@google.com \
--cc=junaids@google.com \
--cc=kvm@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox