From: Mingwei Zhang <mizhang@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/4] KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level()
Date: Sat, 16 Jul 2022 18:51:58 +0000 [thread overview]
Message-ID: <YtMIvgfsgIPWMgGM@google.com> (raw)
In-Reply-To: <20220715232107.3775620-3-seanjc@google.com>
On Fri, Jul 15, 2022, Sean Christopherson wrote:
> Add a comment to document how host_pfn_mapping_level() can be used safely,
> as the line between safe and dangerous is quite thin. E.g. if KVM were
> to ever support in-place promotion to create huge pages, consuming the
> level is safe if the caller holds mmu_lock and checks that there's an
> existing _leaf_ SPTE, but unsafe if the caller only checks that there's a
> non-leaf SPTE.
>
> Opportunistically tweak the existing comments to explicitly document why
> KVM needs to use READ_ONCE().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 42 +++++++++++++++++++++++++++++++++++-------
> 1 file changed, 35 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bebff1d5acd4..d5b644f3e003 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2919,6 +2919,31 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
> __direct_pte_prefetch(vcpu, sp, sptep);
> }
>
> +/*
> + * Lookup the mapping level for @gfn in the current mm.
> + *
> + * WARNING! Use of host_pfn_mapping_level() requires the caller and the end
> + * consumer to be tied into KVM's handlers for MMU notifier events!
Since calling this function won't cause kernel crash now, I guess we can
remove the warning sign here, but keep the remaining statement since it
is necessary.
> + *
> + * There are several ways to safely use this helper:
> + *
> + * - Check mmu_notifier_retry_hva() after grabbing the mapping level, before
> + * consuming it. In this case, mmu_lock doesn't need to be held during the
> + * lookup, but it does need to be held while checking the MMU notifier.
but it does need to be held while checking the MMU notifier and
consuming the result.
> + *
> + * - Hold mmu_lock AND ensure there is no in-progress MMU notifier invalidation
> + * event for the hva. This can be done by explicit checking the MMU notifier
> + * or by ensuring that KVM already has a valid mapping that covers the hva.
Yes, more specifically, "mmu notifier sequence counter".
> + *
> + * - Do not use the result to install new mappings, e.g. use the host mapping
> + * level only to decide whether or not to zap an entry. In this case, it's
> + * not required to hold mmu_lock (though it's highly likely the caller will
> + * want to hold mmu_lock anyways, e.g. to modify SPTEs).
> + *
> + * Note! The lookup can still race with modifications to host page tables, but
> + * the above "rules" ensure KVM will not _consume_ the result of the walk if a
> + * race with the primary MMU occurs.
> + */
> static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> const struct kvm_memory_slot *slot)
> {
> @@ -2941,16 +2966,19 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> hva = __gfn_to_hva_memslot(slot, gfn);
>
> /*
> - * Lookup the mapping level in the current mm. The information
> - * may become stale soon, but it is safe to use as long as
> - * 1) mmu_notifier_retry was checked after taking mmu_lock, and
> - * 2) mmu_lock is taken now.
> - *
> - * We still need to disable IRQs to prevent concurrent tear down
> - * of page tables.
> + * Disable IRQs to prevent concurrent tear down of host page tables,
> + * e.g. if the primary MMU promotes a P*D to a huge page and then frees
> + * the original page table.
> */
> local_irq_save(flags);
>
> + /*
> + * Read each entry once. As above, a non-leaf entry can be promoted to
> + * a huge page _during_ this walk. Re-reading the entry could send the
> + * walk into the weeks, e.g. p*d_large() returns false (sees the old
> + * value) and then p*d_offset() walks into the target huge page instead
> + * of the old page table (sees the new value).
> + */
> pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
> if (pgd_none(pgd))
> goto out;
> --
> 2.37.0.170.g444d1eabd0-goog
>
next prev parent reply other threads:[~2022-07-16 18:52 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-15 23:21 [PATCH 0/4] Huge page related cleanups Sean Christopherson
2022-07-15 23:21 ` [PATCH 1/4] KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs Sean Christopherson
2022-07-16 5:53 ` Mingwei Zhang
2022-07-15 23:21 ` [PATCH 2/4] KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level() Sean Christopherson
2022-07-16 18:51 ` Mingwei Zhang [this message]
2022-07-18 16:50 ` Sean Christopherson
2022-07-18 20:48 ` Mingwei Zhang
2022-07-15 23:21 ` [PATCH 3/4] KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs Sean Christopherson
2022-07-15 23:21 ` [PATCH 4/4] KVM: selftests: Add an option to run vCPUs while disabling dirty logging Sean Christopherson
2022-07-19 18:02 ` [PATCH 0/4] Huge page related cleanups Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YtMIvgfsgIPWMgGM@google.com \
--to=mizhang@google.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.