From: Mingwei Zhang <mizhang@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/4] KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level()
Date: Sat, 16 Jul 2022 18:51:58 +0000 [thread overview]
Message-ID: <YtMIvgfsgIPWMgGM@google.com> (raw)
In-Reply-To: <20220715232107.3775620-3-seanjc@google.com>
On Fri, Jul 15, 2022, Sean Christopherson wrote:
> Add a comment to document how host_pfn_mapping_level() can be used safely,
> as the line between safe and dangerous is quite thin. E.g. if KVM were
> to ever support in-place promotion to create huge pages, consuming the
> level is safe if the caller holds mmu_lock and checks that there's an
> existing _leaf_ SPTE, but unsafe if the caller only checks that there's a
> non-leaf SPTE.
>
> Opportunistically tweak the existing comments to explicitly document why
> KVM needs to use READ_ONCE().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 42 +++++++++++++++++++++++++++++++++++-------
> 1 file changed, 35 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bebff1d5acd4..d5b644f3e003 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2919,6 +2919,31 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
> __direct_pte_prefetch(vcpu, sp, sptep);
> }
>
> +/*
> + * Lookup the mapping level for @gfn in the current mm.
> + *
> + * WARNING! Use of host_pfn_mapping_level() requires the caller and the end
> + * consumer to be tied into KVM's handlers for MMU notifier events!
Since calling this function won't cause kernel crash now, I guess we can
remove the warning sign here, but keep the remaining statement since it
is necessary.
> + *
> + * There are several ways to safely use this helper:
> + *
> + * - Check mmu_notifier_retry_hva() after grabbing the mapping level, before
> + * consuming it. In this case, mmu_lock doesn't need to be held during the
> + * lookup, but it does need to be held while checking the MMU notifier.
but it does need to be held while checking the MMU notifier and
consuming the result.
> + *
> + * - Hold mmu_lock AND ensure there is no in-progress MMU notifier invalidation
> + * event for the hva. This can be done by explicit checking the MMU notifier
> + * or by ensuring that KVM already has a valid mapping that covers the hva.
Yes, more specifically, "mmu notifier sequence counter".
> + *
> + * - Do not use the result to install new mappings, e.g. use the host mapping
> + * level only to decide whether or not to zap an entry. In this case, it's
> + * not required to hold mmu_lock (though it's highly likely the caller will
> + * want to hold mmu_lock anyways, e.g. to modify SPTEs).
> + *
> + * Note! The lookup can still race with modifications to host page tables, but
> + * the above "rules" ensure KVM will not _consume_ the result of the walk if a
> + * race with the primary MMU occurs.
> + */
> static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> const struct kvm_memory_slot *slot)
> {
> @@ -2941,16 +2966,19 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> hva = __gfn_to_hva_memslot(slot, gfn);
>
> /*
> - * Lookup the mapping level in the current mm. The information
> - * may become stale soon, but it is safe to use as long as
> - * 1) mmu_notifier_retry was checked after taking mmu_lock, and
> - * 2) mmu_lock is taken now.
> - *
> - * We still need to disable IRQs to prevent concurrent tear down
> - * of page tables.
> + * Disable IRQs to prevent concurrent tear down of host page tables,
> + * e.g. if the primary MMU promotes a P*D to a huge page and then frees
> + * the original page table.
> */
> local_irq_save(flags);
>
> + /*
> + * Read each entry once. As above, a non-leaf entry can be promoted to
> + * a huge page _during_ this walk. Re-reading the entry could send the
> + * walk into the weeks, e.g. p*d_large() returns false (sees the old
> + * value) and then p*d_offset() walks into the target huge page instead
> + * of the old page table (sees the new value).
> + */
> pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
> if (pgd_none(pgd))
> goto out;
> --
> 2.37.0.170.g444d1eabd0-goog
>
next prev parent reply other threads:[~2022-07-16 18:52 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-15 23:21 [PATCH 0/4] Huge page related cleanups Sean Christopherson
2022-07-15 23:21 ` [PATCH 1/4] KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs Sean Christopherson
2022-07-16 5:53 ` Mingwei Zhang
2022-07-15 23:21 ` [PATCH 2/4] KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level() Sean Christopherson
2022-07-16 18:51 ` Mingwei Zhang [this message]
2022-07-18 16:50 ` Sean Christopherson
2022-07-18 20:48 ` Mingwei Zhang
2022-07-15 23:21 ` [PATCH 3/4] KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs Sean Christopherson
2022-07-15 23:21 ` [PATCH 4/4] KVM: selftests: Add an option to run vCPUs while disabling dirty logging Sean Christopherson
2022-07-19 18:02 ` [PATCH 0/4] Huge page related cleanups Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YtMIvgfsgIPWMgGM@google.com \
--to=mizhang@google.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox