From: Sean Christopherson <seanjc@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: David Stevens <stevensd@chromium.org>,
Marc Zyngier <maz@kernel.org>,
Oliver Upton <oliver.upton@linux.dev>,
Paolo Bonzini <pbonzini@redhat.com>,
linux-arm-kernel@lists.infradead.org,
kvmarm@lists.cs.columbia.edu, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org
Subject: Re: [PATCH v6 1/4] KVM: mmu: introduce new gfn_to_pfn_noref functions
Date: Wed, 24 May 2023 09:46:13 -0700 [thread overview]
Message-ID: <ZG4/VdHu2LqLTlct@google.com> (raw)
In-Reply-To: <ZG45q0xJSnA6NKQN@x1n>
On Wed, May 24, 2023, Peter Xu wrote:
> On Mon, May 22, 2023 at 01:46:41PM -0700, Sean Christopherson wrote:
> > As for the flags vs. bools debate (see link above), I think the best approach is
> > a mix of the two. Specifically, reuse the FOLL_* flags as-is for inputs, and use
> > booleans for outputs. I don't _think_ there are any input bools/flags that don't
> > map 1:1 with existing FOLL_* flags.
> >
> > As a very, *very* rough sketch, provide APIs that look a bit like this.
>
> Unifying ref vs nonref cases does look a bit cleaner to me too.
>
> >
> > kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > {
> > kvm_pfn_t pfn;
> >
> > if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll.mmu_seq))
>
> IMHO we may not want to rely on mmu_seq==0 either for unlucky very initial
> mmu_seq being zero, or avoid overflows?
I was thinking we could initialize mmu_seq to '1' and make it a u64 to avoid
overflow.
> I'd say we can stick with FOLL_GET in this case to identify ref vs nonref
> and always assume mmu_seq a pure random number.
The intent of checking mmu_seq is to flag cases where the caller doesn't specify
FOLL_GET and isn't protected by mmu_invalidate_seq, i.e. isn't tapped into the
mmu_notifiers. I.e. this is a sanity check, not functionally necessary.
>
> > return KVM_PFN_ERR_FAULT;
> >
> > pfn = ???;
> >
> > if (foll->page && !(foll->flags & FOLL_GET))
> > put_page(foll->page);
> >
> > return pfn;
> > }
> >
> > kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
> > {
> > struct kvm_follow_pfn foll = {
> > .flags = FOLL_GET | FOLL_WRITE,
> > };
> >
> > <more stuff here?>
> >
> > foll.slot = ???;
> > if (!foll.slot || foll.slot->flags & KVM_MEMSLOT_INVALID)
> > return KVM_HVA_ERR_BAD;
> >
> > if (memslot_is_readonly(foll.slot))
> > return KVM_HVA_ERR_RO_BAD;
> >
> > return __kvm_follow_pfn(&foll);
> > }
> >
> > and a few partially converted users
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 67e2ac799aa7..5eaf0395ed87 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -550,12 +550,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> >
> > if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> > flush = true;
> > - kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > + if (is_refcounted_page_pte(old_spte))
>
> One question is how to impl is_refcounted_page_pte() here to identify
> non-refcountable pages.
KVM would use a software available bit in its PTEs to explicitly track which SPTEs
point at refcounted pages. E.g. I think bit 59 is available for EPT and 64-bit
paging. PAE paging doesn't have high available bits, which is why I called out
that this would have to be 64-bit only.
> IIUC those pages are mostly identical to a normal page (so !PG_reserved)
> but it has page_ref_count(page)==0 always, am I right? I got that roughly
> from reading f8be156be1 only though, so I could miss a lot of things..
>
> When thinking about that, I'm also wondering whether we can trivially allow
> kvm to support such mapping (without overhaul of the kvm pfn API) by
> something like this:
>
> ===8<===
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 51e4882d0873..467acbac1a96 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -192,7 +192,13 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
>
> page = pfn_to_page(pfn);
> if (!PageReserved(page))
> - return page;
> + /*
> + * When page_ref_count(page)==0 it might be speical page
> + * that do not support refcounting. Treating them the same
> + * as normal reserved (e.g. MMIO) pages by returning NULL,
> + * so they're exempt of refcounting.
> + */
> + return page_ref_count(page) == 0 ? NULL : page;
Heh, because I got burned by this recently, using page_ref_count() is wrong. This
needs to be page_count() so that tail pages of refcounted compound pages are
properly identified.
>
> /* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */
> if (is_zero_pfn(pfn))
> ===8<===
>
> So that we treat those special pages the same as normal PFNMAP ones by
> skipping all refcountings on inc/dec. This is based on the fact that kvm
> should always hold at least 1 ref on a normal page so a normal page should
> never hit ref==0 here, but again I could miss something somewhere..
This would "work" from a functionality perspective, and might be acceptable as an
out-of-tree patch to unblock the ChromeOS use case, but I don't want to rely on
this heuristic on the backend in KVM because it will suppress any and all
use-after-free bugs in KVM's MMU (see patch 4 of this series). I really want to
go in the opposite direction and harden KVM against MMU bugs, e.g. I'm planning
on posting the below (which is how I learned about page_count() vs. page_ref_count()).
Today, KVM gets partial protection from check_new_page_bad(), which detects *some*
cases where KVM marks a page dirty after the page is freed. But it's racy, and
the detection occurs well after the fact since it fires only when the page is
re-allocated.
If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
because KVM would drop its assertions and also skip dirtying pages, i.e. would
effectively suppress the latent detection by check_new_page_bad().
Author: Sean Christopherson <seanjc@google.com>
Date: Wed May 17 13:26:54 2023 -0700
KVM: Assert that a page's refcount is elevated when marking accessed/dirty
Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.
Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free. Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occured, but yelling
does help detect, triage, and debug use-after-free bugs.
Note, the assertion must use page_count(), NOT page_ref_count()! For
hugepages, the returned struct page may be a tailpage and thus not have
its own refcount.
Signed-off-by: Sean Christopherson <seanjc@google.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1abb331ea68..64f18697096c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2882,6 +2882,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
static bool kvm_is_ad_tracked_page(struct page *page)
{
+ /*
+ * Assert that KVM isn't attempting to mark a freed page as Accessed or
+ * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
+ * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+ * instead relies on mmu_notifiers to know when a mapping needs to be
+ * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
+ * KVM returns from its mmu_notifier, i.e. the page should have an
+ * elevated refcount at this point even though KVM doesn't hold a
+ * reference of its own.
+ */
+ if (WARN_ON_ONCE(!page_count(page)))
+ return false;
+
/*
* Per page-flags.h, pages tagged PG_reserved "should in general not be
* touched (e.g. set dirty) except by its owner".
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2023-05-24 16:46 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-30 8:57 [PATCH v6 0/4] KVM: allow mapping non-refcounted pages David Stevens
2023-03-30 8:57 ` [PATCH v6 1/4] KVM: mmu: introduce new gfn_to_pfn_noref functions David Stevens
2023-05-22 20:46 ` Sean Christopherson
2023-05-24 16:22 ` Peter Xu
2023-05-24 16:46 ` Sean Christopherson [this message]
2023-05-24 17:14 ` Peter Xu
2023-05-24 18:29 ` Sean Christopherson
2023-05-24 19:09 ` Peter Xu
2023-05-24 20:05 ` Sean Christopherson
2023-03-30 8:58 ` [PATCH v6 2/4] KVM: x86/mmu: use gfn_to_pfn_noref David Stevens
2023-03-30 8:58 ` [PATCH v6 3/4] KVM: arm64/mmu: " David Stevens
2023-03-30 8:58 ` [PATCH v6 4/4] KVM: mmu: remove over-aggressive warnings David Stevens
2023-05-22 21:55 ` Sean Christopherson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZG4/VdHu2LqLTlct@google.com \
--to=seanjc@google.com \
--cc=kvm@vger.kernel.org \
--cc=kvmarm@lists.cs.columbia.edu \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maz@kernel.org \
--cc=oliver.upton@linux.dev \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=stevensd@chromium.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).