From: Peter Xu <peterx@redhat.com>
To: Sean Christopherson <seanjc@google.com>
Cc: David Stevens <stevensd@chromium.org>,
Marc Zyngier <maz@kernel.org>,
Oliver Upton <oliver.upton@linux.dev>,
Paolo Bonzini <pbonzini@redhat.com>,
linux-arm-kernel@lists.infradead.org,
kvmarm@lists.cs.columbia.edu, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org
Subject: Re: [PATCH v6 1/4] KVM: mmu: introduce new gfn_to_pfn_noref functions
Date: Wed, 24 May 2023 13:14:06 -0400 [thread overview]
Message-ID: <ZG5F3igFgdIAwrn4@x1n> (raw)
In-Reply-To: <ZG4/VdHu2LqLTlct@google.com>
On Wed, May 24, 2023 at 09:46:13AM -0700, Sean Christopherson wrote:
> On Wed, May 24, 2023, Peter Xu wrote:
> > On Mon, May 22, 2023 at 01:46:41PM -0700, Sean Christopherson wrote:
> > > As for the flags vs. bools debate (see link above), I think the best approach is
> > > a mix of the two. Specifically, reuse the FOLL_* flags as-is for inputs, and use
> > > booleans for outputs. I don't _think_ there are any input bools/flags that don't
> > > map 1:1 with existing FOLL_* flags.
> > >
> > > As a very, *very* rough sketch, provide APIs that look a bit like this.
> >
> > Unifying ref vs nonref cases does look a bit cleaner to me too.
> >
> > >
> > > kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > > {
> > > kvm_pfn_t pfn;
> > >
> > > if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll.mmu_seq))
> >
> > IMHO we may not want to rely on mmu_seq==0 either for unlucky very initial
> > mmu_seq being zero, or avoid overflows?
>
> I was thinking we could initialize mmu_seq to '1' and make it a u64 to avoid
> overflow.
Yeah, that's fine too.
>
> > I'd say we can stick with FOLL_GET in this case to identify ref vs nonref
> > and always assume mmu_seq a pure random number.
>
> The intent of checking mmu_seq is to flag cases where the caller doesn't specify
> FOLL_GET and isn't protected by mmu_invalidate_seq, i.e. isn't tapped into the
> mmu_notifiers. I.e. this is a sanity check, not functionally necessary.
>
> >
> > > return KVM_PFN_ERR_FAULT;
> > >
> > > pfn = ???;
> > >
> > > if (foll->page && !(foll->flags & FOLL_GET))
> > > put_page(foll->page);
> > >
> > > return pfn;
> > > }
> > >
> > > kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
> > > {
> > > struct kvm_follow_pfn foll = {
> > > .flags = FOLL_GET | FOLL_WRITE,
> > > };
> > >
> > > <more stuff here?>
> > >
> > > foll.slot = ???;
> > > if (!foll.slot || foll.slot->flags & KVM_MEMSLOT_INVALID)
> > > return KVM_HVA_ERR_BAD;
> > >
> > > if (memslot_is_readonly(foll.slot))
> > > return KVM_HVA_ERR_RO_BAD;
> > >
> > > return __kvm_follow_pfn(&foll);
> > > }
> > >
> > > and a few partially converted users
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 67e2ac799aa7..5eaf0395ed87 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -550,12 +550,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> > >
> > > if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> > > flush = true;
> > > - kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > > + if (is_refcounted_page_pte(old_spte))
> >
> > One question is how to impl is_refcounted_page_pte() here to identify
> > non-refcountable pages.
>
> KVM would use a software available bit in its PTEs to explicitly track which SPTEs
> point at refcounted pages. E.g. I think bit 59 is available for EPT and 64-bit
> paging. PAE paging doesn't have high available bits, which is why I called out
> that this would have to be 64-bit only.
>
> > IIUC those pages are mostly identical to a normal page (so !PG_reserved)
> > but it has page_ref_count(page)==0 always, am I right? I got that roughly
> > from reading f8be156be1 only though, so I could miss a lot of things..
> >
> > When thinking about that, I'm also wondering whether we can trivially allow
> > kvm to support such mapping (without overhaul of the kvm pfn API) by
> > something like this:
> >
> > ===8<===
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 51e4882d0873..467acbac1a96 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -192,7 +192,13 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
> >
> > page = pfn_to_page(pfn);
> > if (!PageReserved(page))
> > - return page;
> > + /*
> > + * When page_ref_count(page)==0 it might be speical page
> > + * that do not support refcounting. Treating them the same
> > + * as normal reserved (e.g. MMIO) pages by returning NULL,
> > + * so they're exempt of refcounting.
> > + */
> > + return page_ref_count(page) == 0 ? NULL : page;
>
> Heh, because I got burned by this recently, using page_ref_count() is wrong. This
> needs to be page_count() so that tail pages of refcounted compound pages are
> properly identified.
:-D
Actually when I was replying I explicitly didn't use page_count() to make
sure we're reading the tail page, but I just noticed that's exactly the way
how we identify the special page with a PageCompound()==true tail page.
Yeah, if we'd like that it needs to be page_count()==0.
>
> >
> > /* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */
> > if (is_zero_pfn(pfn))
> > ===8<===
> >
> > So that we treat those special pages the same as normal PFNMAP ones by
> > skipping all refcountings on inc/dec. This is based on the fact that kvm
> > should always hold at least 1 ref on a normal page so a normal page should
> > never hit ref==0 here, but again I could miss something somewhere..
>
> This would "work" from a functionality perspective, and might be acceptable as an
> out-of-tree patch to unblock the ChromeOS use case, but I don't want to rely on
> this heuristic on the backend in KVM because it will suppress any and all
> use-after-free bugs in KVM's MMU (see patch 4 of this series). I really want to
> go in the opposite direction and harden KVM against MMU bugs, e.g. I'm planning
> on posting the below (which is how I learned about page_count() vs. page_ref_count()).
>
> Today, KVM gets partial protection from check_new_page_bad(), which detects *some*
> cases where KVM marks a page dirty after the page is freed. But it's racy, and
> the detection occurs well after the fact since it fires only when the page is
> re-allocated.
>
> If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
> because KVM would drop its assertions and also skip dirtying pages, i.e. would
> effectively suppress the latent detection by check_new_page_bad().
So it's probably that I totally have no idea what are the attributes for
those special pages so I don't understand enough on why we need to handle
those pages differently from e.g. PFNMAP pages, and also the benefits.
I think what I can tell is that they're pages that doesn't have
PageCompound bits set on either head or tails, however it's still a
multi-2-order large page. Is there an example on how these pages are used
and allocated? Why would we need those pages, and whether these pages need
to be set dirty/accessed after all?
>
> Author: Sean Christopherson <seanjc@google.com>
> Date: Wed May 17 13:26:54 2023 -0700
>
> KVM: Assert that a page's refcount is elevated when marking accessed/dirty
>
> Assert that a page's refcount is elevated, i.e. that _something_ holds a
> reference to the page, when KVM marks a page as accessed and/or dirty.
> KVM typically doesn't hold a reference to pages that are mapped into the
> guest, e.g. to allow page migration, compaction, swap, etc., and instead
> relies on mmu_notifiers to react to changes in the primary MMU.
>
> Incorrect handling of mmu_notifier events (or similar mechanisms) can
> result in KVM keeping a mapping beyond the lifetime of the backing page,
> i.e. can (and often does) result in use-after-free. Yelling if KVM marks
> a freed page as accessed/dirty doesn't prevent badness as KVM usually
> only does A/D updates when unmapping memory from the guest, i.e. the
> assertion fires well after an underlying bug has occured, but yelling
> does help detect, triage, and debug use-after-free bugs.
>
> Note, the assertion must use page_count(), NOT page_ref_count()! For
> hugepages, the returned struct page may be a tailpage and thus not have
> its own refcount.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1abb331ea68..64f18697096c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2882,6 +2882,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
>
> static bool kvm_is_ad_tracked_page(struct page *page)
> {
> + /*
> + * Assert that KVM isn't attempting to mark a freed page as Accessed or
> + * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
> + * (typically) doesn't pin pages that are mapped in KVM's MMU, and
> + * instead relies on mmu_notifiers to know when a mapping needs to be
> + * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
> + * KVM returns from its mmu_notifier, i.e. the page should have an
> + * elevated refcount at this point even though KVM doesn't hold a
> + * reference of its own.
> + */
> + if (WARN_ON_ONCE(!page_count(page)))
> + return false;
> +
> /*
> * Per page-flags.h, pages tagged PG_reserved "should in general not be
> * touched (e.g. set dirty) except by its owner".
>
This looks like a good thing to have, indeed. But again it doesn't seem
like anything special to the pages we're discussing here, say, !Compound &&
refcount==0 ones.
--
Peter Xu
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: Peter Xu <peterx@redhat.com>
To: Sean Christopherson <seanjc@google.com>
Cc: David Stevens <stevensd@chromium.org>,
Marc Zyngier <maz@kernel.org>,
Oliver Upton <oliver.upton@linux.dev>,
Paolo Bonzini <pbonzini@redhat.com>,
linux-arm-kernel@lists.infradead.org,
kvmarm@lists.cs.columbia.edu, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org
Subject: Re: [PATCH v6 1/4] KVM: mmu: introduce new gfn_to_pfn_noref functions
Date: Wed, 24 May 2023 13:14:06 -0400 [thread overview]
Message-ID: <ZG5F3igFgdIAwrn4@x1n> (raw)
In-Reply-To: <ZG4/VdHu2LqLTlct@google.com>
On Wed, May 24, 2023 at 09:46:13AM -0700, Sean Christopherson wrote:
> On Wed, May 24, 2023, Peter Xu wrote:
> > On Mon, May 22, 2023 at 01:46:41PM -0700, Sean Christopherson wrote:
> > > As for the flags vs. bools debate (see link above), I think the best approach is
> > > a mix of the two. Specifically, reuse the FOLL_* flags as-is for inputs, and use
> > > booleans for outputs. I don't _think_ there are any input bools/flags that don't
> > > map 1:1 with existing FOLL_* flags.
> > >
> > > As a very, *very* rough sketch, provide APIs that look a bit like this.
> >
> > Unifying ref vs nonref cases does look a bit cleaner to me too.
> >
> > >
> > > kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > > {
> > > kvm_pfn_t pfn;
> > >
> > > if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll.mmu_seq))
> >
> > IMHO we may not want to rely on mmu_seq==0 either for unlucky very initial
> > mmu_seq being zero, or avoid overflows?
>
> I was thinking we could initialize mmu_seq to '1' and make it a u64 to avoid
> overflow.
Yeah, that's fine too.
>
> > I'd say we can stick with FOLL_GET in this case to identify ref vs nonref
> > and always assume mmu_seq a pure random number.
>
> The intent of checking mmu_seq is to flag cases where the caller doesn't specify
> FOLL_GET and isn't protected by mmu_invalidate_seq, i.e. isn't tapped into the
> mmu_notifiers. I.e. this is a sanity check, not functionally necessary.
>
> >
> > > return KVM_PFN_ERR_FAULT;
> > >
> > > pfn = ???;
> > >
> > > if (foll->page && !(foll->flags & FOLL_GET))
> > > put_page(foll->page);
> > >
> > > return pfn;
> > > }
> > >
> > > kvm_pfn_t kvm_follow_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct page **page)
> > > {
> > > struct kvm_follow_pfn foll = {
> > > .flags = FOLL_GET | FOLL_WRITE,
> > > };
> > >
> > > <more stuff here?>
> > >
> > > foll.slot = ???;
> > > if (!foll.slot || foll.slot->flags & KVM_MEMSLOT_INVALID)
> > > return KVM_HVA_ERR_BAD;
> > >
> > > if (memslot_is_readonly(foll.slot))
> > > return KVM_HVA_ERR_RO_BAD;
> > >
> > > return __kvm_follow_pfn(&foll);
> > > }
> > >
> > > and a few partially converted users
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 67e2ac799aa7..5eaf0395ed87 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -550,12 +550,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> > >
> > > if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> > > flush = true;
> > > - kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > > + if (is_refcounted_page_pte(old_spte))
> >
> > One question is how to impl is_refcounted_page_pte() here to identify
> > non-refcountable pages.
>
> KVM would use a software available bit in its PTEs to explicitly track which SPTEs
> point at refcounted pages. E.g. I think bit 59 is available for EPT and 64-bit
> paging. PAE paging doesn't have high available bits, which is why I called out
> that this would have to be 64-bit only.
>
> > IIUC those pages are mostly identical to a normal page (so !PG_reserved)
> > but it has page_ref_count(page)==0 always, am I right? I got that roughly
> > from reading f8be156be1 only though, so I could miss a lot of things..
> >
> > When thinking about that, I'm also wondering whether we can trivially allow
> > kvm to support such mapping (without overhaul of the kvm pfn API) by
> > something like this:
> >
> > ===8<===
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 51e4882d0873..467acbac1a96 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -192,7 +192,13 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
> >
> > page = pfn_to_page(pfn);
> > if (!PageReserved(page))
> > - return page;
> > + /*
> > + * When page_ref_count(page)==0 it might be speical page
> > + * that do not support refcounting. Treating them the same
> > + * as normal reserved (e.g. MMIO) pages by returning NULL,
> > + * so they're exempt of refcounting.
> > + */
> > + return page_ref_count(page) == 0 ? NULL : page;
>
> Heh, because I got burned by this recently, using page_ref_count() is wrong. This
> needs to be page_count() so that tail pages of refcounted compound pages are
> properly identified.
:-D
Actually when I was replying I explicitly didn't use page_count() to make
sure we're reading the tail page, but I just noticed that's exactly the way
how we identify the special page with a PageCompound()==true tail page.
Yeah, if we'd like that it needs to be page_count()==0.
>
> >
> > /* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */
> > if (is_zero_pfn(pfn))
> > ===8<===
> >
> > So that we treat those special pages the same as normal PFNMAP ones by
> > skipping all refcountings on inc/dec. This is based on the fact that kvm
> > should always hold at least 1 ref on a normal page so a normal page should
> > never hit ref==0 here, but again I could miss something somewhere..
>
> This would "work" from a functionality perspective, and might be acceptable as an
> out-of-tree patch to unblock the ChromeOS use case, but I don't want to rely on
> this heuristic on the backend in KVM because it will suppress any and all
> use-after-free bugs in KVM's MMU (see patch 4 of this series). I really want to
> go in the opposite direction and harden KVM against MMU bugs, e.g. I'm planning
> on posting the below (which is how I learned about page_count() vs. page_ref_count()).
>
> Today, KVM gets partial protection from check_new_page_bad(), which detects *some*
> cases where KVM marks a page dirty after the page is freed. But it's racy, and
> the detection occurs well after the fact since it fires only when the page is
> re-allocated.
>
> If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
> because KVM would drop its assertions and also skip dirtying pages, i.e. would
> effectively suppress the latent detection by check_new_page_bad().
So it's probably that I totally have no idea what are the attributes for
those special pages so I don't understand enough on why we need to handle
those pages differently from e.g. PFNMAP pages, and also the benefits.
I think what I can tell is that they're pages that doesn't have
PageCompound bits set on either head or tails, however it's still a
multi-2-order large page. Is there an example on how these pages are used
and allocated? Why would we need those pages, and whether these pages need
to be set dirty/accessed after all?
>
> Author: Sean Christopherson <seanjc@google.com>
> Date: Wed May 17 13:26:54 2023 -0700
>
> KVM: Assert that a page's refcount is elevated when marking accessed/dirty
>
> Assert that a page's refcount is elevated, i.e. that _something_ holds a
> reference to the page, when KVM marks a page as accessed and/or dirty.
> KVM typically doesn't hold a reference to pages that are mapped into the
> guest, e.g. to allow page migration, compaction, swap, etc., and instead
> relies on mmu_notifiers to react to changes in the primary MMU.
>
> Incorrect handling of mmu_notifier events (or similar mechanisms) can
> result in KVM keeping a mapping beyond the lifetime of the backing page,
> i.e. can (and often does) result in use-after-free. Yelling if KVM marks
> a freed page as accessed/dirty doesn't prevent badness as KVM usually
> only does A/D updates when unmapping memory from the guest, i.e. the
> assertion fires well after an underlying bug has occured, but yelling
> does help detect, triage, and debug use-after-free bugs.
>
> Note, the assertion must use page_count(), NOT page_ref_count()! For
> hugepages, the returned struct page may be a tailpage and thus not have
> its own refcount.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1abb331ea68..64f18697096c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2882,6 +2882,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
>
> static bool kvm_is_ad_tracked_page(struct page *page)
> {
> + /*
> + * Assert that KVM isn't attempting to mark a freed page as Accessed or
> + * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
> + * (typically) doesn't pin pages that are mapped in KVM's MMU, and
> + * instead relies on mmu_notifiers to know when a mapping needs to be
> + * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
> + * KVM returns from its mmu_notifier, i.e. the page should have an
> + * elevated refcount at this point even though KVM doesn't hold a
> + * reference of its own.
> + */
> + if (WARN_ON_ONCE(!page_count(page)))
> + return false;
> +
> /*
> * Per page-flags.h, pages tagged PG_reserved "should in general not be
> * touched (e.g. set dirty) except by its owner".
>
This looks like a good thing to have, indeed. But again it doesn't seem
like anything special to the pages we're discussing here, say, !Compound &&
refcount==0 ones.
--
Peter Xu
next prev parent reply other threads:[~2023-05-24 17:14 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-30 8:57 [PATCH v6 0/4] KVM: allow mapping non-refcounted pages David Stevens
2023-03-30 8:57 ` David Stevens
2023-03-30 8:57 ` [PATCH v6 1/4] KVM: mmu: introduce new gfn_to_pfn_noref functions David Stevens
2023-03-30 8:57 ` David Stevens
2023-05-22 20:46 ` Sean Christopherson
2023-05-22 20:46 ` Sean Christopherson
2023-05-24 16:22 ` Peter Xu
2023-05-24 16:22 ` Peter Xu
2023-05-24 16:46 ` Sean Christopherson
2023-05-24 16:46 ` Sean Christopherson
2023-05-24 17:14 ` Peter Xu [this message]
2023-05-24 17:14 ` Peter Xu
2023-05-24 18:29 ` Sean Christopherson
2023-05-24 18:29 ` Sean Christopherson
2023-05-24 19:09 ` Peter Xu
2023-05-24 19:09 ` Peter Xu
2023-05-24 20:05 ` Sean Christopherson
2023-05-24 20:05 ` Sean Christopherson
2023-03-30 8:58 ` [PATCH v6 2/4] KVM: x86/mmu: use gfn_to_pfn_noref David Stevens
2023-03-30 8:58 ` David Stevens
2023-03-30 8:58 ` [PATCH v6 3/4] KVM: arm64/mmu: " David Stevens
2023-03-30 8:58 ` David Stevens
2023-03-30 8:58 ` [PATCH v6 4/4] KVM: mmu: remove over-aggressive warnings David Stevens
2023-03-30 8:58 ` David Stevens
2023-05-22 21:55 ` Sean Christopherson
2023-05-22 21:55 ` Sean Christopherson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZG5F3igFgdIAwrn4@x1n \
--to=peterx@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=kvmarm@lists.cs.columbia.edu \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maz@kernel.org \
--cc=oliver.upton@linux.dev \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
--cc=stevensd@chromium.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.