Re: [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: David Matlack <dmatlack@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org
Subject: Re: [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery
Date: Thu, 22 Aug 2024 13:11:48 -0700	[thread overview]
Message-ID: <ZsebhCLozcv1yyZ8@google.com> (raw)
In-Reply-To: <CALzav=d18V=T=QVbOtiLG1Y7rmG-8B31gdjnbrfGkzC2k3FPVQ@mail.gmail.com>

On Thu, Aug 22, 2024, David Matlack wrote:
> On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Aug 05, 2024, David Matlack wrote:
> > > Recheck the iter.old_spte still points to a page table when recovering
> > > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > > another CPU in between stepping down and back up.
> > >
> > > This could avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > > recovery worker, or vCPUs taking faults on the huge page region).
> > >
> > > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > > thus can see a different value, which is not immediately obvious when
> > > reading the code.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 07d5363c9db7..bdc7fd476721 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> > >               while (max_mapping_level > iter.level)
> > >                       tdp_iter_step_up(&iter);
> > >
> > > +             /*
> > > +              * Re-check that iter.old_spte still points to a page table.
> > > +              * Since mmu_lock is held for read and tdp_iter_step_up()
> > > +              * re-reads iter.sptep, it's possible the SPTE was zapped or
> > > +              * recovered by another CPU in between stepping down and
> > > +              * stepping back up.
> > > +              */
> > > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > > +                 is_last_spte(iter.old_spte, iter.level))
> > > +                     continue;
> >
> > This is the part of the step-up logic that I do not like.  Even this check doesn't
> > guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> > used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
> > could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
> > only because of several very subtle mechanisms.
> 
> I'm not sure why that matters. The only thing that matters is that the
> GFN->PFN and permissions cannot change, and that is guaranteed by
> holding mmu_lock for read.

Because it introduces possible edge cases, that while benign, require the reader
to think about and reason through.  E.g. if _this_ task is trying to replace a
4KiB page with a 1GiB page, what happens if some other task replaces the parent
2MiB page?  It's not immediately obvious that looping on tdp_iter_step_up() will
happily blaze past a !PRESENT SPTE, which might not even be the actual SPTE in
the tree at this point.

> At the end of the day, we never actually care about the value of the
> SPTE we are replacing. We only care that it's a non-leaf SPTE.
>
> > kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> > in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
> > requires knowing/remembering that KVM disallows huge pages when a gfn is write-
> > tracked, and relies on that never changing (which is a fairly safe bet, but the
> > behavior isn't fully set in stone).
> > not set.
> >
> > And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> > mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> > return until all relevant leaf SPTEs have been zapped.
> 
> As you point out in the next paragraph there could be an inflight invalidate
> that yielded, no?

Yeah, ignore this, I forgot to amend it after remembering that invalidation can
yield.

     prev parent reply	other threads:[~2024-08-22 20:11 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
2024-08-05 23:31 ` [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs" David Matlack
2024-08-16 22:47   ` Sean Christopherson
2024-08-16 23:35     ` David Matlack
2024-08-22 15:10       ` Sean Christopherson
2024-08-05 23:31 ` [PATCH 2/7] KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level() David Matlack
2024-08-05 23:31 ` [PATCH 3/7] KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs David Matlack
2024-08-05 23:31 ` [PATCH 4/7] KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping David Matlack
2024-08-05 23:31 ` [PATCH 5/7] KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte() David Matlack
2024-08-05 23:31 ` [PATCH 6/7] KVM: x86/mmu: WARN if huge page recovery triggered during dirty logging David Matlack
2024-08-05 23:31 ` [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery David Matlack
2024-08-22 16:50   ` Sean Christopherson
2024-08-22 18:35     ` David Matlack
2024-08-22 20:11       ` Sean Christopherson [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZsebhCLozcv1yyZ8@google.com \
    --to=seanjc@google.com \
    --cc=dmatlack@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.