From: Sean Christopherson <seanjc@google.com>
To: James Houghton <jthoughton@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	 Oliver Upton <oliver.upton@linux.dev>,
	Marc Zyngier <maz@kernel.org>, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
Date: Wed, 4 Sep 2024 08:48:39 -0700
Message-ID: <ZtiBJydMqwkTaoOM@google.com>
In-Reply-To: <CADrL8HWizVVmTvbxpFbVN9M7YhRpJYpVnhK_O7XKYMdy1=DBHw@mail.gmail.com>

On Tue, Sep 03, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > DO NOT MERGE, yet...
> >
> > Cc: James Houghton <jthoughton@google.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 59 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 48e8608c2738..9df6b465de06 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -995,13 +995,11 @@ static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> >   * locking is the same, but the caller is disallowed from modifying the rmap,
> >   * and so the unlock flow is a nop if the rmap is/was empty.
> >   */
> > -__maybe_unused
> >  static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
> >  {
> >         return __kvm_rmap_lock(rmap_head);
> >  }
> >
> > -__maybe_unused
> >  static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
> >                                      unsigned long old_val)
> >  {
> > @@ -1743,8 +1741,53 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> >         __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
> >  }
> >
> > -static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > -                                  struct kvm_gfn_range *range, bool test_only)
> > +static bool kvm_rmap_age_gfn_range_lockless(struct kvm *kvm,
> > +                                           struct kvm_gfn_range *range,
> > +                                           bool test_only)
> > +{
> > +       struct kvm_rmap_head *rmap_head;
> > +       struct rmap_iterator iter;
> > +       unsigned long rmap_val;
> > +       bool young = false;
> > +       u64 *sptep;
> > +       gfn_t gfn;
> > +       int level;
> > +       u64 spte;
> > +
> > +       for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +               for (gfn = range->start; gfn < range->end;
> > +                    gfn += KVM_PAGES_PER_HPAGE(level)) {
> > +                       rmap_head = gfn_to_rmap(gfn, level, range->slot);
> > +                       rmap_val = kvm_rmap_lock_readonly(rmap_head);
> > +
> > +                       for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
> > +                               if (!is_accessed_spte(spte))
> > +                                       continue;
> > +
> > +                               if (test_only) {
> > +                                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> > +                                       return true;
> > +                               }
> > +
> > +                               /*
> > +                                * Marking SPTEs for access tracking outside of
> > +                                * mmu_lock is unsupported.  Report the page as
> > +                                * young, but otherwise leave it as-is.
> 
> Just for my own understanding, what's the main reason why it's unsafe

Note, I specifically said "unsupported", not "unsafe" :-D

> to mark PTEs for access tracking outside the mmu_lock?

It probably can be done safely?  The main issue is that marking the SPTE for
access tracking can also clear the Writable bit, and so we'd need to audit all
the flows that consume is_writable_pte().  Hmm, actually, that's less scary than
it first seems, because thanks to kvm_mmu_notifier_clear_young(), KVM already
clears the Writable bit in AD-disabled SPTEs without a TLB flush.  E.g.
mmu_spte_update() specifically looks at MMU-writable, not the Writable bit, when
deciding if a TLB flush is required.
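
For what it's worth, the gist of that distinction, as a rough sketch with a
hypothetical helper name (not the actual mmu_spte_update() body):

/*
 * Illustrative only: the TLB flush decision keys off the software
 * MMU-writable bit, so clearing only the hardware Writable bit, as
 * access tracking does, doesn't by itself demand a flush.
 */
static bool spte_clear_needs_flush(u64 old_spte, u64 new_spte)
{
	/* Revoking host writability (MMU-writable) requires a flush. */
	if (is_mmu_writable_spte(old_spte) &&
	    !is_mmu_writable_spte(new_spte))
		return true;

	/* Dropping only the hardware Writable bit does not. */
	return false;
}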

On a related note, what I missed is that KVM would need to treat leaf SPTEs as
volatile at all times, as your MGLRU series modified
kvm_tdp_mmu_spte_need_atomic_write(), not the common spte_has_volatile_bits().

Actually, on second thought, maybe it isn't necessary for the AD-enabled case.
Effectively clobbering the Accessed bit is completely fine, as aging is tolerant
of false negatives and false positives, so long as they aren't excessive.  And
that's doubly true if KVM x86 follows MM and doesn't force a TLB flush[1].

Oooh, and triply true if KVM stops marking the folio accessed when zapping SPTEs[2].

So yeah, after thinking through all of the moving parts, maybe we should commit
to aging AD-disabled SPTEs out of mmu_lock.  AD-disabled leaf SPTEs do end up being
"special", because KVM needs to ensure it doesn't clobber the Writable bit, i.e.
AD-disabled leaf SPTEs need to be treated as volatile at all times.  But in practice,
forcing an atomic update for all AD-disabled leaf SPTEs probably doesn't actually
change much, because in most cases KVM is probably using an atomic access anyways,
e.g. because KVM is clearing the Writable bit and the Writable bit is already volatile.
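
A minimal sketch of what that could look like (an assumption about the eventual
shape, not a posted patch; the AD-enabled checks are abridged):

bool spte_has_volatile_bits(u64 spte)
{
	/*
	 * If AD-disabled SPTEs can be aged without holding mmu_lock,
	 * their Accessed and Writable bits can change underneath any
	 * walker, i.e. AD-disabled leaf SPTEs are always volatile.
	 */
	if (!spte_ad_enabled(spte))
		return true;

	/* Existing checks for hardware-set Accessed/Dirty bits. */
	return !(spte & shadow_accessed_mask) ||
	       (is_writable_pte(spte) && !(spte & shadow_dirty_mask));
}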

FWIW, marking the folio dirty if the SPTE was writable, as is done today in
mmu_spte_age(), is sketchy if mmu_lock isn't held, but probably ok since this is
invoked from an mmu_notifier and presumably the caller holds a reference to the
page/folio.  But that's largely a moot point since I want to yank out that code
anyways[3].

[1] https://lore.kernel.org/all/ZsS_OmxwFzrqDcfY@google.com
[2] https://lore.kernel.org/all/20240726235234.228822-82-seanjc@google.com
[3] https://lore.kernel.org/all/20240726235234.228822-8-seanjc@google.com

> > +                               if (spte_ad_enabled(spte))
> > +                                       clear_bit((ffs(shadow_accessed_mask) - 1),
> > +                                                 (unsigned long *)sptep);
> 
> I feel like it'd be kinda nice to de-duplicate this clear_bit() piece
> with the one in kvm_rmap_age_gfn_range().

Ya, definitely no argument against adding a helper.
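
E.g. a possible shape for the helper (hypothetical name), shared by the
lockless and mmu_lock-holding aging paths:

/*
 * Clear the Accessed bit atomically so that concurrent updates to other
 * bits in the SPTE aren't clobbered.
 */
static void kvm_rmap_clear_accessed_bit(u64 *sptep)
{
	clear_bit((ffs(shadow_accessed_mask) - 1), (unsigned long *)sptep);
}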

> > +                               young = true;
> > +                       }
> > +
> > +                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> > +               }
> > +       }
> > +       return young;
> > +}
> > +
> > +static bool __kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                    struct kvm_gfn_range *range, bool test_only)
> >  {
> >         struct slot_rmap_walk_iterator iterator;
> >         struct rmap_iterator iter;
> > @@ -1783,6 +1826,18 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> >         return young;
> >  }
> >
> > +
> > +static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                  struct kvm_gfn_range *range, bool test_only)
> > +{
> > +       /* FIXME: This also needs to be guarded with something like range->fast_only. */
> > +       if (kvm_ad_enabled())
> 
> I expect this to be something like `if (kvm_ad_enabled() ||
> range->fast_only)`. With MGLRU, that means the pages will always be the last
> candidates for eviction, though it is still possible for them to be evicted
> (though I think this would basically never happen). I think this is fine.
> 
> I think the only other possible choice is to record/return 'not young'/false
> instead of 'young'/true if the spte is young but !spte_ad_enabled(). That
> doesn't seem to be obviously better, though we *will* get correct age
> information at eviction time, when !range->fast_only, at which point the page
> will not be evicted, and Accessed will be cleared.

As above, I think the simpler solution overall is to support aging AD-disabled
SPTEs out of mmu_lock.  The sequence of getting to that end state will be more
complex, but most of that complexity is going to happen irrespective of this series.
And it would mean KVM MGLRU support has no chance of landing in 6.12, but again
I think that's the reality either way.
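
For reference, a sketch of the guard being discussed (hypothetical, as
range->fast_only doesn't exist yet in this series, and the final form depends
on whether AD-disabled aging outside mmu_lock is supported):

static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
				   struct kvm_gfn_range *range, bool test_only)
{
	/*
	 * Use the lockless walk when A/D bits are enabled, or when the
	 * caller only wants a fast, best-effort answer.
	 */
	if (kvm_ad_enabled() || range->fast_only)
		return kvm_rmap_age_gfn_range_lockless(kvm, range, test_only);

	return __kvm_rmap_age_gfn_range(kvm, range, test_only);
}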

Thread overview: 50+ messages
2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
2024-08-09 19:43 ` [PATCH 01/22] KVM: selftests: Check for a potential unhandled exception iff KVM_RUN succeeded Sean Christopherson
2024-08-09 19:43 ` [PATCH 02/22] KVM: selftests: Rename max_guest_memory_test to mmu_stress_test Sean Christopherson
2024-08-09 19:43 ` [PATCH 03/22] KVM: selftests: Only muck with SREGS on x86 in mmu_stress_test Sean Christopherson
2024-08-09 19:43 ` [PATCH 04/22] KVM: selftests: Compute number of extra pages needed " Sean Christopherson
2024-09-06  0:03   ` James Houghton
2024-09-06  4:42     ` Sean Christopherson
2024-09-06 18:25       ` James Houghton
2024-08-09 19:43 ` [PATCH 05/22] KVM: selftests: Enable mmu_stress_test on arm64 Sean Christopherson
2024-08-09 19:43 ` [PATCH 06/22] KVM: selftests: Use vcpu_arch_put_guest() in mmu_stress_test Sean Christopherson
2024-08-09 19:43 ` [PATCH 07/22] KVM: selftests: Precisely limit the number of guest loops " Sean Christopherson
2024-08-09 19:43 ` [PATCH 08/22] KVM: selftests: Add a read-only mprotect() phase to mmu_stress_test Sean Christopherson
2024-08-09 19:43 ` [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ) Sean Christopherson
     [not found]   ` <CADrL8HXcD--jn1iLeCJycCd3Btv4_rBPxz6NMnTREXfeh0vRZA@mail.gmail.com>
2024-09-07  0:53     ` Sean Christopherson
2024-09-07  2:26       ` James Houghton
2024-09-09 15:49         ` Sean Christopherson
2024-08-09 19:43 ` [PATCH 10/22] KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() Sean Christopherson
2024-08-28 18:20   ` Sean Christopherson
2024-08-09 19:43 ` [PATCH 11/22] KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() Sean Christopherson
2024-08-09 19:43 ` [PATCH 12/22] KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot Sean Christopherson
2024-08-09 19:43 ` [PATCH 13/22] KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed Sean Christopherson
2024-08-09 19:43 ` [PATCH 14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper Sean Christopherson
2024-08-12 21:53   ` David Matlack
2024-08-12 21:58     ` David Matlack
2024-08-09 19:43 ` [PATCH 15/22] KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() Sean Christopherson
2024-08-09 19:43 ` [PATCH 16/22] KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals Sean Christopherson
2024-08-09 19:43 ` [PATCH 17/22] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock Sean Christopherson
2024-08-09 19:43 ` [PATCH 18/22] KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent Sean Christopherson
2024-08-09 19:43 ` [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock Sean Christopherson
2024-08-12  8:39   ` Lai Jiangshan
2024-08-12 15:22     ` Sean Christopherson
2024-08-13  6:35       ` Lai Jiangshan
2024-08-13 15:19         ` Sean Christopherson
2024-09-03 20:17   ` James Houghton
2024-09-03 21:27     ` Sean Christopherson
2024-09-03 21:40       ` James Houghton
2024-09-09 19:00   ` James Houghton
2024-09-09 20:02     ` Sean Christopherson
2024-09-09 20:46       ` James Houghton
2024-09-09 22:02         ` Sean Christopherson
2024-09-10  0:28           ` James Houghton
2024-09-10  1:42             ` Sean Christopherson
2024-09-10 21:11               ` James Houghton
2024-09-11 15:33                 ` Sean Christopherson
2024-08-09 19:43 ` [PATCH 20/22] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs Sean Christopherson
2024-08-09 19:43 ` [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns Sean Christopherson
2024-09-03 20:36   ` James Houghton
2024-09-04 15:48     ` Sean Christopherson [this message]
2024-08-09 19:43 ` [PATCH 22/22] ***HACK*** KVM: x86: Don't take " Sean Christopherson
2024-09-10  4:56 ` [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
