From: Sean Christopherson <seanjc@google.com>
To: James Houghton <jthoughton@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Ankit Agrawal <ankita@nvidia.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Catalin Marinas <catalin.marinas@arm.com>,
David Matlack <dmatlack@google.com>,
David Rientjes <rientjes@google.com>,
James Morse <james.morse@arm.com>,
Jason Gunthorpe <jgg@ziepe.ca>, Jonathan Corbet <corbet@lwn.net>,
Marc Zyngier <maz@kernel.org>,
Oliver Upton <oliver.upton@linux.dev>,
Raghavendra Rao Ananta <rananta@google.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Shaoqin Huang <shahuang@redhat.com>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
Wei Xu <weixugc@google.com>, Will Deacon <will@kernel.org>,
Yu Zhao <yuzhao@google.com>, Zenghui Yu <yuzenghui@huawei.com>,
kvmarm@lists.linux.dev, kvm@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
Date: Fri, 16 Aug 2024 18:05:26 -0700 [thread overview]
Message-ID: <Zr_3Vohvzt0KmFiN@google.com> (raw)
In-Reply-To: <20240724011037.3671523-3-jthoughton@google.com>
On Wed, Jul 24, 2024, James Houghton wrote:
> Walk the TDP MMU in an RCU read-side critical section.
...without holding mmu_lock, while doing xxx. There are a lot of TDP MMU walks,
pand they all need RCU protection.
> This requires a way to do RCU-safe walking of the tdp_mmu_roots; do this with
> a new macro. The PTE modifications are now done atomically, and
> kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the fact
> that kvm_age_gfn can now lockless update the accessed bit and the R/X bits).
>
> If the cmpxchg for marking the spte for access tracking fails, we simply
> retry if the spte is still a leaf PTE. If it isn't, we return false
> to continue the walk.
Please avoid pronouns. E.g. s/we/KVM (and adjust grammar as needed), so that
it's clear what actor in particular is doing the retry.
> Harvesting age information from the shadow MMU is still done while
> holding the MMU write lock.
>
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/mmu/mmu.c | 10 ++++-
> arch/x86/kvm/mmu/tdp_iter.h | 27 +++++++------
> arch/x86/kvm/mmu/tdp_mmu.c | 67 +++++++++++++++++++++++++--------
> 5 files changed, 77 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 950a03e0181e..096988262005 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1456,6 +1456,7 @@ struct kvm_arch {
> * tdp_mmu_page set.
> *
> * For reads, this list is protected by:
> + * RCU alone or
> * the MMU lock in read mode + RCU or
> * the MMU lock in write mode
> *
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 4287a8071a3a..6ac43074c5e9 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -23,6 +23,7 @@ config KVM
> depends on X86_LOCAL_APIC
> select KVM_COMMON
> select KVM_GENERIC_MMU_NOTIFIER
> + select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> select HAVE_KVM_IRQCHIP
> select HAVE_KVM_PFNCACHE
> select HAVE_KVM_DIRTY_RING_TSO
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 901be9e420a4..7b93ce8f0680 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1633,8 +1633,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool young = false;
>
> - if (kvm_memslots_have_rmaps(kvm))
> + if (kvm_memslots_have_rmaps(kvm)) {
> + write_lock(&kvm->mmu_lock);
> young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> + write_unlock(&kvm->mmu_lock);
> + }
>
> if (tdp_mmu_enabled)
> young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> @@ -1646,8 +1649,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool young = false;
>
> - if (kvm_memslots_have_rmaps(kvm))
> + if (kvm_memslots_have_rmaps(kvm)) {
> + write_lock(&kvm->mmu_lock);
> young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> + write_unlock(&kvm->mmu_lock);
> + }
>
> if (tdp_mmu_enabled)
> young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 2880fd392e0c..510936a8455a 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
> return xchg(rcu_dereference(sptep), new_spte);
> }
>
> +static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
> +{
> + atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> +
> + return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> +}
> +
> static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> {
> KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
> @@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> }
>
> /*
> - * SPTEs must be modified atomically if they are shadow-present, leaf
> - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> - * of mmu_lock. The Writable bit can be set by KVM's fast page fault
> - * handler, and Accessed and Dirty bits can be set by the CPU.
> + * SPTEs must be modified atomically if they have bits that can be set outside
> + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> + * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
> + * cleared by age_gfn_range.
> *
> * Note, non-leaf SPTEs do have Accessed bits and those bits are
> * technically volatile, but KVM doesn't consume the Accessed bit of
> @@ -46,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
> {
> return is_shadow_present_pte(old_spte) &&
> - is_last_spte(old_spte, level) &&
> - spte_has_volatile_bits(old_spte);
> + is_last_spte(old_spte, level);
> }
>
> static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> @@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
> u64 mask, int level)
> {
> - atomic64_t *sptep_atomic;
> -
> - if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
> - sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> - return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> - }
> + if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
> + return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
>
> __kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
> return old_spte;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c7dc49ee7388..3f13b2db53de 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
>
> return true;
> }
> +static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void)
> +{
> + WARN_ON_ONCE(!rcu_read_lock_held());
> + return true;
> +}
I doubt KVM needs a manual WARN, the RCU deference stuff should yell loudly if
something is missing an rcu_read_lock().
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> {
> @@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> ((_only_valid) && (_root)->role.invalid))) { \
> } else
>
> +/*
> + * Iterate over all TDP MMU roots in an RCU read-side critical section.
> + */
> +#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id) \
> + list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link) \
This should just process valid roots:
https://lore.kernel.org/all/20240801183453.57199-7-seanjc@google.com
> + if (kvm_lockdep_assert_rcu_read_lock_held() && \
> + (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) { \
> + } else
> +
> #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
>
> @@ -1224,6 +1238,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
> return ret;
> }
>
> +static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
> + struct kvm *kvm,
> + struct kvm_gfn_range *range,
> + tdp_handler_t handler)
Please burn all the Google3 from your brain, and code ;-)
> + struct kvm_mmu_page *root;
> + struct tdp_iter iter;
> + bool ret = false;
> +
> + rcu_read_lock();
> +
> + for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
> + tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> + ret |= handler(kvm, &iter, range);
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> /*
> * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
> * if any of the GFNs in the range have been accessed.
> @@ -1237,28 +1272,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
> {
> u64 new_spte;
>
> +retry:
> /* If we have a non-accessed entry we don't need to change the pte. */
> if (!is_accessed_spte(iter->old_spte))
> return false;
>
> if (spte_ad_enabled(iter->old_spte)) {
> - iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> - iter->old_spte,
> - shadow_accessed_mask,
> - iter->level);
> + iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> + shadow_accessed_mask);
> new_spte = iter->old_spte & ~shadow_accessed_mask;
> } else {
> - /*
> - * Capture the dirty status of the page, so that it doesn't get
> - * lost when the SPTE is marked for access tracking.
> - */
> + new_spte = mark_spte_for_access_track(iter->old_spte);
> + if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
> + /*
> + * The cmpxchg failed. If the spte is still a
> + * last-level spte, we can safely retry.
> + */
> + if (is_shadow_present_pte(iter->old_spte) &&
> + is_last_spte(iter->old_spte, iter->level))
> + goto retry;
Do we have a feel for how often conflicts actually happen? I.e. is it worth
retrying and having to worry about infinite loops, however improbable they may
be?
next prev parent reply other threads:[~2024-08-17 1:05 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-24 1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
2024-07-24 1:10 ` [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM James Houghton
2024-07-25 16:39 ` David Matlack
2024-07-26 0:28 ` James Houghton
2024-07-24 1:10 ` [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
2024-07-25 18:07 ` David Matlack
2024-07-26 0:34 ` James Houghton
2024-08-17 1:05 ` Sean Christopherson [this message]
2024-08-30 0:35 ` James Houghton
2024-08-30 3:47 ` Sean Christopherson
2024-08-30 12:47 ` Jason Gunthorpe
2024-08-30 17:09 ` Sean Christopherson
2024-08-30 20:22 ` Jason Gunthorpe
2024-07-24 1:10 ` [PATCH v6 03/11] KVM: arm64: " James Houghton
2024-07-25 21:55 ` James Houghton
2024-08-17 0:46 ` Sean Christopherson
2024-08-17 1:03 ` Yu Zhao
2024-08-19 20:41 ` Oliver Upton
2024-08-19 22:47 ` Sean Christopherson
2024-08-30 0:33 ` James Houghton
2024-08-30 0:48 ` Oliver Upton
2024-08-30 15:33 ` David Matlack
2024-08-30 17:38 ` Oliver Upton
2024-07-24 1:10 ` [PATCH v6 04/11] mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER James Houghton
2024-08-01 9:34 ` David Hildenbrand
2024-07-24 1:10 ` [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers James Houghton
2024-08-01 9:36 ` David Hildenbrand
2024-08-01 23:13 ` James Houghton
2024-08-02 15:57 ` David Hildenbrand
2024-08-05 16:54 ` James Houghton
2024-07-24 1:10 ` [PATCH v6 06/11] mm: Add has_fast_aging to struct mmu_notifier James Houghton
2024-07-24 1:10 ` [PATCH v6 07/11] KVM: Pass fast_only to kvm_{test_,}age_gfn James Houghton
2024-07-24 1:10 ` [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit James Houghton
2024-07-25 18:17 ` David Matlack
2024-08-17 1:00 ` Sean Christopherson
2024-08-30 0:34 ` James Houghton
2024-07-24 1:10 ` [PATCH v6 09/11] KVM: x86: Implement fast_only versions of kvm_{test_,}age_gfn James Houghton
2024-07-25 18:24 ` David Matlack
2024-07-24 1:10 ` [PATCH v6 10/11] mm: multi-gen LRU: Have secondary MMUs participate in aging James Houghton
2024-07-24 1:10 ` [PATCH v6 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Zr_3Vohvzt0KmFiN@google.com \
--to=seanjc@google.com \
--cc=akpm@linux-foundation.org \
--cc=ankita@nvidia.com \
--cc=axelrasmussen@google.com \
--cc=catalin.marinas@arm.com \
--cc=corbet@lwn.net \
--cc=dmatlack@google.com \
--cc=james.morse@arm.com \
--cc=jgg@ziepe.ca \
--cc=jthoughton@google.com \
--cc=kvm@vger.kernel.org \
--cc=kvmarm@lists.linux.dev \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=maz@kernel.org \
--cc=oliver.upton@linux.dev \
--cc=pbonzini@redhat.com \
--cc=rananta@google.com \
--cc=rientjes@google.com \
--cc=ryan.roberts@arm.com \
--cc=shahuang@redhat.com \
--cc=suzuki.poulose@arm.com \
--cc=weixugc@google.com \
--cc=will@kernel.org \
--cc=yuzenghui@huawei.com \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).