From: Sean Christopherson <seanjc@google.com>
To: James Houghton <jthoughton@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	David Matlack <dmatlack@google.com>,
	 David Rientjes <rientjes@google.com>,
	Marc Zyngier <maz@kernel.org>,
	 Oliver Upton <oliver.upton@linux.dev>,
	Wei Xu <weixugc@google.com>, Yu Zhao <yuzhao@google.com>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	kvm@vger.kernel.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
Date: Fri, 10 Jan 2025 14:47:08 -0800
Message-ID: <Z4GjbCRgyBe7k9gw@google.com>
In-Reply-To: <20241105184333.2305744-5-jthoughton@google.com>

On Tue, Nov 05, 2024, James Houghton wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index a24fca3f9e7f..f26d0b60d2dd 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  }
>  
>  /*
> - * SPTEs must be modified atomically if they are shadow-present, leaf
> - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> - * handler, and Accessed and Dirty bits can be set by the CPU.
> + * SPTEs must be modified atomically if they have bits that can be set outside
> + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> + * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
> + * cleared by age_gfn_range().
>   *
>   * Note, non-leaf SPTEs do have Accessed bits and those bits are
>   * technically volatile, but KVM doesn't consume the Accessed bit of
> @@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
>  {
>  	return is_shadow_present_pte(old_spte) &&
> -	       is_last_spte(old_spte, level) &&
> -	       spte_has_volatile_bits(old_spte);
> +	       is_last_spte(old_spte, level);

I don't like this change on multiple fronts.  First and foremost, it results in
spte_has_volatile_bits() being wrong for the TDP MMU.  Second, the same logic
applies to the shadow MMU; the rmap lock prevents a use-after-free of the page
that owns the SPTE, but the zapping of the SPTE happens before the writer grabs
the rmap lock.

Lastly, I'm very, very tempted to say we should omit Accessed state from
spte_has_volatile_bits() and rename it to something like spte_needs_atomic_write().
KVM x86 no longer flushes TLBs on aging, so we're already committed to incorrectly
thinking a page is old in rare cases, for the sake of performance.  The odds of
KVM clobbering the Accessed bit are probably smaller than the odds of missing an
Accessed update due to a stale TLB entry.

Note, only the shadow_accessed_mask check can be removed.  KVM needs to ensure
access-tracked SPTEs are zapped properly, and dirty logging can't have false
negatives.
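
E.g. maybe something like this (completely untested, and assuming the current
guts of spte_has_volatile_bits(); the name is very much a strawman):

	static bool spte_needs_atomic_write(u64 spte)
	{
		/*
		 * SPTEs can be made Writable outside of mmu_lock by the fast
		 * page fault handler.
		 */
		if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
			return true;

		/* Access-tracked SPTEs must be zapped/restored exactly once. */
		if (is_access_track_spte(spte))
			return true;

		/*
		 * Losing a Dirty update is a functional bug for dirty logging,
		 * whereas clobbering the Accessed bit is an acceptable tradeoff
		 * (see above), i.e. check Dirty but not Accessed.
		 */
		if (spte_ad_enabled(spte))
			return is_writable_pte(spte) && !(spte & shadow_dirty_mask);

		return false;
	}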

>  }
>  
>  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4508d868f1cd..f5b4f1060fff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>  		     ((_only_valid) && (_root)->role.invalid))) {		\
>  		} else
>  
> +/*
> + * Iterate over all TDP MMU roots in an RCU read-side critical section.

Heh, that's pretty darn obvious.  It would be far more helpful if the comment
explained the usage rules, e.g. what is safe (at a high level).
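
Maybe something along these lines (wording is very much a strawman):

	/*
	 * Iterate over valid TDP MMU roots for a lockless walk, e.g. when
	 * aging SPTEs.  The caller must hold rcu_read_lock(), which ensures
	 * the roots' page tables aren't freed, but does NOT protect against
	 * roots being invalidated or SPTEs being zapped, i.e. all SPTE
	 * accesses must be atomic and tolerant of concurrent changes.
	 */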

> + */
> +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)			\
> +	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)		\
> +		if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||	\
> +		    (_root)->role.invalid) {					\
> +		} else
> +
>  #define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
>  	__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
>  
> @@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
>  	u64 new_spte;
>  
>  	if (spte_ad_enabled(iter->old_spte)) {
> -		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> -							 iter->old_spte,
> -							 shadow_accessed_mask,
> -							 iter->level);
> +		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> +						shadow_accessed_mask);

Align, and let this poke past 80 (the 80-column rule is a soft limit; checkpatch
allows up to 100):

		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
								shadow_accessed_mask);

>  		new_spte = iter->old_spte & ~shadow_accessed_mask;
>  	} else {
>  		new_spte = mark_spte_for_access_track(iter->old_spte);
> -		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
> -							iter->old_spte, new_spte,
> -							iter->level);
> +		/*
> +		 * It is safe for the following cmpxchg to fail. Leave the
> +		 * Accessed bit set, as the spte is most likely young anyway.
> +		 */
> +		(void)__tdp_mmu_set_spte_atomic(iter, new_spte);

Just a reminder that this needs to be:

		if (__tdp_mmu_set_spte_atomic(iter, new_spte))
			return;

otherwise a failed cmpxchg falls through and traces a SPTE change that was
never actually made.

>  	}
>  
>  	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
> -- 
> 2.47.0.199.ga7371fff76-goog
> 
