public inbox for kvm@vger.kernel.org
* [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
@ 2025-02-04  0:40 James Houghton
  2025-02-04  0:40 ` [PATCH v9 01/11] KVM: Rename kvm_handle_hva_range() James Houghton
                   ` (12 more replies)
  0 siblings, 13 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
waiting for aging. This contention reduction improves guest performance,
saves a significant amount of Google Cloud's CPU usage, and yields
valuable improvements for ChromeOS, as Yu has mentioned previously[1].

Please see v8[8] for some performance results using
access_tracking_perf_test patched to use MGLRU.

Neither access_tracking_perf_test nor mmu_stress_test trigger any
splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.

=== Previous Versions ===

Since v8[8]:
 - Re-added the kvm_handle_hva_range helpers and applied Sean's
   kvm_{handle -> age}_hva_range rename.
 - Renamed spte_has_volatile_bits() to spte_needs_atomic_write() and
   removed its Accessed bit check. Undid change to
   tdp_mmu_spte_need_atomic_write().
 - Renamed KVM_MMU_NOTIFIER_{YOUNG -> AGING}_LOCKLESS.
 - cpu_relax(), lockdep, preempt_disable(), and locking fixups for
   per-rmap lock (thanks Lai and Sean).
 - Renamed kvm_{has -> may_have}_shadow_mmu_sptes().
 - Rebased onto latest kvm/next, including changing
   for_each_tdp_mmu_root_rcu to use `types`.
 - Dropped MGLRU changes from access_tracking_perf_test.
 - Picked up Acked-bys from Yu. (thank you!)

Since v7[7]:
 - Dropped MGLRU changes.
 - Dropped DAMON cleanup.
 - Dropped MMU notifier changes completely.
 - Made shadow MMU aging *always* lockless, not just lockless when the
   now-removed "fast_only" clear notifier was used.
 - Dropped the new capability check from the selftest, given that the
   MGLRU changes no longer introduce a new MGLRU capability.
 - Rebased on top of latest kvm-x86/next, including the x86 mmu changes
   for marking pages as dirty.

Since v6[6]:
 - Rebased on top of kvm-x86/next and Sean's lockless rmap walking
   changes.
 - Removed HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY (thanks DavidM).
 - Split up kvm_age_gfn() / kvm_test_age_gfn() optimizations (thanks
   DavidM and Sean).
 - Improved new MMU notifier documentation (thanks DavidH).
 - Dropped arm64 locking change.
 - No longer retry for CAS failure in TDP MMU non-A/D case (thanks
   Sean).
 - Added some R-bys and A-bys.

Since v5[5]:
 - Reworked test_clear_young_fast_only() into a new parameter for the
   existing notifiers (thanks Sean).
 - Added mmu_notifier.has_fast_aging to tell mm if calling fast-only
   notifiers should be done.
 - Added mm_has_fast_young_notifiers() to inform users if calling
   fast-only notifier helpers is worthwhile (for look-around to use).
 - Changed MGLRU to invoke a single notifier instead of two when
   aging and doing look-around (thanks Yu).
 - For KVM/x86, check indirect_shadow_pages > 0 instead of
   kvm_memslots_have_rmaps() when collecting age information
   (thanks Sean).
 - For KVM/arm, some fixes from Oliver.
 - Small fixes to access_tracking_perf_test.
 - Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().

Since v4[4]:
 - Removed Kconfig that controlled when aging was enabled. Aging will
   be done whenever the architecture supports it (thanks Yu).
 - Added a new MMU notifier, test_clear_young_fast_only(), specifically
   for MGLRU to use.
 - Added kvm_fast_{test_,}age_gfn, implemented by x86.
 - Fixed locking for clear_flush_young().
 - Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
   (thanks Sean).
 - Fixed the WARN_ON and did other cleanup for the arm64 locking
   changes (thanks Oliver).

Since v3[3]:
 - Vastly simplified the series (thanks David). Removed mmu notifier
   batching logic entirely.
 - Cleaned up how locking is done for mmu_notifier_test/clear_young
   (thanks David).
 - Look-around is now only done when there are no secondary MMUs
   subscribed to MMU notifiers.
 - CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
 - Fixed the lockless implementation of kvm_{test,}age_gfn for x86
   (thanks David).
 - Added MGLRU functional and performance tests to
   access_tracking_perf_test (thanks Axel).
 - In v3, an mm would be completely ignored (for aging) if there was a
   secondary MMU but support for secondary MMU walking was missing. Now,
   missing secondary MMU walking support simply skips the notifier
   calls (except for eviction).
 - Added a sanity check that range->lockless and range->on_lock are
   never both provided for the memslot walk.

For the changes since v2[2], see v3.

Based on latest kvm/next.

[1]: https://lore.kernel.org/kvm/CAOUHufYS0XyLEf_V+q5SCW54Zy2aW5nL8CnSWreM8d1rX5NKYg@mail.gmail.com/
[2]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/
[3]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
[5]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
[6]: https://lore.kernel.org/linux-mm/20240724011037.3671523-1-jthoughton@google.com/
[7]: https://lore.kernel.org/kvm/20240926013506.860253-1-jthoughton@google.com/
[8]: https://lore.kernel.org/kvm/20241105184333.2305744-1-jthoughton@google.com/

James Houghton (7):
  KVM: Rename kvm_handle_hva_range()
  KVM: Add lockless memslot walk to KVM
  KVM: x86/mmu: Factor out spte atomic bit clearing routine
  KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
  KVM: x86/mmu: Rename spte_has_volatile_bits() to
    spte_needs_atomic_write()
  KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as
    young
  KVM: x86/mmu: Only check gfn age in shadow MMU if
    indirect_shadow_pages > 0

Sean Christopherson (4):
  KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
    mmu_lock
  KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
    mmu_lock
  KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
    gfns

 Documentation/virt/kvm/locking.rst |   4 +-
 arch/x86/include/asm/kvm_host.h    |   4 +-
 arch/x86/kvm/Kconfig               |   1 +
 arch/x86/kvm/mmu/mmu.c             | 364 +++++++++++++++++++++--------
 arch/x86/kvm/mmu/spte.c            |  19 +-
 arch/x86/kvm/mmu/spte.h            |   2 +-
 arch/x86/kvm/mmu/tdp_iter.h        |  26 ++-
 arch/x86/kvm/mmu/tdp_mmu.c         |  36 ++-
 include/linux/kvm_host.h           |   1 +
 virt/kvm/Kconfig                   |   2 +
 virt/kvm/kvm_main.c                |  56 +++--
 11 files changed, 364 insertions(+), 151 deletions(-)


base-commit: f7bafceba76e9ab475b413578c1757ee18c3e44b
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 01/11] KVM: Rename kvm_handle_hva_range()
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-04  0:40 ` [PATCH v9 02/11] KVM: Add lockless memslot walk to KVM James Houghton
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Rename kvm_handle_hva_range() to kvm_age_hva_range(),
kvm_handle_hva_range_no_flush() to kvm_age_hva_range_no_flush(), and
__kvm_handle_hva_range() to kvm_handle_hva_range(), as
kvm_age_hva_range() will get more aging-specific functionality.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 virt/kvm/kvm_main.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index add042839823..1bd49770506a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -551,8 +551,8 @@ static void kvm_null_fn(void)
 	     node;							     \
 	     node = interval_tree_iter_next(node, start, last))	     \
 
-static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
-							   const struct kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
+							 const struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_mmu_notifier_return r = {
 		.ret = false,
@@ -633,7 +633,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	return r;
 }
 
-static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
+static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
 						unsigned long end,
 						gfn_handler_t handler,
@@ -649,15 +649,15 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.may_block	= false,
 	};
 
-	return __kvm_handle_hva_range(kvm, &range).ret;
+	return kvm_handle_hva_range(kvm, &range).ret;
 }
 
-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
-							 unsigned long start,
-							 unsigned long end,
-							 gfn_handler_t handler)
+static __always_inline int kvm_age_hva_range_no_flush(struct mmu_notifier *mn,
+						      unsigned long start,
+						      unsigned long end,
+						      gfn_handler_t handler)
 {
-	return kvm_handle_hva_range(mn, start, end, handler, false);
+	return kvm_age_hva_range(mn, start, end, handler, false);
 }
 
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
@@ -752,7 +752,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * that guest memory has been reclaimed.  This needs to be done *after*
 	 * dropping mmu_lock, as x86's reclaim path is slooooow.
 	 */
-	if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
+	if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
 		kvm_arch_guest_memory_reclaimed(kvm);
 
 	return 0;
@@ -798,7 +798,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	};
 	bool wake;
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	kvm_handle_hva_range(kvm, &hva_range);
 
 	/* Pairs with the increment in range_start(). */
 	spin_lock(&kvm->mn_invalidate_lock);
@@ -822,8 +822,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 {
 	trace_kvm_age_hva(start, end);
 
-	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn,
-				    !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG));
+	return kvm_age_hva_range(mn, start, end, kvm_age_gfn,
+				 !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG));
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -846,7 +846,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
-	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+	return kvm_age_hva_range_no_flush(mn, start, end, kvm_age_gfn);
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -855,8 +855,8 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 {
 	trace_kvm_test_age_hva(address);
 
-	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
-					     kvm_test_age_gfn);
+	return kvm_age_hva_range_no_flush(mn, address, address + 1,
+					  kvm_test_age_gfn);
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
-- 
2.48.1.362.g079036d154-goog



* [PATCH v9 02/11] KVM: Add lockless memslot walk to KVM
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
  2025-02-04  0:40 ` [PATCH v9 01/11] KVM: Rename kvm_handle_hva_range() James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-14 15:26   ` Sean Christopherson
  2025-02-04  0:40 ` [PATCH v9 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Some architectures can correctly perform aging without taking the KVM
MMU lock; this option allows them to do so. Architectures that select
CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS are responsible for
correctness.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/Kconfig         |  2 ++
 virt/kvm/kvm_main.c      | 24 +++++++++++++++++-------
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f34f4cfaa513..c28a6aa1f2ed 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -267,6 +267,7 @@ struct kvm_gfn_range {
 	union kvm_mmu_notifier_arg arg;
 	enum kvm_gfn_range_filter attr_filter;
 	bool may_block;
+	bool lockless;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..9356f4e4e255 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -102,6 +102,8 @@ config KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
        depends on KVM_GENERIC_MMU_NOTIFIER
+
+config KVM_MMU_NOTIFIER_AGING_LOCKLESS
        bool
 
 config KVM_GENERIC_MEMORY_ATTRIBUTES
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1bd49770506a..4734ae9e8a54 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -517,6 +517,7 @@ struct kvm_mmu_notifier_range {
 	on_lock_fn_t on_lock;
 	bool flush_on_ret;
 	bool may_block;
+	bool lockless;
 };
 
 /*
@@ -571,6 +572,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			 IS_KVM_NULL_FN(range->handler)))
 		return r;
 
+	/* on_lock will never be called for lockless walks */
+	if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
+		return r;
+
 	idx = srcu_read_lock(&kvm->srcu);
 
 	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
@@ -607,15 +612,18 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
+			gfn_range.lockless = range->lockless;
 
 			if (!r.found_memslot) {
 				r.found_memslot = true;
-				KVM_MMU_LOCK(kvm);
-				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm);
-
-				if (IS_KVM_NULL_FN(range->handler))
-					goto mmu_unlock;
+				if (!range->lockless) {
+					KVM_MMU_LOCK(kvm);
+					if (!IS_KVM_NULL_FN(range->on_lock))
+						range->on_lock(kvm);
+
+					if (IS_KVM_NULL_FN(range->handler))
+						goto mmu_unlock;
+				}
 			}
 			r.ret |= range->handler(kvm, &gfn_range);
 		}
@@ -625,7 +633,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 
 mmu_unlock:
-	if (r.found_memslot)
+	if (r.found_memslot && !range->lockless)
 		KVM_MMU_UNLOCK(kvm);
 
 	srcu_read_unlock(&kvm->srcu, idx);
@@ -647,6 +655,8 @@ static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= flush_on_ret,
 		.may_block	= false,
+		.lockless	=
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS),
 	};
 
 	return kvm_handle_hva_range(kvm, &range).ret;
-- 
2.48.1.362.g079036d154-goog
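The conditional locking that this patch adds to kvm_handle_hva_range() can be modeled as a small userspace sketch. This is illustrative only: walk_memslots(), demo_handler(), and the lock counters are stand-ins invented here, not kernel names; the point is that the mmu_lock is taken at most once, only on the first matching memslot, and only for non-lockless walks.

```c
#include <assert.h>
#include <stdbool.h>

struct walk_range {
	bool lockless;                 /* models range->lockless */
	bool (*handler)(int slot);     /* models range->handler */
};

static int mmu_lock_held;      /* models KVM_MMU_LOCK()/KVM_MMU_UNLOCK() */
static int lock_acquisitions;  /* how many times the "lock" was taken */

static bool walk_memslots(const struct walk_range *range, int nr_slots)
{
	bool found_memslot = false;
	bool ret = false;

	for (int i = 0; i < nr_slots; i++) {
		/* lock lazily, on the first memslot, and only if !lockless */
		if (!found_memslot) {
			found_memslot = true;
			if (!range->lockless) {
				mmu_lock_held = 1;
				lock_acquisitions++;
			}
		}
		ret |= range->handler(i);
	}

	/* mirrors: if (r.found_memslot && !range->lockless) KVM_MMU_UNLOCK() */
	if (found_memslot && !range->lockless)
		mmu_lock_held = 0;

	return ret;
}

/* sample handler: pretend only slot 2 contained a young page */
static bool demo_handler(int slot)
{
	return slot == 2;
}
```

A lockless walk never touches the lock, while a locked walk takes and releases it exactly once regardless of how many memslots are visited.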


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v9 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
  2025-02-04  0:40 ` [PATCH v9 01/11] KVM: Rename kvm_handle_hva_range() James Houghton
  2025-02-04  0:40 ` [PATCH v9 02/11] KVM: Add lockless memslot walk to KVM James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-04  0:40 ` [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn() James Houghton
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
follow-up patch to enable lockless Accessed bit clearing.

Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.h | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 047b78333653..9135b035fa40 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
 	return xchg(rcu_dereference(sptep), new_spte);
 }
 
+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+	return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
 static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 {
 	KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
@@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
 static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
 					  u64 mask, int level)
 {
-	atomic64_t *sptep_atomic;
-
-	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
-		sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
-		return (u64)atomic64_fetch_and(~mask, sptep_atomic);
-	}
+	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+		return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
 
 	__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
 	return old_spte;
-- 
2.48.1.362.g079036d154-goog
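The semantics of the factored-out helper can be sketched in userspace with C11 atomics. The bit position below is illustrative (not KVM's SPTE layout); the key property, as in atomic64_fetch_and(~mask, sptep), is that the clear is a single atomic read-modify-write returning the old value, so the caller can still observe whether the Accessed bit had been set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_ACCESSED (UINT64_C(1) << 8) /* illustrative bit position */

static uint64_t clear_spte_bits_atomic(_Atomic uint64_t *sptep, uint64_t mask)
{
	/* mirrors atomic64_fetch_and(~mask, sptep): returns the old value */
	return atomic_fetch_and(sptep, ~mask);
}

static bool demo_clear_accessed(void)
{
	_Atomic uint64_t spte = SPTE_ACCESSED | 0x7;
	uint64_t old = clear_spte_bits_atomic(&spte, SPTE_ACCESSED);

	/* the returned old value reports the bit; the stored SPTE lost it */
	return (old & SPTE_ACCESSED) && atomic_load(&spte) == 0x7;
}
```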



* [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (2 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-12 22:07   ` Sean Christopherson
  2025-02-04  0:40 ` [PATCH v9 05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write() James Houghton
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Walk the TDP MMU in an RCU read-side critical section without holding
mmu_lock when harvesting and potentially updating age information on
sptes. This requires a way to do RCU-safe walking of the tdp_mmu_roots;
do this with a new macro. The PTE modifications are now always done
atomically.

spte_has_volatile_bits() no longer checks the Accessed bit at all. The
bit can (now) be set and cleared without taking the mmu_lock, but
dropping Accessed bit updates is already tolerated (the TLB is not
invalidated after clearing the Accessed bit).

If the cmpxchg for marking the spte for access tracking fails, leave
the spte as is and treat it as young: if the spte is being actively
modified, it is most likely young.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 10 +++++++--
 arch/x86/kvm/mmu/spte.c         | 10 +++++++--
 arch/x86/kvm/mmu/tdp_iter.h     |  9 +++++----
 arch/x86/kvm/mmu/tdp_mmu.c      | 36 +++++++++++++++++++++++----------
 6 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f378cd43241c..0e44fc1cec0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1479,6 +1479,7 @@ struct kvm_arch {
 	 * tdp_mmu_page set.
 	 *
 	 * For reads, this list is protected by:
+	 *	RCU alone or
 	 *	the MMU lock in read mode + RCU or
 	 *	the MMU lock in write mode
 	 *
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ea2c4f21c1ca..f0a60e59c884 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,6 +22,7 @@ config KVM_X86
 	select KVM_COMMON
 	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_ELIDE_TLB_FLUSH_IF_YOUNG
+	select KVM_MMU_NOTIFIER_AGING_LOCKLESS
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a45ae60e84ab..7779b49f386d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1592,8 +1592,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, false);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1605,8 +1608,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, true);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 22551e2f1d00..e984b440c0f0 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
 		return true;
 
 	if (spte_ad_enabled(spte)) {
-		if (!(spte & shadow_accessed_mask) ||
-		    (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
+		/*
+		 * Do not check the Accessed bit. It can be set (by the CPU)
+		 * and cleared (by kvm_tdp_mmu_age_spte()) without holding
+		 * the mmu_lock, but when clearing the Accessed bit, we do
+		 * not invalidate the TLB, so we can already miss Accessed bit
+		 * updates.
+		 */
+		if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
 			return true;
 	}
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 9135b035fa40..05e9d678aac9 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 }
 
 /*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
+ * cleared by age_gfn_range().
  *
  * Note, non-leaf SPTEs do have Accessed bits and those bits are
  * technically volatile, but KVM doesn't consume the Accessed bit of
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 046b6ba31197..c9778c3e6ecd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -193,6 +193,19 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		     !tdp_mmu_root_match((_root), (_types)))) {			\
 		} else
 
+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ * It is safe to iterate over the SPTEs under the root, but their values will
+ * be unstable, so all writes must be atomic. As this routine is meant to be
+ * used without holding the mmu_lock at all, any bits that are flipped must
+ * be reflected in kvm_tdp_mmu_spte_need_atomic_write().
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id, _types)			\
+	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)		\
+		if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||	\
+		    !tdp_mmu_root_match((_root), (_types))) {			\
+		} else
+
 #define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id)		\
 	__for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_VALID_ROOTS)
 
@@ -1332,21 +1345,22 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
  * from the clear_young() or clear_flush_young() notifier, which uses the
  * return value to determine if the page has been accessed.
  */
-static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
+static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
 {
 	u64 new_spte;
 
 	if (spte_ad_enabled(iter->old_spte)) {
-		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
-							 iter->old_spte,
-							 shadow_accessed_mask,
-							 iter->level);
+		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+								shadow_accessed_mask);
 		new_spte = iter->old_spte & ~shadow_accessed_mask;
 	} else {
 		new_spte = mark_spte_for_access_track(iter->old_spte);
-		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
-							iter->old_spte, new_spte,
-							iter->level);
+		/*
+		 * It is safe for the following cmpxchg to fail. Leave the
+		 * Accessed bit set, as the spte is most likely young anyway.
+		 */
+		if (__tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
+			return;
 	}
 
 	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
@@ -1371,9 +1385,9 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
 	 * valid roots!
 	 */
 	WARN_ON(types & ~KVM_VALID_ROOTS);
-	__for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
-		guard(rcu)();
 
+	guard(rcu)();
+	for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id, types) {
 		tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end) {
 			if (!is_accessed_spte(iter.old_spte))
 				continue;
@@ -1382,7 +1396,7 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
 				return true;
 
 			ret = true;
-			kvm_tdp_mmu_age_spte(&iter);
+			kvm_tdp_mmu_age_spte(kvm, &iter);
 		}
 	}
 
-- 
2.48.1.362.g079036d154-goog
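The commit message's two aging paths can be sketched in userspace C. Everything here is an illustrative model, not KVM code: the bit position and mark_for_access_track() are made up. With A/D bits enabled, the Accessed bit is cleared with an atomic fetch-and; without them, the SPTE is rewritten with a cmpxchg that is allowed to fail, since a racing modification implies the page is young anyway.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define ACCESSED_BIT (UINT64_C(1) << 8) /* illustrative bit position */

static uint64_t mark_for_access_track(uint64_t spte)
{
	/* illustrative stand-in; the real helper also saves R/W/X bits */
	return spte & ~ACCESSED_BIT;
}

/* returns whether the page was young; clears age state as a side effect */
static bool age_spte(_Atomic uint64_t *sptep, bool ad_enabled)
{
	uint64_t old = atomic_load(sptep);

	if (ad_enabled)
		/* A/D path: one atomic clear, old value reports youngness */
		return atomic_fetch_and(sptep, ~ACCESSED_BIT) & ACCESSED_BIT;

	/*
	 * Access-tracking path: cmpxchg may fail against a racing
	 * writer; tolerate that and report the page as young.
	 */
	atomic_compare_exchange_strong(sptep, &old, mark_for_access_track(old));
	return true;
}

static bool demo_ad_path(void)
{
	_Atomic uint64_t spte = ACCESSED_BIT | 0x7;
	bool young = age_spte(&spte, true);

	return young && !(atomic_load(&spte) & ACCESSED_BIT);
}

static bool demo_access_track_path(void)
{
	_Atomic uint64_t spte = ACCESSED_BIT | 0x7;
	bool young = age_spte(&spte, false);

	return young && !(atomic_load(&spte) & ACCESSED_BIT);
}
```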



* [PATCH v9 05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write()
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (3 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn() James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-12 22:09   ` Sean Christopherson
  2025-02-04  0:40 ` [PATCH v9 06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young James Houghton
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

spte_has_volatile_bits() is now a misnomer, as an SPTE can have its
Accessed bit set or cleared without the mmu_lock held, yet the state of
the Accessed bit is not checked in spte_has_volatile_bits().
Even if a caller uses spte_needs_atomic_write(), Accessed bit
information may still be lost, but that is already tolerated, as the TLB
is not invalidated after the Accessed bit is cleared.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 Documentation/virt/kvm/locking.rst | 4 ++--
 arch/x86/kvm/mmu/mmu.c             | 4 ++--
 arch/x86/kvm/mmu/spte.c            | 9 +++++----
 arch/x86/kvm/mmu/spte.h            | 2 +-
 arch/x86/kvm/mmu/tdp_iter.h        | 2 +-
 5 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index c56d5f26c750..4720053c70a3 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -196,7 +196,7 @@ writable between reading spte and updating spte. Like below case:
 The Dirty bit is lost in this case.
 
 In order to avoid this kind of issue, we always treat the spte as "volatile"
-if it can be updated out of mmu-lock [see spte_has_volatile_bits()]; it means
+if it can be updated out of mmu-lock [see spte_needs_atomic_write()]; it means
 the spte is always atomically updated in this case.
 
 3) flush tlbs due to spte updated
@@ -212,7 +212,7 @@ function to update spte (present -> present).
 
 Since the spte is "volatile" if it can be updated out of mmu-lock, we always
 atomically update the spte and the race caused by fast page fault can be avoided.
-See the comments in spte_has_volatile_bits() and mmu_spte_update().
+See the comments in spte_needs_atomic_write() and mmu_spte_update().
 
 Lockless Access Tracking:
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7779b49f386d..1fa0f47eb6a5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -501,7 +501,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 		return false;
 	}
 
-	if (!spte_has_volatile_bits(old_spte))
+	if (!spte_needs_atomic_write(old_spte))
 		__update_clear_spte_fast(sptep, new_spte);
 	else
 		old_spte = __update_clear_spte_slow(sptep, new_spte);
@@ -524,7 +524,7 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 	int level = sptep_to_sp(sptep)->role.level;
 
 	if (!is_shadow_present_pte(old_spte) ||
-	    !spte_has_volatile_bits(old_spte))
+	    !spte_needs_atomic_write(old_spte))
 		__update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
 	else
 		old_spte = __update_clear_spte_slow(sptep, SHADOW_NONPRESENT_VALUE);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index e984b440c0f0..ae2017cc1239 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -129,11 +129,12 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 }
 
 /*
- * Returns true if the SPTE has bits that may be set without holding mmu_lock.
- * The caller is responsible for checking if the SPTE is shadow-present, and
- * for determining whether or not the caller cares about non-leaf SPTEs.
+ * Returns true if the SPTE has bits other than the Accessed bit that may be
+ * changed without holding mmu_lock. The caller is responsible for checking if
+ * the SPTE is shadow-present, and for determining whether or not the caller
+ * cares about non-leaf SPTEs.
  */
-bool spte_has_volatile_bits(u64 spte)
+bool spte_needs_atomic_write(u64 spte)
 {
 	if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
 		return true;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 59746854c0af..4c290ae9a02a 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -519,7 +519,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
 	return gen;
 }
 
-bool spte_has_volatile_bits(u64 spte);
+bool spte_needs_atomic_write(u64 spte);
 
 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 05e9d678aac9..b54123163efc 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -55,7 +55,7 @@ static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
 {
 	return is_shadow_present_pte(old_spte) &&
 	       is_last_spte(old_spte, level) &&
-	       spte_has_volatile_bits(old_spte);
+	       spte_needs_atomic_write(old_spte);
 }
 
 static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v9 06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (4 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write() James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-04  0:40 ` [PATCH v9 07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Reorder the processing of the TDP MMU versus the shadow MMU when aging
SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP
MMU reports that the page is young, i.e. completely avoid taking mmu_lock
if the TDP MMU SPTE is young.  Swap the order for the test-and-age helper
as well for consistency.

Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1fa0f47eb6a5..4a9de4b330d7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1592,15 +1592,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
+	if (tdp_mmu_enabled)
+		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
+
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		young = kvm_rmap_age_gfn_range(kvm, range, false);
+		young |= kvm_rmap_age_gfn_range(kvm, range, false);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-	if (tdp_mmu_enabled)
-		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
-
 	return young;
 }
 
@@ -1608,15 +1608,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm)) {
+	if (tdp_mmu_enabled)
+		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
+
+	if (!young && kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		young = kvm_rmap_age_gfn_range(kvm, range, true);
+		young |= kvm_rmap_age_gfn_range(kvm, range, true);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-	if (tdp_mmu_enabled)
-		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
-
 	return young;
 }
 
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v9 07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (5 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-04  0:40 ` [PATCH v9 08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock James Houghton
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and
only if the VM has at least one shadow page, as opposed to checking if the
VM has rmaps.  Checking for rmaps will effectively yield a false positive
if the VM ran nested TDP VMs in the past, but is not currently doing so.

Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4a9de4b330d7..f75779d8d6fd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1588,6 +1588,11 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
 	return young;
 }
 
+static bool kvm_may_have_shadow_mmu_sptes(struct kvm *kvm)
+{
+	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
@@ -1595,7 +1600,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
 
-	if (kvm_memslots_have_rmaps(kvm)) {
+	if (kvm_may_have_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young |= kvm_rmap_age_gfn_range(kvm, range, false);
 		write_unlock(&kvm->mmu_lock);
@@ -1611,7 +1616,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
 
-	if (!young && kvm_memslots_have_rmaps(kvm)) {
+	if (!young && kvm_may_have_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young |= kvm_rmap_age_gfn_range(kvm, range, true);
 		write_unlock(&kvm->mmu_lock);
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v9 08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (6 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-04  0:40 ` [PATCH v9 09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

From: Sean Christopherson <seanjc@google.com>

Refactor the pte_list and rmap code to always read and write rmap_head->val
exactly once, e.g. by collecting changes in a local variable and then
propagating those changes back to rmap_head->val as appropriate.  This will
allow implementing a per-rmap rwlock (of sorts) by adding a LOCKED bit into
the rmap value alongside the MANY bit.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 83 +++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f75779d8d6fd..a24cf8ddca7f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -864,21 +864,24 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
 static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
+	unsigned long old_val, new_val;
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	if (!rmap_head->val) {
-		rmap_head->val = (unsigned long)spte;
-	} else if (!(rmap_head->val & KVM_RMAP_MANY)) {
+	old_val = rmap_head->val;
+
+	if (!old_val) {
+		new_val = (unsigned long)spte;
+	} else if (!(old_val & KVM_RMAP_MANY)) {
 		desc = kvm_mmu_memory_cache_alloc(cache);
-		desc->sptes[0] = (u64 *)rmap_head->val;
+		desc->sptes[0] = (u64 *)old_val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
 		desc->tail_count = 0;
-		rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
+		new_val = (unsigned long)desc | KVM_RMAP_MANY;
 		++count;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+		desc = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 		count = desc->tail_count + desc->spte_count;
 
 		/*
@@ -887,21 +890,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		 */
 		if (desc->spte_count == PTE_LIST_EXT) {
 			desc = kvm_mmu_memory_cache_alloc(cache);
-			desc->more = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+			desc->more = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 			desc->spte_count = 0;
 			desc->tail_count = count;
-			rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
+			new_val = (unsigned long)desc | KVM_RMAP_MANY;
+		} else {
+			new_val = old_val;
 		}
 		desc->sptes[desc->spte_count++] = spte;
 	}
+
+	rmap_head->val = new_val;
+
 	return count;
 }
 
-static void pte_list_desc_remove_entry(struct kvm *kvm,
-				       struct kvm_rmap_head *rmap_head,
+static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rmap_val,
 				       struct pte_list_desc *desc, int i)
 {
-	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(*rmap_val & ~KVM_RMAP_MANY);
 	int j = head_desc->spte_count - 1;
 
 	/*
@@ -928,9 +935,9 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 	 * head at the next descriptor, i.e. the new head.
 	 */
 	if (!head_desc->more)
-		rmap_head->val = 0;
+		*rmap_val = 0;
 	else
-		rmap_head->val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
+		*rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -938,24 +945,26 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 			    struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
+	unsigned long rmap_val;
 	int i;
 
-	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
-		return;
+	rmap_val = rmap_head->val;
+	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
+		goto out;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
-		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
-			return;
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm))
+			goto out;
 
-		rmap_head->val = 0;
+		rmap_val = 0;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+		desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(kvm, rmap_head,
+					pte_list_desc_remove_entry(kvm, &rmap_val,
 								   desc, i);
-					return;
+					goto out;
 				}
 			}
 			desc = desc->more;
@@ -963,6 +972,9 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 
 		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
 	}
+
+out:
+	rmap_head->val = rmap_val;
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -977,17 +989,19 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 				   struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc, *next;
+	unsigned long rmap_val;
 	int i;
 
-	if (!rmap_head->val)
+	rmap_val = rmap_head->val;
+	if (!rmap_val)
 		return false;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
-		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val);
 		goto out;
 	}
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 
 	for (; desc; desc = next) {
 		for (i = 0; i < desc->spte_count; i++)
@@ -1003,14 +1017,15 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
+	unsigned long rmap_val = rmap_head->val;
 	struct pte_list_desc *desc;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return 0;
-	else if (!(rmap_head->val & KVM_RMAP_MANY))
+	else if (!(rmap_val & KVM_RMAP_MANY))
 		return 1;
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	return desc->tail_count + desc->spte_count;
 }
 
@@ -1053,6 +1068,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
  */
 struct rmap_iterator {
 	/* private fields */
+	struct kvm_rmap_head *head;
 	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
 	int pos;			/* index of the sptep */
 };
@@ -1067,18 +1083,19 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
+	unsigned long rmap_val = rmap_head->val;
 	u64 *sptep;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return NULL;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
+	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_head->val;
+		sptep = (u64 *)rmap_val;
 		goto out;
 	}
 
-	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v9 09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (7 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-04  0:40 ` [PATCH v9 10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs James Houghton
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

From: Sean Christopherson <seanjc@google.com>

Steal another bit from rmap entries (which are word aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
the bit to implement a *very* rudimentary per-rmap spinlock.  The only
anticipated usage of the lock outside of mmu_lock is for aging gfns, and
collisions between aging and other MMU rmap operations are quite rare,
e.g. unless userspace is being silly and aging a tiny range over and over
in a tight loop, time between contention when aging an actively running VM
is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
necessary.

Note, the lock only protects the rmap structure itself; SPTEs that are
pointed at by a locked rmap can still be modified and zapped by another
task (KVM drops/zaps SPTEs before deleting the rmap entries).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +-
 arch/x86/kvm/mmu/mmu.c          | 107 ++++++++++++++++++++++++++++----
 2 files changed, 98 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0e44fc1cec0d..bd18fde99116 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -28,6 +28,7 @@
 #include <linux/kfifo.h>
 #include <linux/sched/vhost_task.h>
 #include <linux/call_once.h>
+#include <linux/atomic.h>
 
 #include <asm/apic.h>
 #include <asm/pvclock-abi.h>
@@ -406,7 +407,7 @@ union kvm_cpu_role {
 };
 
 struct kvm_rmap_head {
-	unsigned long val;
+	atomic_long_t val;
 };
 
 struct kvm_pio_request {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a24cf8ddca7f..267cf2d4c3e3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -853,11 +853,95 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * About rmap_head encoding:
  *
  * If the bit zero of rmap_head->val is clear, then it points to the only spte
- * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
+ * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
  * pte_list_desc containing more mappings.
  */
 #define KVM_RMAP_MANY	BIT(0)
 
+/*
+ * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
+ * operates with mmu_lock held for write), but rmaps can be walked without
+ * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
+ * being zapped/dropped _while the rmap is locked_.
+ *
+ * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
+ * done while holding mmu_lock for write.  This allows a task walking rmaps
+ * without holding mmu_lock to concurrently walk the same entries as a task
+ * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
+ * the rmaps, thus the walks are stable.
+ *
+ * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
+ * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
+ * ensures all "struct pte_list_desc" fields are stable.
+ */
+#define KVM_RMAP_LOCKED	BIT(1)
+
+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	unsigned long old_val, new_val;
+
+	/*
+	 * Elide the lock if the rmap is empty, as lockless walkers (read-only
+	 * mode) don't need to (and can't) walk an empty rmap, nor can they add
+	 * entries to the rmap.  I.e. the only paths that process empty rmaps
+	 * do so while holding mmu_lock for write, and are mutually exclusive.
+	 */
+	old_val = atomic_long_read(&rmap_head->val);
+	if (!old_val)
+		return 0;
+
+	do {
+		/*
+		 * If the rmap is locked, wait for it to be unlocked before
+		 * trying to acquire the lock, e.g. to bounce the cache line.
+		 */
+		while (old_val & KVM_RMAP_LOCKED) {
+			cpu_relax();
+			old_val = atomic_long_read(&rmap_head->val);
+		}
+
+		/*
+		 * Recheck for an empty rmap, it may have been purged by the
+		 * task that held the lock.
+		 */
+		if (!old_val)
+			return 0;
+
+		new_val = old_val | KVM_RMAP_LOCKED;
+	/*
+	 * Use try_cmpxchg_acquire to prevent reads and writes to the rmap
+	 * from being reordered outside of the critical section created by
+	 * kvm_rmap_lock.
+	 *
+	 * Pairs with smp_store_release in kvm_rmap_unlock.
+	 *
+	 * For the !old_val case, no ordering is needed, as there is no rmap
+	 * to walk.
+	 */
+	} while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));
+
+	/* Return the old value, i.e. _without_ the LOCKED bit set. */
+	return old_val;
+}
+
+static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
+			    unsigned long new_val)
+{
+	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	/*
+	 * Ensure that all accesses to the rmap have completed
+	 * before we actually unlock the rmap.
+	 *
+	 * Pairs with the atomic_long_try_cmpxchg_acquire in kvm_rmap_lock.
+	 */
+	atomic_long_set_release(&rmap_head->val, new_val);
+}
+
+static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
+{
+	return atomic_long_read(&rmap_head->val) & ~KVM_RMAP_LOCKED;
+}
+
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
@@ -868,7 +952,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	old_val = rmap_head->val;
+	old_val = kvm_rmap_lock(rmap_head);
 
 	if (!old_val) {
 		new_val = (unsigned long)spte;
@@ -900,7 +984,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		desc->sptes[desc->spte_count++] = spte;
 	}
 
-	rmap_head->val = new_val;
+	kvm_rmap_unlock(rmap_head, new_val);
 
 	return count;
 }
@@ -948,7 +1032,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
 		goto out;
 
@@ -974,7 +1058,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	}
 
 out:
-	rmap_head->val = rmap_val;
+	kvm_rmap_unlock(rmap_head, rmap_val);
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -992,7 +1076,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (!rmap_val)
 		return false;
 
@@ -1011,13 +1095,13 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	kvm_rmap_unlock(rmap_head, 0);
 	return true;
 }
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	struct pte_list_desc *desc;
 
 	if (!rmap_val)
@@ -1083,7 +1167,7 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	u64 *sptep;
 
 	if (!rmap_val)
@@ -1418,7 +1502,7 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 	while (++iterator->rmap <= iterator->end_rmap) {
 		iterator->gfn += KVM_PAGES_PER_HPAGE(iterator->level);
 
-		if (iterator->rmap->val)
+		if (atomic_long_read(&iterator->rmap->val))
 			return;
 	}
 
@@ -2444,7 +2528,8 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			 * avoids retaining a large number of stale nested SPs.
 			 */
 			if (tdp_enabled && invalid_list &&
-			    child->role.guest_mode && !child->parent_ptes.val)
+			    child->role.guest_mode &&
+			    !atomic_long_read(&child->parent_ptes.val))
 				return kvm_mmu_prepare_zap_page(kvm, child,
 								invalid_list);
 		}
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v9 10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (8 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-04  0:40 ` [PATCH v9 11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns James Houghton
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

From: Sean Christopherson <seanjc@google.com>

Add a lockless version of for_each_rmap_spte(), which is pretty much the
same as the normal version, except that it doesn't BUG() the host if a
non-present SPTE is encountered.  When mmu_lock is held, it should be
impossible for a different task to zap a SPTE, _and_ zapped SPTEs must
be removed from their rmap chain prior to dropping mmu_lock.  Thus, the
normal walker BUG()s if a non-present SPTE is encountered as something is
wildly broken.

When walking rmaps without holding mmu_lock, the SPTEs pointed at by the
rmap chain can be zapped/dropped, and so a lockless walk can observe a
non-present SPTE if it runs concurrently with a different operation that
is zapping SPTEs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 135 ++++++++++++++++++++++++++++-------------
 1 file changed, 94 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 267cf2d4c3e3..a0f735eeaaeb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -876,10 +876,12 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  */
 #define KVM_RMAP_LOCKED	BIT(1)
 
-static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
 {
 	unsigned long old_val, new_val;
 
+	lockdep_assert_preemption_disabled();
+
 	/*
 	 * Elide the lock if the rmap is empty, as lockless walkers (read-only
 	 * mode) don't need to (and can't) walk an empty rmap, nor can they add
@@ -911,7 +913,7 @@ static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
 	/*
 	 * Use try_cmpxchg_acquire to prevent reads and writes to the rmap
 	 * from being reordered outside of the critical section created by
-	 * kvm_rmap_lock.
+	 * __kvm_rmap_lock.
 	 *
 	 * Pairs with smp_store_release in kvm_rmap_unlock.
 	 *
@@ -920,21 +922,42 @@ static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
 	 */
 	} while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));
 
-	/* Return the old value, i.e. _without_ the LOCKED bit set. */
+	/*
+	 * Return the old value, i.e. _without_ the LOCKED bit set.  It's
+	 * impossible for the return value to be 0 (see above), i.e. the read-
+	 * only unlock flow can't get a false positive and fail to unlock.
+	 */
 	return old_val;
 }
 
-static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
-			    unsigned long new_val)
+static unsigned long kvm_rmap_lock(struct kvm *kvm,
+				   struct kvm_rmap_head *rmap_head)
+{
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	return __kvm_rmap_lock(rmap_head);
+}
+
+static void __kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
+			      unsigned long val)
 {
-	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	KVM_MMU_WARN_ON(val & KVM_RMAP_LOCKED);
 	/*
 	 * Ensure that all accesses to the rmap have completed
 	 * before we actually unlock the rmap.
 	 *
-	 * Pairs with the atomic_long_try_cmpxchg_acquire in kvm_rmap_lock.
+	 * Pairs with the atomic_long_try_cmpxchg_acquire in __kvm_rmap_lock.
 	 */
-	atomic_long_set_release(&rmap_head->val, new_val);
+	atomic_long_set_release(&rmap_head->val, val);
+}
+
+static void kvm_rmap_unlock(struct kvm *kvm,
+			    struct kvm_rmap_head *rmap_head,
+			    unsigned long new_val)
+{
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	__kvm_rmap_unlock(rmap_head, new_val);
 }
 
 static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
@@ -942,17 +965,49 @@ static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
 	return atomic_long_read(&rmap_head->val) & ~KVM_RMAP_LOCKED;
 }
 
+/*
+ * If mmu_lock isn't held, rmaps can only be locked in read-only mode.  The
+ * actual locking is the same, but the caller is disallowed from modifying the
+ * rmap, and so the unlock flow is a nop if the rmap is/was empty.
+ */
+__maybe_unused
+static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
+{
+	unsigned long rmap_val;
+
+	preempt_disable();
+	rmap_val = __kvm_rmap_lock(rmap_head);
+
+	if (!rmap_val)
+		preempt_enable();
+
+	return rmap_val;
+}
+
+__maybe_unused
+static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
+				     unsigned long old_val)
+{
+	if (!old_val)
+		return;
+
+	KVM_MMU_WARN_ON(old_val != kvm_rmap_get(rmap_head));
+
+	__kvm_rmap_unlock(rmap_head, old_val);
+	preempt_enable();
+}
+
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
-static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
-			struct kvm_rmap_head *rmap_head)
+static int pte_list_add(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
+			u64 *spte, struct kvm_rmap_head *rmap_head)
 {
 	unsigned long old_val, new_val;
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	old_val = kvm_rmap_lock(rmap_head);
+	old_val = kvm_rmap_lock(kvm, rmap_head);
 
 	if (!old_val) {
 		new_val = (unsigned long)spte;
@@ -984,7 +1039,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		desc->sptes[desc->spte_count++] = spte;
 	}
 
-	kvm_rmap_unlock(rmap_head, new_val);
+	kvm_rmap_unlock(kvm, rmap_head, new_val);
 
 	return count;
 }
@@ -1032,7 +1087,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = kvm_rmap_lock(rmap_head);
+	rmap_val = kvm_rmap_lock(kvm, rmap_head);
 	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
 		goto out;
 
@@ -1058,7 +1113,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	}
 
 out:
-	kvm_rmap_unlock(rmap_head, rmap_val);
+	kvm_rmap_unlock(kvm, rmap_head, rmap_val);
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -1076,7 +1131,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = kvm_rmap_lock(rmap_head);
+	rmap_val = kvm_rmap_lock(kvm, rmap_head);
 	if (!rmap_val)
 		return false;
 
@@ -1095,7 +1150,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	kvm_rmap_unlock(rmap_head, 0);
+	kvm_rmap_unlock(kvm, rmap_head, 0);
 	return true;
 }
 
@@ -1168,23 +1223,18 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
 	unsigned long rmap_val = kvm_rmap_get(rmap_head);
-	u64 *sptep;
 
 	if (!rmap_val)
 		return NULL;
 
 	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_val;
-		goto out;
+		return (u64 *)rmap_val;
 	}
 
 	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
-	sptep = iter->desc->sptes[iter->pos];
-out:
-	BUG_ON(!is_shadow_present_pte(*sptep));
-	return sptep;
+	return iter->desc->sptes[iter->pos];
 }
 
 /*
@@ -1194,14 +1244,11 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
  */
 static u64 *rmap_get_next(struct rmap_iterator *iter)
 {
-	u64 *sptep;
-
 	if (iter->desc) {
 		if (iter->pos < PTE_LIST_EXT - 1) {
 			++iter->pos;
-			sptep = iter->desc->sptes[iter->pos];
-			if (sptep)
-				goto out;
+			if (iter->desc->sptes[iter->pos])
+				return iter->desc->sptes[iter->pos];
 		}
 
 		iter->desc = iter->desc->more;
@@ -1209,20 +1256,24 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
 		if (iter->desc) {
 			iter->pos = 0;
 			/* desc->sptes[0] cannot be NULL */
-			sptep = iter->desc->sptes[iter->pos];
-			goto out;
+			return iter->desc->sptes[iter->pos];
 		}
 	}
 
 	return NULL;
-out:
-	BUG_ON(!is_shadow_present_pte(*sptep));
-	return sptep;
 }
 
-#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_)			\
-	for (_spte_ = rmap_get_first(_rmap_head_, _iter_);		\
-	     _spte_; _spte_ = rmap_get_next(_iter_))
+#define __for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)	\
+	for (_sptep_ = rmap_get_first(_rmap_head_, _iter_);	\
+	     _sptep_; _sptep_ = rmap_get_next(_iter_))
+
+#define for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)			\
+	__for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)			\
+		if (!WARN_ON_ONCE(!is_shadow_present_pte(*(_sptep_))))
+
+#define for_each_rmap_spte_lockless(_rmap_head_, _iter_, _sptep_, _spte_)	\
+	__for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)			\
+		if (is_shadow_present_pte(_spte_ = mmu_spte_get_lockless(_sptep_)))
 
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
@@ -1308,12 +1359,13 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	struct rmap_iterator iter;
 	bool flush = false;
 
-	for_each_rmap_spte(rmap_head, &iter, sptep)
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
 		if (spte_ad_need_write_protect(*sptep))
 			flush |= test_and_clear_bit(PT_WRITABLE_SHIFT,
 						    (unsigned long *)sptep);
 		else
 			flush |= spte_clear_dirty(sptep);
+	}
 
 	return flush;
 }
@@ -1634,7 +1686,7 @@ static void __rmap_add(struct kvm *kvm,
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-	rmap_count = pte_list_add(cache, spte, rmap_head);
+	rmap_count = pte_list_add(kvm, cache, spte, rmap_head);
 
 	if (rmap_count > kvm->stat.max_mmu_rmap_size)
 		kvm->stat.max_mmu_rmap_size = rmap_count;
@@ -1768,13 +1820,14 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
-static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
+static void mmu_page_add_parent_pte(struct kvm *kvm,
+				    struct kvm_mmu_memory_cache *cache,
 				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
 	if (!parent_pte)
 		return;
 
-	pte_list_add(cache, parent_pte, &sp->parent_ptes);
+	pte_list_add(kvm, cache, parent_pte, &sp->parent_ptes);
 }
 
 static void mmu_page_remove_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
@@ -2464,7 +2517,7 @@ static void __link_shadow_page(struct kvm *kvm,
 
 	mmu_spte_set(sptep, spte);
 
-	mmu_page_add_parent_pte(cache, sp, sptep);
+	mmu_page_add_parent_pte(kvm, cache, sp, sptep);
 
 	/*
 	 * The non-direct sub-pagetable must be updated before linking.  For
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v9 11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (9 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs James Houghton
@ 2025-02-04  0:40 ` James Houghton
  2025-02-15  0:50 ` [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly Sean Christopherson
  2025-02-18 19:29 ` Maxim Levitsky
  12 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-04  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

From: Sean Christopherson <seanjc@google.com>

When A/D bits are supported on sptes, it is safe to simply clear the
Accessed bits.

The less obvious case is marking sptes for access tracking in the
non-A/D case (for EPT only). In this case, it must be safe for TLB
entries to exist for non-present sptes. For example, when doing dirty
tracking, a non-present SPTE must still be recognized as needing a TLB
invalidation.

This case is already supported today: KVM already supports *not* doing
TLBIs for clear_young(), as there is a separate notifier for clearing
*and* flushing, clear_flush_young(). This works because GET_DIRTY_LOG
flushes the TLB before returning to userspace.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 72 +++++++++++++++++++++++-------------------
 1 file changed, 39 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a0f735eeaaeb..57b99daa8614 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -970,7 +970,6 @@ static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
  * actual locking is the same, but the caller is disallowed from modifying the
  * rmap, and so the unlock flow is a nop if the rmap is/was empty.
  */
-__maybe_unused
 static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
 {
 	unsigned long rmap_val;
@@ -984,7 +983,6 @@ static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
 	return rmap_val;
 }
 
-__maybe_unused
 static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
 				     unsigned long old_val)
 {
@@ -1705,37 +1703,48 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 }
 
 static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
-				   struct kvm_gfn_range *range, bool test_only)
+				   struct kvm_gfn_range *range,
+				   bool test_only)
 {
-	struct slot_rmap_walk_iterator iterator;
+	struct kvm_rmap_head *rmap_head;
 	struct rmap_iterator iter;
+	unsigned long rmap_val;
 	bool young = false;
 	u64 *sptep;
+	gfn_t gfn;
+	int level;
+	u64 spte;
 
-	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
-				 range->start, range->end - 1, &iterator) {
-		for_each_rmap_spte(iterator.rmap, &iter, sptep) {
-			u64 spte = *sptep;
+	for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		for (gfn = range->start; gfn < range->end;
+		     gfn += KVM_PAGES_PER_HPAGE(level)) {
+			rmap_head = gfn_to_rmap(gfn, level, range->slot);
+			rmap_val = kvm_rmap_lock_readonly(rmap_head);
 
-			if (!is_accessed_spte(spte))
-				continue;
+			for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
+				if (!is_accessed_spte(spte))
+					continue;
+
+				if (test_only) {
+					kvm_rmap_unlock_readonly(rmap_head, rmap_val);
+					return true;
+				}
 
-			if (test_only)
-				return true;
-
-			if (spte_ad_enabled(spte)) {
-				clear_bit((ffs(shadow_accessed_mask) - 1),
-					(unsigned long *)sptep);
-			} else {
-				/*
-				 * WARN if mmu_spte_update() signals the need
-				 * for a TLB flush, as Access tracking a SPTE
-				 * should never trigger an _immediate_ flush.
-				 */
-				spte = mark_spte_for_access_track(spte);
-				WARN_ON_ONCE(mmu_spte_update(sptep, spte));
+				if (spte_ad_enabled(spte))
+					clear_bit((ffs(shadow_accessed_mask) - 1),
+						  (unsigned long *)sptep);
+				else
+					/*
+					 * If the following cmpxchg fails, the
+					 * spte is being concurrently modified
+					 * and should most likely stay young.
+					 */
+					cmpxchg64(sptep, spte,
+					      mark_spte_for_access_track(spte));
+				young = true;
 			}
-			young = true;
+
+			kvm_rmap_unlock_readonly(rmap_head, rmap_val);
 		}
 	}
 	return young;
@@ -1753,11 +1762,8 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
 
-	if (kvm_may_have_shadow_mmu_sptes(kvm)) {
-		write_lock(&kvm->mmu_lock);
+	if (kvm_may_have_shadow_mmu_sptes(kvm))
 		young |= kvm_rmap_age_gfn_range(kvm, range, false);
-		write_unlock(&kvm->mmu_lock);
-	}
 
 	return young;
 }
@@ -1769,11 +1775,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
 
-	if (!young && kvm_may_have_shadow_mmu_sptes(kvm)) {
-		write_lock(&kvm->mmu_lock);
+	if (young)
+		return young;
+
+	if (kvm_may_have_shadow_mmu_sptes(kvm))
 		young |= kvm_rmap_age_gfn_range(kvm, range, true);
-		write_unlock(&kvm->mmu_lock);
-	}
 
 	return young;
 }
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
  2025-02-04  0:40 ` [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn() James Houghton
@ 2025-02-12 22:07   ` Sean Christopherson
  2025-02-13  0:25     ` James Houghton
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2025-02-12 22:07 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Feb 04, 2025, James Houghton wrote:
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 22551e2f1d00..e984b440c0f0 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
>  		return true;
>  
>  	if (spte_ad_enabled(spte)) {
> -		if (!(spte & shadow_accessed_mask) ||
> -		    (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
> +		/*
> +		 * Do not check the Accessed bit. It can be set (by the CPU)
> +		 * and cleared (by kvm_tdp_mmu_age_spte()) without holding

When possible, don't reference functions by name in comments.  There are situations
where it's unavoidable, e.g. when calling out memory barrier pairs, but for cases
like this, it inevitably leads to stale code.

> +		 * the mmu_lock, but when clearing the Accessed bit, we do
> +		 * not invalidate the TLB, so we can already miss Accessed bit

No "we" in comments or changelog.

> +		 * updates.
> +		 */
> +		if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
>  			return true;

This 100% belongs in a separate prepatory patch.  And if it's moved to a separate
patch, then the rename can/should happen at the same time.

And with the Accessed check gone, and looking forward to the below change, I think
this is the perfect opportunity to streamline the final check to:

	return spte_ad_enabled(spte) && is_writable_pte(spte) &&
	       !(spte & shadow_dirty_mask);

No need to send another version, I'll move things around when applying.

Also, as discussed off-list I'm 99% certain that with the lockless aging, KVM
must atomically update A/D-disabled SPTEs, as the SPTE can be access-tracked and
restored outside of mmu_lock.  E.g. a task that holds mmu_lock and is clearing
the writable bit needs to update it atomically to prevent its change from
being lost.

I'll slot this in:

--
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 12 Feb 2025 12:58:39 -0800
Subject: [PATCH 03/10] KVM: x86/mmu: Always update A/D-disabled SPTEs
 atomically

In anticipation of aging SPTEs outside of mmu_lock, force A/D-disabled
SPTEs to be updated atomically, as aging A/D-disabled SPTEs will mark them
for access-tracking outside of mmu_lock.  Coupled with restoring access-
tracked SPTEs in the fast page fault handler, the end result is that
A/D-disabled SPTEs will be volatile at all times.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/spte.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 9663ba867178..0f9f47b4ab0e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -141,8 +141,11 @@ bool spte_needs_atomic_update(u64 spte)
 	if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
 		return true;
 
-	/* Access-tracked SPTEs can be restored by KVM's fast page fault handler. */
-	if (is_access_track_spte(spte))
+	/*
+	 * A/D-disabled SPTEs can be access-tracked by aging, and access-tracked
+	 * SPTEs can be restored by KVM's fast page fault handler.
+	 */
+	if (!spte_ad_enabled(spte))
 		return true;
 
 	/*
@@ -151,8 +154,7 @@ bool spte_needs_atomic_update(u64 spte)
 	 * invalidate TLBs when aging SPTEs, and so it's safe to clobber the
 	 * Accessed bit (and rare in practice).
 	 */
-	return spte_ad_enabled(spte) && is_writable_pte(spte) &&
-	       !(spte & shadow_dirty_mask);
+	return is_writable_pte(spte) && !(spte & shadow_dirty_mask);
 }
 
 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
--

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write()
  2025-02-04  0:40 ` [PATCH v9 05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write() James Houghton
@ 2025-02-12 22:09   ` Sean Christopherson
  2025-02-13  0:26     ` James Houghton
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2025-02-12 22:09 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Feb 04, 2025, James Houghton wrote:
> spte_has_volatile_bits() is now a misnomer, as an SPTE can have its
> Accessed bit set or cleared without the mmu_lock held, but the state of
> the Accessed bit is not checked in spte_has_volatile_bits().
> Even if a caller uses spte_needs_atomic_write(), Accessed bit
> information may still be lost, but that is already tolerated, as the TLB
> is not invalidated after the Accessed bit is cleared.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---

...

> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 59746854c0af..4c290ae9a02a 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -519,7 +519,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
>  	return gen;
>  }
>  
> -bool spte_has_volatile_bits(u64 spte);
> +bool spte_needs_atomic_write(u64 spte);
>  
>  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	       const struct kvm_memory_slot *slot,
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 05e9d678aac9..b54123163efc 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -55,7 +55,7 @@ static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
>  {
>  	return is_shadow_present_pte(old_spte) &&
>  	       is_last_spte(old_spte, level) &&
> -	       spte_has_volatile_bits(old_spte);
> +	       spte_needs_atomic_write(old_spte);

Unless you object, I'll change this to spte_needs_atomic_update(), and tweak
kvm_tdp_mmu_spte_need_atomic_write() accordingly.  "write" was a bad choice by
me.  It's not just the store/write that needs to be atomic, it's the entire
read-modify-write.  E.g. KVM needs to preserve the existing value, but for many
flows, it's even more important that KVM's snapshot of the old SPTE is accurate.

>  }
>  
>  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> -- 
> 2.48.1.362.g079036d154-goog
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
  2025-02-12 22:07   ` Sean Christopherson
@ 2025-02-13  0:25     ` James Houghton
  0 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-13  0:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Wed, Feb 12, 2025 at 2:07 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Feb 04, 2025, James Houghton wrote:
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index 22551e2f1d00..e984b440c0f0 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
> >               return true;
> >
> >       if (spte_ad_enabled(spte)) {
> > -             if (!(spte & shadow_accessed_mask) ||
> > -                 (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
> > +             /*
> > +              * Do not check the Accessed bit. It can be set (by the CPU)
> > +              * and cleared (by kvm_tdp_mmu_age_spte()) without holding
>
> When possible, don't reference functions by name in comments.  There are situations
> where it's unavoidable, e.g. when calling out memory barrier pairs, but for cases
> like this, it inevitably leads to stale code.

Good point. Thanks.

>
> > +              * the mmu_lock, but when clearing the Accessed bit, we do
> > +              * not invalidate the TLB, so we can already miss Accessed bit
>
> No "we" in comments or changelog.

Ah my mistake.

> > +              * updates.
> > +              */
> > +             if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
> >                       return true;
>
> This 100% belongs in a separate prepatory patch.  And if it's moved to a separate
> patch, then the rename can/should happen at the same time.
>
> And with the Accessed check gone, and looking forward to the below change, I think
> this is the perfect opportunity to streamline the final check to:
>
>         return spte_ad_enabled(spte) && is_writable_pte(spte) &&
>                !(spte & shadow_dirty_mask);

LGTM.

> No need to send another version, I'll move things around when applying.

Thanks!

> Also, as discussed off-list I'm 99% certain that with the lockless aging, KVM
> must atomically update A/D-disabled SPTEs, as the SPTE can be access-tracked and
> restored outside of mmu_lock.  E.g. a task that holds mmu_lock and is clearing
> the writable bit needs to update it atomically to prevent its change from
> being lost.

Yeah you're right. This logic applies to A/D-enabled SPTEs too, just
that we choose to tolerate failures to update the Accessed bit. But in
the case of A/D-disabled SPTEs, we can't do that. So this makes sense.
Thanks!

> I'll slot this in:
>
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Wed, 12 Feb 2025 12:58:39 -0800
> Subject: [PATCH 03/10] KVM: x86/mmu: Always update A/D-disabled SPTEs
>  atomically
>
> In anticipation of aging SPTEs outside of mmu_lock, force A/D-disabled
> SPTEs to be updated atomically, as aging A/D-disabled SPTEs will mark them
> for access-tracking outside of mmu_lock.  Coupled with restoring access-
> tracked SPTEs in the fast page fault handler, the end result is that
> A/D-disabled SPTEs will be volatile at all times.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: James Houghton <jthoughton@google.com>

> ---
>  arch/x86/kvm/mmu/spte.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 9663ba867178..0f9f47b4ab0e 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -141,8 +141,11 @@ bool spte_needs_atomic_update(u64 spte)
>         if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
>                 return true;
>
> -       /* Access-tracked SPTEs can be restored by KVM's fast page fault handler. */
> -       if (is_access_track_spte(spte))
> +       /*
> +        * A/D-disabled SPTEs can be access-tracked by aging, and access-tracked
> +        * SPTEs can be restored by KVM's fast page fault handler.
> +        */
> +       if (!spte_ad_enabled(spte))
>                 return true;
>
>         /*
> @@ -151,8 +154,7 @@ bool spte_needs_atomic_update(u64 spte)
>          * invalidate TLBs when aging SPTEs, and so it's safe to clobber the
>          * Accessed bit (and rare in practice).
>          */
> -       return spte_ad_enabled(spte) && is_writable_pte(spte) &&
> -              !(spte & shadow_dirty_mask);
> +       return is_writable_pte(spte) && !(spte & shadow_dirty_mask);
>  }
>
>  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> --

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write()
  2025-02-12 22:09   ` Sean Christopherson
@ 2025-02-13  0:26     ` James Houghton
  0 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-13  0:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Wed, Feb 12, 2025 at 2:09 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Feb 04, 2025, James Houghton wrote:
> > spte_has_volatile_bits() is now a misnomer, as an SPTE can have its
> > Accessed bit set or cleared without the mmu_lock held, but the state of
> > the Accessed bit is not checked in spte_has_volatile_bits().
> > Even if a caller uses spte_needs_atomic_write(), Accessed bit
> > information may still be lost, but that is already tolerated, as the TLB
> > is not invalidated after the Accessed bit is cleared.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
>
> ...
>
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index 59746854c0af..4c290ae9a02a 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -519,7 +519,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
> >       return gen;
> >  }
> >
> > -bool spte_has_volatile_bits(u64 spte);
> > +bool spte_needs_atomic_write(u64 spte);
> >
> >  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >              const struct kvm_memory_slot *slot,
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index 05e9d678aac9..b54123163efc 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -55,7 +55,7 @@ static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
> >  {
> >       return is_shadow_present_pte(old_spte) &&
> >              is_last_spte(old_spte, level) &&
> > -            spte_has_volatile_bits(old_spte);
> > +            spte_needs_atomic_write(old_spte);
>
> Unless you object, I'll change this to spte_needs_atomic_update(), and tweak
> kvm_tdp_mmu_spte_need_atomic_write() accordingly.  "write" was a bad choice by
> me.  It's not just the store/write that needs to be atomic, it's the entire
> read-modify-write.  E.g. KVM needs to preserve the existing value, but for many
> flows, it's even more important that KVM's snapshot of the old SPTE is accurate.

No objections, please make that change. Thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 02/11] KVM: Add lockless memslot walk to KVM
  2025-02-04  0:40 ` [PATCH v9 02/11] KVM: Add lockless memslot walk to KVM James Houghton
@ 2025-02-14 15:26   ` Sean Christopherson
  2025-02-14 19:27     ` James Houghton
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2025-02-14 15:26 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

It's not a lockless walk of the memslots.  The walk of memslots is already
"lockless" in that the memslots are protected by SRCU, not by mmu_lock.

On Tue, Feb 04, 2025, James Houghton wrote:
> It is possible to correctly do aging without taking the KVM MMU lock;
> this option allows such architectures to do so. Architectures that
> select CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS are responsible for
> correctness.
> 
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> Reviewed-by: David Matlack <dmatlack@google.com>
> ---
>  include/linux/kvm_host.h |  1 +
>  virt/kvm/Kconfig         |  2 ++
>  virt/kvm/kvm_main.c      | 24 +++++++++++++++++-------
>  3 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f34f4cfaa513..c28a6aa1f2ed 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -267,6 +267,7 @@ struct kvm_gfn_range {
>  	union kvm_mmu_notifier_arg arg;
>  	enum kvm_gfn_range_filter attr_filter;
>  	bool may_block;
> +	bool lockless;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 54e959e7d68f..9356f4e4e255 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -102,6 +102,8 @@ config KVM_GENERIC_MMU_NOTIFIER
>  
>  config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
>         depends on KVM_GENERIC_MMU_NOTIFIER
> +
> +config KVM_MMU_NOTIFIER_AGING_LOCKLESS
>         bool

As noted by Stephen[*], this steals the "bool" from KVM_ELIDE_TLB_FLUSH_IF_YOUNG.

Looking at it with fresh eyes, it also fails to take a dependency on
KVM_GENERIC_MMU_NOTIFIER.

Lastly, the name is unnecessarily long.  The "NOTIFIER" part is superfluous and
can be dropped, as it's a property of the architecture's MMU, not of KVM's
mmu_notifier implementation. E.g. if KVM ever did aging outside of the notifier,
then this Kconfig would be relevant for that flow as well.  The dependency on
KVM_GENERIC_MMU_NOTIFIER is what communicates that it's currently used only by
mmu_notifier aging.

Actually, I take "Lastly" back.  IMO, it reads much better as LOCKLESS_AGING,
because LOCKLESS is an adjective that describes the AGING process.

[*] https://lore.kernel.org/all/20250214181401.4e7dd91d@canb.auug.org.au

TL;DR: I'm squashing this:

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index f0a60e59c884..fe8ea8c097de 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,7 +22,7 @@ config KVM_X86
        select KVM_COMMON
        select KVM_GENERIC_MMU_NOTIFIER
        select KVM_ELIDE_TLB_FLUSH_IF_YOUNG
-       select KVM_MMU_NOTIFIER_AGING_LOCKLESS
+       select KVM_MMU_LOCKLESS_AGING
        select HAVE_KVM_IRQCHIP
        select HAVE_KVM_PFNCACHE
        select HAVE_KVM_DIRTY_RING_TSO
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 9356f4e4e255..746e1f466aa6 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -102,8 +102,10 @@ config KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
        depends on KVM_GENERIC_MMU_NOTIFIER
+       bool
 
-config KVM_MMU_NOTIFIER_AGING_LOCKLESS
+config KVM_MMU_LOCKLESS_AGING
+       depends on KVM_GENERIC_MMU_NOTIFIER
        bool
 
 config KVM_GENERIC_MEMORY_ATTRIBUTES
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e514e3db1b31..201c14ff476f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -655,8 +655,7 @@ static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
                .on_lock        = (void *)kvm_null_fn,
                .flush_on_ret   = flush_on_ret,
                .may_block      = false,
-               .lockless       =
-                       IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS),
+               .lockless       = IS_ENABLED(CONFIG_KVM_MMU_LOCKLESS_AGING),
        };
 
        return kvm_handle_hva_range(kvm, &range).ret;

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 02/11] KVM: Add lockless memslot walk to KVM
  2025-02-14 15:26   ` Sean Christopherson
@ 2025-02-14 19:27     ` James Houghton
  0 siblings, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-14 19:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Feb 14, 2025 at 7:27 AM Sean Christopherson <seanjc@google.com> wrote:
>
> It's not a lockless walk of the memslots.  The walk of memslots is already
> "lockless" in that the memslots are protected by SRCU, not by mmu_lock.

Indeed, so I guess I should have said something like "Allow memslot
walk callbacks to be lockless"

>
> On Tue, Feb 04, 2025, James Houghton wrote:
> > It is possible to correctly do aging without taking the KVM MMU lock;
> > this option allows such architectures to do so. Architectures that
> > select CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS are responsible for
> > correctness.
> >
> > Suggested-by: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > Reviewed-by: David Matlack <dmatlack@google.com>
> > ---
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/Kconfig         |  2 ++
> >  virt/kvm/kvm_main.c      | 24 +++++++++++++++++-------
> >  3 files changed, 20 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index f34f4cfaa513..c28a6aa1f2ed 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -267,6 +267,7 @@ struct kvm_gfn_range {
> >       union kvm_mmu_notifier_arg arg;
> >       enum kvm_gfn_range_filter attr_filter;
> >       bool may_block;
> > +     bool lockless;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 54e959e7d68f..9356f4e4e255 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -102,6 +102,8 @@ config KVM_GENERIC_MMU_NOTIFIER
> >
> >  config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
> >         depends on KVM_GENERIC_MMU_NOTIFIER
> > +
> > +config KVM_MMU_NOTIFIER_AGING_LOCKLESS
> >         bool
>
> As noted by Stephen[*], this steals the "bool" from KVM_ELIDE_TLB_FLUSH_IF_YOUNG.

Ah sorry!

> Looking at it with fresh eyes, it also fails to take a dependency on
> KVM_GENERIC_MMU_NOTIFIER.

Indeed, thanks.

> Lastly, the name is unnecessarily long.  The "NOTIFIER" part is superfluous and
> can be dropped, as it's a property of the architecture's MMU, not of KVM's
> mmu_notifier implementation. E.g. if KVM ever did aging outside of the notifier,
> then this Kconfig would be relevant for that flow as well.  The dependency on
> KVM_GENERIC_MMU_NOTIFIER is what communicates that it's currently used only by
> mmu_notifier aging.
>
> Actually, I take "Lastly" back.  IMO, it reads much better as LOCKLESS_AGING,
> because LOCKLESS is an adjective that describes the AGING process.
>
> [*] https://lore.kernel.org/all/20250214181401.4e7dd91d@canb.auug.org.au
>
> TL;DR: I'm squashing this:
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index f0a60e59c884..fe8ea8c097de 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -22,7 +22,7 @@ config KVM_X86
>         select KVM_COMMON
>         select KVM_GENERIC_MMU_NOTIFIER
>         select KVM_ELIDE_TLB_FLUSH_IF_YOUNG
> -       select KVM_MMU_NOTIFIER_AGING_LOCKLESS
> +       select KVM_MMU_LOCKLESS_AGING
>         select HAVE_KVM_IRQCHIP
>         select HAVE_KVM_PFNCACHE
>         select HAVE_KVM_DIRTY_RING_TSO
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 9356f4e4e255..746e1f466aa6 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -102,8 +102,10 @@ config KVM_GENERIC_MMU_NOTIFIER
>
>  config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
>         depends on KVM_GENERIC_MMU_NOTIFIER
> +       bool
>
> -config KVM_MMU_NOTIFIER_AGING_LOCKLESS
> +config KVM_MMU_LOCKLESS_AGING
> +       depends on KVM_GENERIC_MMU_NOTIFIER
>         bool
>
>  config KVM_GENERIC_MEMORY_ATTRIBUTES
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e514e3db1b31..201c14ff476f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -655,8 +655,7 @@ static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
>                 .on_lock        = (void *)kvm_null_fn,
>                 .flush_on_ret   = flush_on_ret,
>                 .may_block      = false,
> -               .lockless       =
> -                       IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS),
> +               .lockless       = IS_ENABLED(CONFIG_KVM_MMU_LOCKLESS_AGING),
>         };
>
>         return kvm_handle_hva_range(kvm, &range).ret;

LGTM, thanks! You will also need to do this same rename in patch 4[1].

[1]: https://lore.kernel.org/kvm/20250204004038.1680123-5-jthoughton@google.com/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (10 preceding siblings ...)
  2025-02-04  0:40 ` [PATCH v9 11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns James Houghton
@ 2025-02-15  0:50 ` Sean Christopherson
  2025-02-18 19:29 ` Maxim Levitsky
  12 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2025-02-15  0:50 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, James Houghton
  Cc: David Matlack, David Rientjes, Marc Zyngier, Oliver Upton, Wei Xu,
	Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, 04 Feb 2025 00:40:27 +0000, James Houghton wrote:
> By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> waiting for aging. This contention reduction improves guest performance
> and saves a significant amount of Google Cloud's CPU usage, and it has
> valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> 
> Please see v8[8] for some performance results using
> access_tracking_perf_test patched to use MGLRU.
> 
> [...]

Applied to kvm-x86 mmu, thanks!

[01/11] KVM: Rename kvm_handle_hva_range()
        https://github.com/kvm-x86/linux/commit/374ccd63600b
[02/11] KVM: Add lockless memslot walk to KVM
        https://github.com/kvm-x86/linux/commit/aa34b811650c
[03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
        https://github.com/kvm-x86/linux/commit/e29b74920e6f
[04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
        https://github.com/kvm-x86/linux/commit/b146a9b34aed
[05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write()
        https://github.com/kvm-x86/linux/commit/61d65f2dc766
[06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young
        https://github.com/kvm-x86/linux/commit/e25c2332346f
[07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
        https://github.com/kvm-x86/linux/commit/8c403cf23119
[08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
        https://github.com/kvm-x86/linux/commit/9fb13ba6b5ff
[09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
        https://github.com/kvm-x86/linux/commit/4834eaded91e
[10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
        https://github.com/kvm-x86/linux/commit/bb6c7749ccee
[11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
        https://github.com/kvm-x86/linux/commit/af3b6a9eba48

--
https://github.com/kvm-x86/linux/tree/next


* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (11 preceding siblings ...)
  2025-02-15  0:50 ` [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly Sean Christopherson
@ 2025-02-18 19:29 ` Maxim Levitsky
  2025-02-19  1:13   ` Sean Christopherson
  12 siblings, 1 reply; 27+ messages in thread
From: Maxim Levitsky @ 2025-02-18 19:29 UTC (permalink / raw)
  To: James Houghton, Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, Marc Zyngier, Oliver Upton, Wei Xu,
	Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> waiting for aging. This contention reduction improves guest performance
> and saves a significant amount of Google Cloud's CPU usage, and it has
> valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> 
> Please see v8[8] for some performance results using
> access_tracking_perf_test patched to use MGLRU.
> 
> Neither access_tracking_perf_test nor mmu_stress_test trigger any
> splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.


Hi, I have a question about this patch series and about the access_tracking_perf_test:

Some time ago, I investigated a failure in access_tracking_perf_test which shows up in our CI.

The root cause was that 'folio_clear_idle' doesn't clear the idle bit when MGLRU is enabled,
and overall I got the impression that MGLRU is not compatible with idle page tracking.


I thought that this patch series and the 'mm: multi-gen LRU: Have secondary MMUs participate in MM_WALK' 
patch series could address this but the test still fails.


For the reference the exact problem is:

1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap

2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.

3. NUMA autobalancing code write-protects the guest memory. In response, KVM evicts the SPTE
   mappings with the A/D bits set, and while doing so tells mm that the pages were accessed
   using 'folio_mark_accessed' (via kvm_set_page_accessed (*) ), but due to MGLRU the call
   doesn't clear the idle bit, so all traces of the guest access disappear and the kernel
   thinks that the pages are still idle.

I can say that the root cause of this is that folio_mark_accessed doesn't do what it is supposed to do.

Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
will probably fix this, but I don't have enough confidence to say whether that
is all that is needed. If it is, I can send a patch.


This patch makes the test pass (but only on 6.12 kernel and below, see below):

diff --git a/mm/swap.c b/mm/swap.c
index 59f30a981c6f..2013e1f4d572 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
 {
        if (lru_gen_enabled()) {
                folio_inc_refs(folio);
-               return;
+               goto clear_idle_bit;
        }
 
        if (!folio_test_referenced(folio)) {
@@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
                folio_clear_referenced(folio);
                workingset_activation(folio);
        }
+clear_idle_bit:
        if (folio_test_idle(folio))
                folio_clear_idle(folio);
 }


To reproduce this reliably, it is best to patch the test to run in a loop, as
below (although the test fails without this as well).


diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..829774e325fa 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -131,6 +131,7 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
        uint64_t pages = vcpu_args->pages;
        uint64_t page;
        uint64_t still_idle = 0;
+       uint64_t failed_to_mark_idle = 0;
        uint64_t no_pfn = 0;
        int page_idle_fd;
        int pagemap_fd;
@@ -160,6 +161,14 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
                }
 
                mark_page_idle(page_idle_fd, pfn);
+
+
+                if (!is_page_idle(page_idle_fd, pfn)) {
+                        failed_to_mark_idle++;
+                        continue;
+                }
+
+
        }
 
        /*
@@ -183,16 +192,15 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
         * explicitly flush the TLB when aging SPTEs.  As a result, more pages
         * are cached and the guest won't see the "idle" bit cleared.
         */
-       if (still_idle >= pages / 10) {
+       //if (still_idle >= pages / 10) {
 #ifdef __x86_64__
-               TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
-                           "vCPU%d: Too many pages still idle (%lu out of %lu)",
-                           vcpu_idx, still_idle, pages);
+       //      TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+       //                  "vCPU%d: Too many pages still idle (%lu out of %lu)",
+       //                  vcpu_idx, still_idle, pages);
 #endif
-               printf("WARNING: vCPU%d: Too many pages still idle (%lu out of %lu), "
-                      "this will affect performance results.\n",
-                      vcpu_idx, still_idle, pages);
-       }
+               printf("vCPU%d: idle pages: %lu out of %lu, failed to mark idle: %lu no pfn: %lu\n",
+                      vcpu_idx, still_idle, pages, failed_to_mark_idle, no_pfn);
+       //}
 
        close(page_idle_fd);
        close(pagemap_fd);
@@ -315,14 +323,16 @@ static void run_test(enum vm_guest_mode mode, void *arg)
        access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
 
        /* As a control, read and write to the populated memory first. */
-       access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to populated memory");
-       access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
+       //access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to populated memory");
+       //access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
 
        /* Repeat on memory that has been marked as idle. */
+again:
        mark_memory_idle(vm, nr_vcpus);
        access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
-       mark_memory_idle(vm, nr_vcpus);
-       access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
+       //mark_memory_idle(vm, nr_vcpus);
+       //access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
+       goto again;
 
        memstress_join_vcpu_threads(nr_vcpus);
        memstress_destroy_vm(vm);


With the above patch applied, you will notice after 4-6 iterations that the number of still idle
pages soars:

Populating memory             : 0.798882357s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 3.003939277s
Writing to idle memory        : 0.503653562s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 3.060128175s
Writing to idle memory        : 0.502705587s
vCPU0: idle pages: 2048 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 3.039294079s
Writing to idle memory        : 0.092227612s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 3.046216234s
Writing to idle memory        : 0.295077724s
vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 2.711946690s
Writing to idle memory        : 0.302882502s

...



(*) It turns out that since kernel 6.13, the code that sets the accessed bit in the primary
paging structures when the secondary mapping is zapped has been *removed*. I bisected this to commit:

66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs

So now access_tracking_perf_test is broken regardless of MGLRU.

Any ideas on how to fix all this mess?


Best regards,
	Maxim Levitsky


> 
> === Previous Versions ===
> 
> Since v8[8]:
>  - Re-added the kvm_handle_hva_range helpers and applied Sean's
>    kvm_{handle -> age}_hva_range rename.
>  - Renamed spte_has_volatile_bits() to spte_needs_atomic_write() and
>    removed its Accessed bit check. Undid change to
>    tdp_mmu_spte_need_atomic_write().
>  - Renamed KVM_MMU_NOTIFIER_{YOUNG -> AGING}_LOCKLESS.
>  - cpu_relax(), lockdep, preempt_disable(), and locking fixups for
>    per-rmap lock (thanks Lai and Sean).
>  - Renamed kvm_{has -> may_have}_shadow_mmu_sptes().
>  - Rebased onto latest kvm/next, including changing
>    for_each_tdp_mmu_root_rcu to use `types`.
>  - Dropped MGLRU changes from access_tracking_perf_test.
>  - Picked up Acked-bys from Yu. (thank you!)
> 
> Since v7[7]:
>  - Dropped MGLRU changes.
>  - Dropped DAMON cleanup.
>  - Dropped MMU notifier changes completely.
>  - Made shadow MMU aging *always* lockless, not just lockless when the
>    now-removed "fast_only" clear notifier was used.
>  - Given that the MGLRU changes no longer introduce a new MGLRU
>    capability, drop the new capability check from the selftest.
>  - Rebased on top of latest kvm-x86/next, including the x86 mmu changes
>    for marking pages as dirty.
> 
> Since v6[6]:
>  - Rebased on top of kvm-x86/next and Sean's lockless rmap walking
>    changes.
>  - Removed HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY (thanks DavidM).
>  - Split up kvm_age_gfn() / kvm_test_age_gfn() optimizations (thanks
>    DavidM and Sean).
>  - Improved new MMU notifier documentation (thanks DavidH).
>  - Dropped arm64 locking change.
>  - No longer retry for CAS failure in TDP MMU non-A/D case (thanks
>    Sean).
>  - Added some R-bys and A-bys.
> 
> Since v5[5]:
>  - Reworked test_clear_young_fast_only() into a new parameter for the
>    existing notifiers (thanks Sean).
>  - Added mmu_notifier.has_fast_aging to tell mm if calling fast-only
>    notifiers should be done.
>  - Added mm_has_fast_young_notifiers() to inform users if calling
>    fast-only notifier helpers is worthwhile (for look-around to use).
>  - Changed MGLRU to invoke a single notifier instead of two when
>    aging and doing look-around (thanks Yu).
>  - For KVM/x86, check indirect_shadow_pages > 0 instead of
>    kvm_memslots_have_rmaps() when collecting age information
>    (thanks Sean).
>  - For KVM/arm, some fixes from Oliver.
>  - Small fixes to access_tracking_perf_test.
>  - Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().
> 
> Since v4[4]:
>  - Removed Kconfig that controlled when aging was enabled. Aging will
>    be done whenever the architecture supports it (thanks Yu).
>  - Added a new MMU notifier, test_clear_young_fast_only(), specifically
>    for MGLRU to use.
>  - Add kvm_fast_{test_,}age_gfn, implemented by x86.
>  - Fix locking for clear_flush_young().
>  - Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
>    (thanks Sean).
>  - Fix WARN_ON and other cleanup for the arm64 locking changes
>    (thanks Oliver).
> 
> Since v3[3]:
>  - Vastly simplified the series (thanks David). Removed mmu notifier
>    batching logic entirely.
>  - Cleaned up how locking is done for mmu_notifier_test/clear_young
>    (thanks David).
>  - Look-around is now only done when there are no secondary MMUs
>    subscribed to MMU notifiers.
>  - CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
>  - Fixed the lockless implementation of kvm_{test,}age_gfn for x86
>    (thanks David).
>  - Added MGLRU functional and performance tests to
>    access_tracking_perf_test (thanks Axel).
>  - In v3, an mm would be completely ignored (for aging) if there was a
>    secondary MMU but support for secondary MMU walking was missing. Now,
>    missing secondary MMU walking support simply skips the notifier
>    calls (except for eviction).
>  - Added a sanity check for that range->lockless and range->on_lock are
>    never both provided for the memslot walk.
> 
> For the changes since v2[2], see v3.
> 
> Based on latest kvm/next.
> 
> [1]: https://lore.kernel.org/kvm/CAOUHufYS0XyLEf_V+q5SCW54Zy2aW5nL8CnSWreM8d1rX5NKYg@mail.gmail.com/
> [2]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/
> [3]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
> [4]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
> [5]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
> [6]: https://lore.kernel.org/linux-mm/20240724011037.3671523-1-jthoughton@google.com/
> [7]: https://lore.kernel.org/kvm/20240926013506.860253-1-jthoughton@google.com/
> [8]: https://lore.kernel.org/kvm/20241105184333.2305744-1-jthoughton@google.com/
> 
> James Houghton (7):
>   KVM: Rename kvm_handle_hva_range()
>   KVM: Add lockless memslot walk to KVM
>   KVM: x86/mmu: Factor out spte atomic bit clearing routine
>   KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
>   KVM: x86/mmu: Rename spte_has_volatile_bits() to
>     spte_needs_atomic_write()
>   KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as
>     young
>   KVM: x86/mmu: Only check gfn age in shadow MMU if
>     indirect_shadow_pages > 0
> 
> Sean Christopherson (4):
>   KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
>     mmu_lock
>   KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
>     mmu_lock
>   KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
>   KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
>     gfns
> 
>  Documentation/virt/kvm/locking.rst |   4 +-
>  arch/x86/include/asm/kvm_host.h    |   4 +-
>  arch/x86/kvm/Kconfig               |   1 +
>  arch/x86/kvm/mmu/mmu.c             | 364 +++++++++++++++++++++--------
>  arch/x86/kvm/mmu/spte.c            |  19 +-
>  arch/x86/kvm/mmu/spte.h            |   2 +-
>  arch/x86/kvm/mmu/tdp_iter.h        |  26 ++-
>  arch/x86/kvm/mmu/tdp_mmu.c         |  36 ++-
>  include/linux/kvm_host.h           |   1 +
>  virt/kvm/Kconfig                   |   2 +
>  virt/kvm/kvm_main.c                |  56 +++--
>  11 files changed, 364 insertions(+), 151 deletions(-)
> 
> 
> base-commit: f7bafceba76e9ab475b413578c1757ee18c3e44b


* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-18 19:29 ` Maxim Levitsky
@ 2025-02-19  1:13   ` Sean Christopherson
  2025-02-19 18:56     ` James Houghton
  2025-02-25 22:00     ` Maxim Levitsky
  0 siblings, 2 replies; 27+ messages in thread
From: Sean Christopherson @ 2025-02-19  1:13 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: James Houghton, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > waiting for aging. This contention reduction improves guest performance
> > and saves a significant amount of Google Cloud's CPU usage, and it has
> > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> > 
> > Please see v8[8] for some performance results using
> > access_tracking_perf_test patched to use MGLRU.
> > 
> > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
> 
> 
> Hi, I have a question about this patch series and about the
> access_tracking_perf_test:
> 
> Some time ago, I investigated a failure in access_tracking_perf_test which
> shows up in our CI.
> 
> The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> MGLRU is enabled, and overall I got the impression that MGLRU is not
> compatible with idle page tracking.
>
> I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> MMUs participate in MM_WALK' patch series could address this but the test
> still fails.
> 
> 
> For the reference the exact problem is:
> 
> 1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
> 
> 2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
> 
> 3. A NUMA autobalance code write protects the guest memory. KVM in response
>    evicts the SPTE mappings with A/D bit set, and while doing so tells mm
>    that pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) )
>    but due to MGLRU the call doesn't clear the idle bit and thus all the traces
>    of the guest access disappear and the kernel thinks that the page is still idle.
> 
> I can say that the root cause of this is that folio_mark_accessed doesn't do
> what it is supposed to do.
> 
> Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
> will probably fix this but I don't have enough confidence to say if this is
> all that is needed to fix this.  If this is the case I can send a patch.

My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
effectively isn't supported by MGLRU.

[1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

> This patch makes the test pass (but only on 6.12 kernel and below, see below):
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 59f30a981c6f..2013e1f4d572 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
>  {
>         if (lru_gen_enabled()) {
>                 folio_inc_refs(folio);
> -               return;
> +               goto clear_idle_bit;
>         }
>  
>         if (!folio_test_referenced(folio)) {
> @@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
>                 folio_clear_referenced(folio);
>                 workingset_activation(folio);
>         }
> +clear_idle_bit:
>         if (folio_test_idle(folio))
>                 folio_clear_idle(folio);
>  }
> 
> 
> To always reproduce this, it is best to use a patch to make the test run in a
> loop, like below (although the test fails without this as well).

..

> With the above patch applied, you will notice after 4-6 iterations that the
> number of still idle pages soars:
> 
> Populating memory             : 0.798882357s

...

> vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
> Mark memory idle              : 2.711946690s
> Writing to idle memory        : 0.302882502s
> 
> ...
> 
> (*) Turns out that since kernel 6.13, this code that sets accessed bit in the
> primary paging structure, when the secondary was zapped was *removed*. I
> bisected this to commit:
> 
> 66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
> 
> So now the access_tracking_test is broken regardless of MGLRU.

Just to confirm, do you see failures on 6.13 with MGLRU disabled?  

> Any ideas on how to fix all this mess?

The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
enabled.  In a real-world scenario, if the guest is actually accessing the pages
that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
faulted back in.  There's a window where page_idle/bitmap could be read between
making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
one of the many flaws of NUMA balancing.

That said, one thing is quite odd.  In the failing case, *half* of the guest pages
are still idle.  That's quite insane.

Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
different node, and that causes NUMA balancing to go crazy and zap pretty much
all of guest memory.  If that's what's happening, then a better solution for the
NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
pin it to a single pCPU?).
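The hard-pinning variant of that suggestion is easy to sketch from userspace with
sched_setaffinity(2). The helper below is hypothetical (not part of the selftest), and in the
real test it would have to run in each vCPU thread before memory is populated:

```python
import os

def pin_to_cpu(cpu: int, pid: int = 0) -> set:
    """Restrict pid (0 = the calling thread) to a single pCPU, so NUMA
    autobalancing never observes cross-node accesses for this task and
    has no reason to PROT_NONE the guest memory."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

Affining to a node rather than one pCPU would instead pass the set of that node's CPUs.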


* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-19  1:13   ` Sean Christopherson
@ 2025-02-19 18:56     ` James Houghton
  2025-02-25 22:00     ` Maxim Levitsky
  1 sibling, 0 replies; 27+ messages in thread
From: James Houghton @ 2025-02-19 18:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Feb 18, 2025 at 5:13 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > > waiting for aging. This contention reduction improves guest performance
> > > and saves a significant amount of Google Cloud's CPU usage, and it has
> > > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> > >
> > > Please see v8[8] for some performance results using
> > > access_tracking_perf_test patched to use MGLRU.
> > >
> > > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
> >
> >
> > Hi, I have a question about this patch series and about the
> > access_tracking_perf_test:
> >
> > Some time ago, I investigated a failure in access_tracking_perf_test which
> > shows up in our CI.
> >
> > The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> > MGLRU is enabled, and overall I got the impression that MGLRU is not
> > compatible with idle page tracking.
> >
> > I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> > MMUs participate in MM_WALK' patch series could address this but the test
> > still fails.
> >
> >
> > For the reference the exact problem is:
> >
> > 1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
> >
> > 2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
> >
> > 3. A NUMA autobalance code write protects the guest memory. KVM in response
> >    evicts the SPTE mappings with A/D bit set, and while doing so tells mm
> >    that pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) )
> >    but due to MGLRU the call doesn't clear the idle bit and thus all the traces
> >    of the guest access disappear and the kernel thinks that the page is still idle.
> >
> > I can say that the root cause of this is that folio_mark_accessed doesn't do
> > what it is supposed to do.
> >
> > Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
> > will probably fix this but I don't have enough confidence to say if this is
> > all that is needed to fix this.  If this is the case I can send a patch.
>
> My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
> effectively isn't supported by MGLRU.
>
> [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

Yu's suggestion was to look at the generation numbers themselves, and
that is exactly what my patched access_tracking_perf_test does[2]. :)

So I think to make this work with MGLRU, I'll re-post my
access_tracking_perf_test patch, but if MGLRU is enabled, always use
the MGLRU debugfs instead of using page_idle/bitmap. It needs some
cleanup first though.

[2]: https://lore.kernel.org/kvm/20241105184333.2305744-12-jthoughton@google.com/
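Concretely, "looking at the generation numbers" means reading /sys/kernel/debug/lru_gen, whose
documented format is a `memcg <id> <path>` line, a `node <id>` line, then one
`<gen> <age_in_ms> <nr_anon_pages> <nr_file_pages>` line per generation. A rough, hypothetical
parser (this is not the patched selftest's actual code, just an illustration of the format):

```python
def parse_lru_gen(text: str) -> dict:
    """Map (memcg_id, node_id) -> {gen: (age_ms, nr_anon, nr_file)}."""
    out, memcg, node = {}, None, None
    for line in text.splitlines():
        f = line.split()
        if not f:
            continue
        if f[0] == "memcg":
            memcg = int(f[1])
        elif f[0] == "node":
            node = int(f[1])
            out[(memcg, node)] = {}
        elif f[0].lstrip("-").isdigit():
            # page counts can transiently go negative in the debugfs output
            gen, age_ms, nr_anon, nr_file = (int(x) for x in f[:4])
            out[(memcg, node)][gen] = (age_ms, nr_anon, nr_file)
    return out

sample = (
    "memcg     1 /\n"
    " node     0\n"
    "       695      40552         18      52113\n"
    "       696      35319         17      62460\n"
)
gens = parse_lru_gen(sample)
```

A page that was accessed since the last aging pass shows up in a younger (higher-numbered)
generation, which is what the patched test checks instead of the idle bit.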

> > Any ideas on how to fix all this mess?
>
> The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
> enabled.  In a real-world scenario, if the guest is actually accessing the pages
> that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
> faulted back in.  There's a window where page_idle/bitmap could be read between
> making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
> one of the many flaws of NUMA balancing.
>
> That said, one thing is quite odd.  In the failing case, *half* of the guest pages
> are still idle.  That's quite insane.
>
> Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> different node, and that causes NUMA balancing to go crazy and zap pretty much
> all of guest memory.  If that's what's happening, then a better solution for the
> NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> pin it to a single pCPU?).

+1 to this idea, if this is really what's going on. If NUMA balancing
is only migrating a few pages, the 90% threshold in the test should be
low enough that we tolerate the few pages that were moved.

Or we could just print a warning (instead of failing) if NUMA balancing is enabled.
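That fallback could look something like the sketch below, assuming the standard
/proc/sys/kernel/numa_balancing sysctl; the helper names and warning text are illustrative,
not the selftest's:

```python
def numa_balancing_enabled(path: str = "/proc/sys/kernel/numa_balancing") -> bool:
    """Return True if NUMA autobalancing is on (sysctl value nonzero)."""
    try:
        with open(path) as f:
            return int(f.read().split()[0]) != 0
    except FileNotFoundError:
        return False  # kernel built without CONFIG_NUMA_BALANCING

def check_idle(still_idle: int, pages: int, threshold: float = 0.10) -> None:
    """Warn instead of failing when NUMA balancing may have zapped SPTEs."""
    if still_idle < pages * threshold:
        return
    msg = f"Too many pages still idle ({still_idle} out of {pages})"
    if numa_balancing_enabled():
        print("WARNING: " + msg + " (NUMA balancing is enabled)")
    else:
        raise AssertionError(msg)
```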


* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-19  1:13   ` Sean Christopherson
  2025-02-19 18:56     ` James Houghton
@ 2025-02-25 22:00     ` Maxim Levitsky
  2025-02-26  0:50       ` Sean Christopherson
  1 sibling, 1 reply; 27+ messages in thread
From: Maxim Levitsky @ 2025-02-25 22:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

On Tue, 2025-02-18 at 17:13 -0800, Sean Christopherson wrote:
> On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > > waiting for aging. This contention reduction improves guest performance
> > > and saves a significant amount of Google Cloud's CPU usage, and it has
> > > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> > > 
> > > Please see v8[8] for some performance results using
> > > access_tracking_perf_test patched to use MGLRU.
> > > 
> > > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
> > 
> > Hi, I have a question about this patch series and about the
> > access_tracking_perf_test:
> > 
> > Some time ago, I investigated a failure in access_tracking_perf_test which
> > shows up in our CI.
> > 
> > The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> > MGLRU is enabled, and overall I got the impression that MGLRU is not
> > compatible with idle page tracking.
> > 
> > I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> > MMUs participate in MM_WALK' patch series could address this but the test
> > still fails.
> > 
> > 
> > For the reference the exact problem is:
> > 
> > 1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
> > 
> > 2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
> > 
> > 3. A NUMA autobalance code write protects the guest memory. KVM in response
> >    evicts the SPTE mappings with A/D bit set, and while doing so tells mm
> >    that pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) )
> >    but due to MGLRU the call doesn't clear the idle bit and thus all the traces
> >    of the guest access disappear and the kernel thinks that the page is still idle.
> > 
> > I can say that the root cause of this is that folio_mark_accessed doesn't do
> > what it is supposed to do.
> > 
> > Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
> > will probably fix this but I don't have enough confidence to say if this is
> > all that is needed to fix this.  If this is the case I can send a patch.
> 
> My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
> effectively isn't supported by MGLRU.
> 
> [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

Hi,

Reading this mail makes me think that the page idle interface isn't really used anymore.

Maybe we should redo access_tracking_perf_test to use only the MGLRU-specific
interfaces/mode, and remove its classical page_idle mode altogether?

The point I am trying to get across is that access_tracking_perf_test's main
purpose is currently to test that page_idle works with secondary paging, and
the fact is that it doesn't work well for more than one reason:

The mere fact that we don't flush the TLB already necessitated hacks like the
90% check, which for example doesn't work when nested, so another hack was
needed to skip the check completely when a hypervisor is detected, and so on.

And now, as of 6.13, we don't propagate the accessed bit at all when KVM zaps
the SPTE, which can happen, at least in theory, for reasons other than NUMA
balancing.


Tomorrow something else will cause KVM to zap the SPTEs, and the test will
fail again, and again...

What do you think?


> 
> > This patch makes the test pass (but only on 6.12 kernel and below, see below):
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 59f30a981c6f..2013e1f4d572 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
> >  {
> >         if (lru_gen_enabled()) {
> >                 folio_inc_refs(folio);
> > -               return;
> > +               goto clear_idle_bit;
> >         }
> >  
> >         if (!folio_test_referenced(folio)) {
> > @@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
> >                 folio_clear_referenced(folio);
> >                 workingset_activation(folio);
> >         }
> > +clear_idle_bit:
> >         if (folio_test_idle(folio))
> >                 folio_clear_idle(folio);
> >  }
> > 
> > 
> > To always reproduce this, it is best to use a patch to make the test run in a
> > loop, like below (although the test fails without this as well).
> 
> ..
> 
> > With the above patch applied, you will notice after 4-6 iterations that the
> > number of still idle pages soars:
> > 
> > Populating memory             : 0.798882357s
> 
> ...
> 
> > vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
> > Mark memory idle              : 2.711946690s
> > Writing to idle memory        : 0.302882502s
> > 
> > ...
> > 
> > (*) Turns out that since kernel 6.13, the code that set the accessed bit in the
> > primary paging structures when the secondary SPTE was zapped was *removed*. I
> > bisected this to commit:
> > 
> > 66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
> > 
> > So now the access_tracking_test is broken regardless of MGLRU.
> 
> Just to confirm, do you see failures on 6.13 with MGLRU disabled?  

Yes. The test always fails.

> 
> > Any ideas on how to fix all this mess?
> 
> The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
> enabled.  In a real-world scenario, if the guest is actually accessing the pages
> that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
> faulted back in.  There's a window where page_idle/bitmap could be read between
> making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
> one of the many flaws of NUMA balancing.
> 
> That said, one thing is quite odd.  In the failing case, *half* of the guest pages
> are still idle.  That's quite insane.
> 
> Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> different node, and that causes NUMA balancing to go crazy and zap pretty much
> all of guest memory.  If that's what's happening, then a better solution for the
> NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> pin it to a single pCPU?).

Nope. I pinned the main thread to CPU 0 and the VM thread to CPU 1, and the problem persists.
On 6.13, the only way to make the test consistently pass is to disable NUMA balancing.


Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-25 22:00     ` Maxim Levitsky
@ 2025-02-26  0:50       ` Sean Christopherson
  2025-02-26 18:39         ` Maxim Levitsky
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2025-02-26  0:50 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: James Houghton, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-18 at 17:13 -0800, Sean Christopherson wrote:
> > My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
> > effectively isn't supported by MGLRU.
> > 
> > [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com
> 
> Hi,
> 
> Reading this mail makes me think that the page idle interface isn't really
> used anymore.

I'm sure it's still used in production somewhere.  And even if it's being phased
out in favor of MGLRU, it's still super useful for testing purposes, because it
gives userspace much more direct control over aging.

> Maybe we should redo the access_tracking_perf_test to only use the MGLRU
> specific interfaces/mode, and remove its classical page_idle mode altogether?

I don't want to take a hard dependency on MGLRU (unless page_idle gets fully
deprecated/removed by the kernel), and I also don't think page_idle is the main
problem with the test.
   
> The point I am trying to get across is that currently
> access_tracking_perf_test's main purpose is to test that page_idle works with
> secondary paging, and the fact is that it doesn't work well for more than
> one reason:

The primary purpose of the test is to measure performance.  Asserting that 90%+
of the pages were dirtied is a sanity check, not an outright goal.

> The mere fact that we don't flush the TLB already necessitated hacks like the 90%
> check, which for example doesn't work nested, so another hack was needed to
> skip the check completely when a hypervisor is detected, etc., etc.

100% agreed here.

> And now, as of 6.13, we don't propagate the accessed bit at all when KVM zaps the
> SPTE, which can happen, at least in theory, for reasons other than NUMA balancing.
> 
> Tomorrow there will be something else that will cause KVM to zap the SPTEs,
> and the test will fail again, and again...
> 
> What do you think?

What if we make the assertion user controllable?  I.e. let the user opt-out (or
off-by-default and opt-in) via command line?  We did something similar for the
rseq test, because the test would run far fewer iterations than expected if the
vCPU task was migrated to CPU(s) in deep sleep states.

	TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
		    "Only performed %d KVM_RUNs, task stalled too much?\n\n"
		    "  Try disabling deep sleep states to reduce CPU wakeup latency,\n"
		    "  e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
		    "  or run with -u to disable this sanity check.", i);

This is quite similar, because as you say, it's impractical for the test to account
for every possible environmental quirk.

> > Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > all of guest memory.  If that's what's happening, then a better solution for the
> > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > pin it to a single pCPU?).
> 
> Nope. I pinned main thread to  CPU 0 and VM thread to  CPU 1 and the problem
> persists.  On 6.13, the only way to make the test consistently work is to
> disable NUMA balancing.

Well that's odd.  While I'm quite curious as to what's happening, my stance is
that enabling NUMA balancing with KVM is a terrible idea, so my vote is to sweep
it under the rug and let the user disable the sanity check.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-26  0:50       ` Sean Christopherson
@ 2025-02-26 18:39         ` Maxim Levitsky
  2025-02-27  0:51           ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Maxim Levitsky @ 2025-02-26 18:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-18 at 17:13 -0800, Sean Christopherson wrote:
> > > My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
> > > effectively isn't supported by MGLRU.
> > > 
> > > [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com
> > 
> > Hi,
> > 
> > Reading this mail makes me think that the page idle interface isn't really
> > used anymore.
> 
> I'm sure it's still used in production somewhere.  And even if it's being phased
> out in favor of MGLRU, it's still super useful for testing purposes, because it
> gives userspace much more direct control over aging.

I also think that page_idle is used somewhere in production, and it probably works
more or less correctly with regular non-VM processes, although I have no idea how well it works
when MGLRU is enabled.

My point was that using page_idle to track guest memory is probably not done in
practice, because it doesn't work that well and nobody seems to complain.
However, I'm not asking for it to be removed, although a deprecation note might
be worthwhile if really nobody uses it.

> 
> > Maybe we should redo the access_tracking_perf_test to only use the MGLRU
> > specific interfaces/mode, and remove its classical page_idle mode altogether?
> 
> I don't want to take a hard dependency on MGLRU (unless page_idle gets fully
> deprecated/removed by the kernel), and I also don't think page_idle is the main
> problem with the test.
>    
> > The point I am trying to get across is that currently
> > access_tracking_perf_test's main purpose is to test that page_idle works with
> > secondary paging, and the fact is that it doesn't work well for more than
> > one reason:
> 
> The primary purpose of the test is to measure performance.  Asserting that 90%+
> pages were dirtied is a sanity check, not an outright goal.

From my POV, a performance test can't really be a selftest unless it actually fails
when it detects unusually low performance.

Otherwise, who is going to be alarmed when a regression happens and
things actually get slower?

> 
> > The mere fact that we don't flush the TLB already necessitated hacks like the 90%
> > check, which for example doesn't work nested, so another hack was needed to
> > skip the check completely when a hypervisor is detected, etc., etc.
> 
> 100% agreed here.
> 
> > And now, as of 6.13, we don't propagate the accessed bit at all when KVM zaps the
> > SPTE, which can happen, at least in theory, for reasons other than NUMA balancing.
> > 
> > Tomorrow there will be something else that will cause KVM to zap the SPTEs,
> > and the test will fail again, and again...
> > 
> > What do you think?
> 
> What if we make the assertion user controllable?  I.e. let the user opt-out (or
> off-by-default and opt-in) via command line?  We did something similar for the
> rseq test, because the test would run far fewer iterations than expected if the
> vCPU task was migrated to CPU(s) in deep sleep states.
> 
> 	TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
> 		    "Only performed %d KVM_RUNs, task stalled too much?\n\n"
> 		    "  Try disabling deep sleep states to reduce CPU wakeup latency,\n"
> 		    "  e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
> 		    "  or run with -u to disable this sanity check.", i);
> 
> This is quite similar, because as you say, it's impractical for the test to account
> for every possible environmental quirk.

No objections in principle, especially if the sanity check is skipped by default,
although this does sort of defeat the purpose of the check.
I guess the check might still be useful for developers.


> 
> > > Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > all of guest memory.  If that's what's happening, then a better solution for the
> > > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > > pin it to a single pCPU?).
> > 
> > Nope. I pinned main thread to  CPU 0 and VM thread to  CPU 1 and the problem
> > persists.  On 6.13, the only way to make the test consistently work is to
> > disable NUMA balancing.
> 
> Well that's odd.  While I'm quite curious as to what's happening, my stance is
> that enabling NUMA balancing with KVM is a terrible idea, so my vote is to sweep
> it under the rug and let the user disable the sanity check.
> 

One thing is for sure: with NUMA balancing off, the test passes reliably (it shows on average
around 100-200 idle pages), and I have run it for a long time.


Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-26 18:39         ` Maxim Levitsky
@ 2025-02-27  0:51           ` Sean Christopherson
  2025-02-27  1:54             ` Maxim Levitsky
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2025-02-27  0:51 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: James Houghton, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

On Wed, Feb 26, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> > On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> > What if we make the assertion user controllable?  I.e. let the user opt-out (or
> > off-by-default and opt-in) via command line?  We did something similar for the
> > rseq test, because the test would run far fewer iterations than expected if the
> > vCPU task was migrated to CPU(s) in deep sleep states.
> > 
> > 	TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
> > 		    "Only performed %d KVM_RUNs, task stalled too much?\n\n"
> > 		    "  Try disabling deep sleep states to reduce CPU wakeup latency,\n"
> > 		    "  e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
> > 		    "  or run with -u to disable this sanity check.", i);
> > 
> > This is quite similar, because as you say, it's impractical for the test to account
> > for every possible environmental quirk.
> 
> No objections in principle, especially if the sanity check is skipped by default,
> although this does sort of defeat the purpose of the check.
> I guess the check might still be useful for developers.

A middle ground would be to enable the check by default if NUMA balancing is off.
We can always revisit the default setting if it turns out there are other problematic
"features".

> > > > Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > > all of guest memory.  If that's what's happening, then a better solution for the
> > > > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > > > pin it to a single pCPU?).
> > > 
> > > Nope. I pinned main thread to  CPU 0 and VM thread to  CPU 1 and the problem
> > > persists.  On 6.13, the only way to make the test consistently work is to
> > > disable NUMA balancing.
> > 
> > Well that's odd.  While I'm quite curious as to what's happening,

Gah, chatting about this offline jogged my memory.  NUMA balancing doesn't zap
(mark PROT_NONE/PROT_NUMA) PTEs for pages the kernel thinks are being accessed
remotely, it zaps PTEs to see if they are being accessed remotely.  So yeah,
whenever NUMA balancing kicks in, the guest will see a large amount of its memory
get re-faulted.

Which is why it's such a terrible feature to pair with KVM, at least as-is.  NUMA
balancing is predicated on inducing and resolving the #PF being relatively cheap,
but that doesn't hold true for secondary MMUs due to the coarse nature of mmu_notifiers.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
  2025-02-27  0:51           ` Sean Christopherson
@ 2025-02-27  1:54             ` Maxim Levitsky
  0 siblings, 0 replies; 27+ messages in thread
From: Maxim Levitsky @ 2025-02-27  1:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

On Wed, 2025-02-26 at 16:51 -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> > > On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> > > What if we make the assertion user controllable?  I.e. let the user opt-out (or
> > > off-by-default and opt-in) via command line?  We did something similar for the
> > > rseq test, because the test would run far fewer iterations than expected if the
> > > vCPU task was migrated to CPU(s) in deep sleep states.
> > > 
> > > 	TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
> > > 		    "Only performed %d KVM_RUNs, task stalled too much?\n\n"
> > > 		    "  Try disabling deep sleep states to reduce CPU wakeup latency,\n"
> > > 		    "  e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
> > > 		    "  or run with -u to disable this sanity check.", i);
> > > 
> > > This is quite similar, because as you say, it's impractical for the test to account
> > > for every possible environmental quirk.
> > 
> > No objections in principle, especially if the sanity check is skipped by default,
> > although this does sort of defeat the purpose of the check.
> > I guess the check might still be useful for developers.
> 
> A middle ground would be to enable the check by default if NUMA balancing is off.
> We can always revisit the default setting if it turns out there are other problematic
> "features".

That works for me.
I can send a patch for this then.

> 
> > > > > Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > > > all of guest memory.  If that's what's happening, then a better solution for the
> > > > > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > > > > pin it to a single pCPU?).
> > > > 
> > > > Nope. I pinned main thread to  CPU 0 and VM thread to  CPU 1 and the problem
> > > > persists.  On 6.13, the only way to make the test consistently work is to
> > > > disable NUMA balancing.
> > > 
> > > Well that's odd.  While I'm quite curious as to what's happening,
> 
> Gah, chatting about this offline jogged my memory.  NUMA balancing doesn't zap
> (mark PROT_NONE/PROT_NUMA) PTEs for pages the kernel thinks are being accessed
> remotely, it zaps PTEs to see if they are being accessed remotely.  So yeah,
> whenever NUMA balancing kicks in, the guest will see a large amount of its memory
> get re-faulted.
> 
> Which is why it's such a terrible feature to pair with KVM, at least as-is.  NUMA
> balancing is predicated on inducing and resolving the #PF being relatively cheap,
> but that doesn't hold true for secondary MMUs due to the coarse nature of mmu_notifiers.
> 

I also think so, not to mention that VM exits aren't that cheap either, and the general
direction is to avoid them as much as possible.  Here this feature pretty much
yanks the memory out from under the guest every two seconds or so, causing lots of VM exits.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-02-27  1:54 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
2025-02-04  0:40 ` [PATCH v9 01/11] KVM: Rename kvm_handle_hva_range() James Houghton
2025-02-04  0:40 ` [PATCH v9 02/11] KVM: Add lockless memslot walk to KVM James Houghton
2025-02-14 15:26   ` Sean Christopherson
2025-02-14 19:27     ` James Houghton
2025-02-04  0:40 ` [PATCH v9 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
2025-02-04  0:40 ` [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn() James Houghton
2025-02-12 22:07   ` Sean Christopherson
2025-02-13  0:25     ` James Houghton
2025-02-04  0:40 ` [PATCH v9 05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write() James Houghton
2025-02-12 22:09   ` Sean Christopherson
2025-02-13  0:26     ` James Houghton
2025-02-04  0:40 ` [PATCH v9 06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young James Houghton
2025-02-04  0:40 ` [PATCH v9 07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
2025-02-04  0:40 ` [PATCH v9 08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock James Houghton
2025-02-04  0:40 ` [PATCH v9 09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
2025-02-04  0:40 ` [PATCH v9 10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs James Houghton
2025-02-04  0:40 ` [PATCH v9 11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns James Houghton
2025-02-15  0:50 ` [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly Sean Christopherson
2025-02-18 19:29 ` Maxim Levitsky
2025-02-19  1:13   ` Sean Christopherson
2025-02-19 18:56     ` James Houghton
2025-02-25 22:00     ` Maxim Levitsky
2025-02-26  0:50       ` Sean Christopherson
2025-02-26 18:39         ` Maxim Levitsky
2025-02-27  0:51           ` Sean Christopherson
2025-02-27  1:54             ` Maxim Levitsky
