[PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly
@ 2024-11-05 18:43 James Houghton
  2024-11-05 18:43 ` [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions James Houghton
                   ` (11 more replies)
  0 siblings, 12 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Andrew has queued patches to make MGLRU consult KVM when doing aging[8].
Now, make aging lockless for the shadow MMU and the TDP MMU. This allows
us to reduce the time/CPU it takes to do aging and the performance
impact on the vCPUs while we are aging.

The final patch in this series modifies access_tracking_stress_test to
age using MGLRU. There is a mode (-p) where it will age while the vCPUs
are faulting memory in. Here are some results with that mode:

TDP MMU disabled, no optimization:
$ ./access_tracking_perf_test -l -r /dev/cgroup/memory -v 64 -p
lru_gen avg pass duration     : 15.954388970s, (passes:1, total:15.954388970s)

TDP MMU disabled, lockless:
$ ./access_tracking_perf_test -l -r /dev/cgroup/memory -v 64 -p
lru_gen avg pass duration     : 0.527091929s, (passes:35, total:18.448217547s)

The vCPU time difference in these runs with the shadow MMU vary quite a
lot, and there doesn't seem to be a notable improvement with this
particular test.

There are some more results with the TDP MMU from v4[4].

I have also tested with Sean's mmu_stress_test changes[1].

Note: the new MGLRU mode for access_tracking_perf_test will verify that
aging is functional. It will only be functional with the MGLRU patches
that have been sent to Andrew separately[8].

=== Previous Versions ===

Since v7[7]:
 - Dropped MGLRU changes (Andrew has queued them[8]).
 - Dropped DAMON cleanup (Andrew has queued this[9]).
 - Dropped MMU notifier changes completely.
 - Made shadow MMU aging *always* lockless, not just lockless when the
   now-removed "fast_only" clear notifier was used.
 - Given that the MGLRU changes no longer introduce a new MGLRU
   capability, drop the new capability check from the selftest.
 - Rebased on top of latest kvm-x86/next, including the x86 mmu changes
   for marking pages as dirty.

Since v6[6]:
 - Rebased on top of kvm-x86/next and Sean's lockless rmap walking
   changes.
 - Removed HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY (thanks DavidM).
 - Split up kvm_age_gfn() / kvm_test_age_gfn() optimizations (thanks
   DavidM and Sean).
 - Improved new MMU notifier documentation (thanks DavidH).
 - Dropped arm64 locking change.
 - No longer retry for CAS failure in TDP MMU non-A/D case (thanks
   Sean).
 - Added some R-bys and A-bys.

Since v5[5]:
 - Reworked test_clear_young_fast_only() into a new parameter for the
   existing notifiers (thanks Sean).
 - Added mmu_notifier.has_fast_aging to tell mm if calling fast-only
   notifiers should be done.
 - Added mm_has_fast_young_notifiers() to inform users if calling
   fast-only notifier helpers is worthwhile (for look-around to use).
 - Changed MGLRU to invoke a single notifier instead of two when
   aging and doing look-around (thanks Yu).
 - For KVM/x86, check indirect_shadow_pages > 0 instead of
   kvm_memslots_have_rmaps() when collecting age information
   (thanks Sean).
 - For KVM/arm, some fixes from Oliver.
 - Small fixes to access_tracking_perf_test.
 - Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().

Since v4[4]:
 - Removed Kconfig that controlled when aging was enabled. Aging will
   be done whenever the architecture supports it (thanks Yu).
 - Added a new MMU notifier, test_clear_young_fast_only(), specifically
   for MGLRU to use.
 - Add kvm_fast_{test_,}age_gfn, implemented by x86.
 - Fix locking for clear_flush_young().
 - Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
   (thanks Sean).
 - Fix WARN_ON and other cleanup for the arm64 locking changes
   (thanks Oliver).

Since v3[3]:
 - Vastly simplified the series (thanks David). Removed mmu notifier
   batching logic entirely.
 - Cleaned up how locking is done for mmu_notifier_test/clear_young
   (thanks David).
 - Look-around is now only done when there are no secondary MMUs
   subscribed to MMU notifiers.
 - CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
 - Fixed the lockless implementation of kvm_{test,}age_gfn for x86
   (thanks David).
 - Added MGLRU functional and performance tests to
   access_tracking_perf_test (thanks Axel).
 - In v3, an mm would be completely ignored (for aging) if there was a
   secondary MMU but support for secondary MMU walking was missing. Now,
   missing secondary MMU walking support simply skips the notifier
   calls (except for eviction).
 - Added a sanity check for that range->lockless and range->on_lock are
   never both provided for the memslot walk.

For the changes since v2[2], see v3.

Based on latest kvm-x86/next.

[1]: https://lore.kernel.org/kvm/20241009154953.1073471-1-seanjc@google.com/
[2]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/
[3]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
[5]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
[6]: https://lore.kernel.org/linux-mm/20240724011037.3671523-1-jthoughton@google.com/
[7]: https://lore.kernel.org/kvm/20240926013506.860253-1-jthoughton@google.com/
[8]: https://lore.kernel.org/linux-mm/20241019012940.3656292-1-jthoughton@google.com/
[9]: https://lore.kernel.org/linux-mm/20241021160212.9935-1-jthoughton@google.com/

James Houghton (7):
  KVM: Remove kvm_handle_hva_range helper functions
  KVM: Add lockless memslot walk to KVM
  KVM: x86/mmu: Factor out spte atomic bit clearing routine
  KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
  KVM: x86/mmu: Only check gfn age in shadow MMU if
    indirect_shadow_pages > 0
  KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

Sean Christopherson (4):
  KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
    mmu_lock
  KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
    mmu_lock
  KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
    gfns

 arch/x86/include/asm/kvm_host.h               |   4 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        | 338 ++++++++++-----
 arch/x86/kvm/mmu/tdp_iter.h                   |  27 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  23 +-
 include/linux/kvm_host.h                      |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 366 ++++++++++++++--
 .../selftests/kvm/include/lru_gen_util.h      |  55 +++
 .../testing/selftests/kvm/lib/lru_gen_util.c  | 391 ++++++++++++++++++
 virt/kvm/Kconfig                              |   2 +
 virt/kvm/kvm_main.c                           | 102 +++--
 12 files changed, 1124 insertions(+), 187 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c


base-commit: a27e0515592ec9ca28e0d027f42568c47b314784
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2025-01-10 22:15   ` Sean Christopherson
  2024-11-05 18:43 ` [PATCH v8 02/11] KVM: Add lockless memslot walk to KVM James Houghton
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

kvm_handle_hva_range is only used by the young notifiers. In a later
patch, it will be even further tied to the young notifiers. Instead of
renaming kvm_handle_hva_range to something like
kvm_handle_hva_range_young, simply remove kvm_handle_hva_range. This
seems slightly more readable, though there is slightly more code
duplication.

Finally, rename __kvm_handle_hva_range to kvm_handle_hva_range, now that
the name is available.

Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 virt/kvm/kvm_main.c | 74 +++++++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 36 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 27186b06518a..8b234a9acdb3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -551,8 +551,8 @@ static void kvm_null_fn(void)
 	     node;							     \
 	     node = interval_tree_iter_next(node, start, last))	     \
 
-static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
-							   const struct kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
+							 const struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_mmu_notifier_return r = {
 		.ret = false,
@@ -628,33 +628,6 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	return r;
 }
 
-static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
-						unsigned long start,
-						unsigned long end,
-						gfn_handler_t handler,
-						bool flush_on_ret)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_mmu_notifier_range range = {
-		.start		= start,
-		.end		= end,
-		.handler	= handler,
-		.on_lock	= (void *)kvm_null_fn,
-		.flush_on_ret	= flush_on_ret,
-		.may_block	= false,
-	};
-
-	return __kvm_handle_hva_range(kvm, &range).ret;
-}
-
-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
-							 unsigned long start,
-							 unsigned long end,
-							 gfn_handler_t handler)
-{
-	return kvm_handle_hva_range(mn, start, end, handler, false);
-}
-
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
 	lockdep_assert_held_write(&kvm->mmu_lock);
@@ -747,7 +720,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * that guest memory has been reclaimed.  This needs to be done *after*
 	 * dropping mmu_lock, as x86's reclaim path is slooooow.
 	 */
-	if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
+	if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
 		kvm_arch_guest_memory_reclaimed(kvm);
 
 	return 0;
@@ -793,7 +766,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	};
 	bool wake;
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	kvm_handle_hva_range(kvm, &hva_range);
 
 	/* Pairs with the increment in range_start(). */
 	spin_lock(&kvm->mn_invalidate_lock);
@@ -815,10 +788,20 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 					      unsigned long start,
 					      unsigned long end)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start		= start,
+		.end		= end,
+		.handler	= kvm_age_gfn,
+		.on_lock	= (void *)kvm_null_fn,
+		.flush_on_ret	=
+			!IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG),
+		.may_block	= false,
+	};
+
 	trace_kvm_age_hva(start, end);
 
-	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn,
-				    !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG));
+	return kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -826,6 +809,16 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					unsigned long start,
 					unsigned long end)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start		= start,
+		.end		= end,
+		.handler	= kvm_age_gfn,
+		.on_lock	= (void *)kvm_null_fn,
+		.flush_on_ret	= false,
+		.may_block	= false,
+	};
+
 	trace_kvm_age_hva(start, end);
 
 	/*
@@ -841,17 +834,26 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
-	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+	return kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long address)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start		= address,
+		.end		= address + 1,
+		.handler	= kvm_test_age_gfn,
+		.on_lock	= (void *)kvm_null_fn,
+		.flush_on_ret	= false,
+		.may_block	= false,
+	};
+
 	trace_kvm_test_age_hva(address);
 
-	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
-					     kvm_test_age_gfn);
+	return kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions
  2024-11-05 18:43 ` [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions James Houghton
@ 2025-01-10 22:15   ` Sean Christopherson
  2025-01-27 19:50     ` James Houghton
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2025-01-10 22:15 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Nov 05, 2024, James Houghton wrote:
> kvm_handle_hva_range is only used by the young notifiers. In a later
> patch, it will be even further tied to the young notifiers. Instead of
> renaming kvm_handle_hva_range to something like

When referencing functions, include parantheses so its super obvious that the
symbol is a function(), e.g. kvm_handle_hva_range(), kvm_handle_hva_range_young(),
etc.

> kvm_handle_hva_range_young, simply remove kvm_handle_hva_range. This
> seems slightly more readable, 

I disagree, quite strongly in fact.  The amount of duplication makes it harder
to see the differences between the three aging flow, and the fewer instances of
this pattern:

	return kvm_handle_hva_range(kvm, &range).ret;

the better.  I added the tuple return as a way to avoid an out-param (which I
still think is a good tradeoff), but there's definitely a cost to it.

> though there is slightly more code duplication.

Heh, you have a different definition of "slightly".  The total lines of code may
be close to a wash, but at the end of the series there's ~10 lines of code that
is nearly identical in three different places.

My vote is for this:

---
 virt/kvm/kvm_main.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae231..bf4670e9fcc6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -551,8 +551,8 @@ static void kvm_null_fn(void)
 	     node;							     \
 	     node = interval_tree_iter_next(node, start, last))	     \
 
-static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
-							   const struct kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
+							 const struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_mmu_notifier_return r = {
 		.ret = false,
@@ -628,7 +628,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	return r;
 }
 
-static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
+static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
 						unsigned long end,
 						gfn_handler_t handler,
@@ -647,10 +647,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 	return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
-							 unsigned long start,
-							 unsigned long end,
-							 gfn_handler_t handler)
+static __always_inline int kvm_age_hva_range_no_flush(struct mmu_notifier *mn,
+						      unsigned long start,
+						      unsigned long end,
+						      gfn_handler_t handler)
 {
 	return kvm_handle_hva_range(mn, start, end, handler, false);
 }
@@ -747,7 +747,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * that guest memory has been reclaimed.  This needs to be done *after*
 	 * dropping mmu_lock, as x86's reclaim path is slooooow.
 	 */
-	if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
+	if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
 		kvm_arch_guest_memory_reclaimed(kvm);
 
 	return 0;
@@ -793,7 +793,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	};
 	bool wake;
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	kvm_handle_hva_range(kvm, &hva_range);
 
 	/* Pairs with the increment in range_start(). */
 	spin_lock(&kvm->mn_invalidate_lock);
@@ -817,8 +817,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 {
 	trace_kvm_age_hva(start, end);
 
-	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn,
-				    !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG));
+	return kvm_age_hva_range(mn, start, end, kvm_age_gfn,
+				 !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG));
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -841,7 +841,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
-	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+	return kvm_age_hva_range_no_flush(mn, start, end, kvm_age_gfn);
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -850,8 +850,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 {
 	trace_kvm_test_age_hva(address);
 
-	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
-					     kvm_test_age_gfn);
+	return kvm_age_hva_range_no_flush(mn, address, address + 1, kvm_test_age_gfn);
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,

base-commit: 2d5faa6a8402435d6332e8e8f3c3f18cca382d83
-- 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions
  2025-01-10 22:15   ` Sean Christopherson
@ 2025-01-27 19:50     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 19:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Jan 10, 2025 at 2:15 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > kvm_handle_hva_range is only used by the young notifiers. In a later
> > patch, it will be even further tied to the young notifiers. Instead of
> > renaming kvm_handle_hva_range to something like
>
> When referencing functions, include parantheses so its super obvious that the
> symbol is a function(), e.g. kvm_handle_hva_range(), kvm_handle_hva_range_young(),
> etc.

Thanks Sean, I think I've fixed up all the cases now.

>
> > kvm_handle_hva_range_young, simply remove kvm_handle_hva_range. This
> > seems slightly more readable,
>
> I disagree, quite strongly in fact.  The amount of duplication makes it harder
> to see the differences between the three aging flow, and the fewer instances of
> this pattern:
>
>         return kvm_handle_hva_range(kvm, &range).ret;
>
> the better.  I added the tuple return as a way to avoid an out-param (which I
> still think is a good tradeoff), but there's definitely a cost to it.
>
> > though there is slightly more code duplication.
>
> Heh, you have a different definition of "slightly".  The total lines of code may
> be close to a wash, but at the end of the series there's ~10 lines of code that
> is nearly identical in three different places.
>
> My vote is for this:

I applied this patch verbatim as a replacement for the original one.

Since [1], the "refactor" in this original patch makes much less sense. Thanks!

[1]: commit 28f8b61a69b5c ("KVM: Allow arch code to elide TLB flushes
when aging a young page")


> ---
>  virt/kvm/kvm_main.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index de2c11dae231..bf4670e9fcc6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -551,8 +551,8 @@ static void kvm_null_fn(void)
>              node;                                                           \
>              node = interval_tree_iter_next(node, start, last))      \
>
> -static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> -                                                          const struct kvm_mmu_notifier_range *range)
> +static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> +                                                        const struct kvm_mmu_notifier_range *range)
>  {
>         struct kvm_mmu_notifier_return r = {
>                 .ret = false,
> @@ -628,7 +628,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>         return r;
>  }
>
> -static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> +static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
>                                                 unsigned long start,
>                                                 unsigned long end,
>                                                 gfn_handler_t handler,
> @@ -647,10 +647,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>         return __kvm_handle_hva_range(kvm, &range).ret;
>  }
>
> -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> -                                                        unsigned long start,
> -                                                        unsigned long end,
> -                                                        gfn_handler_t handler)
> +static __always_inline int kvm_age_hva_range_no_flush(struct mmu_notifier *mn,
> +                                                     unsigned long start,
> +                                                     unsigned long end,
> +                                                     gfn_handler_t handler)
>  {
>         return kvm_handle_hva_range(mn, start, end, handler, false);
>  }
> @@ -747,7 +747,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>          * that guest memory has been reclaimed.  This needs to be done *after*
>          * dropping mmu_lock, as x86's reclaim path is slooooow.
>          */
> -       if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
> +       if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
>                 kvm_arch_guest_memory_reclaimed(kvm);
>
>         return 0;
> @@ -793,7 +793,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>         };
>         bool wake;
>
> -       __kvm_handle_hva_range(kvm, &hva_range);
> +       kvm_handle_hva_range(kvm, &hva_range);
>
>         /* Pairs with the increment in range_start(). */
>         spin_lock(&kvm->mn_invalidate_lock);
> @@ -817,8 +817,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
>  {
>         trace_kvm_age_hva(start, end);
>
> -       return kvm_handle_hva_range(mn, start, end, kvm_age_gfn,
> -                                   !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG));
> +       return kvm_age_hva_range(mn, start, end, kvm_age_gfn,
> +                                !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG));
>  }
>
>  static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> @@ -841,7 +841,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>          * cadence. If we find this inaccurate, we might come up with a
>          * more sophisticated heuristic later.
>          */
> -       return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> +       return kvm_age_hva_range_no_flush(mn, start, end, kvm_age_gfn);
>  }
>
>  static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> @@ -850,8 +850,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
>  {
>         trace_kvm_test_age_hva(address);
>
> -       return kvm_handle_hva_range_no_flush(mn, address, address + 1,
> -                                            kvm_test_age_gfn);
> +       return kvm_age_hva_range_no_flush(mn, address, address + 1, kvm_test_age_gfn);
>  }
>
>  static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
>
> base-commit: 2d5faa6a8402435d6332e8e8f3c3f18cca382d83
> --

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v8 02/11] KVM: Add lockless memslot walk to KVM
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
  2024-11-05 18:43 ` [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2025-01-10 22:26   ` Sean Christopherson
  2024-11-05 18:43 ` [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Provide flexibility to the architecture to synchronize as optimally as
they can instead of always taking the MMU lock for writing.

Architectures that do their own locking must select
CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.

The immediate application is to allow architectures to implement the
test/clear_young MMU notifiers more cheaply.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/Kconfig         |  2 ++
 virt/kvm/kvm_main.c      | 28 +++++++++++++++++++++-------
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 18a1672ffcbf..ab0318dbb8bd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,7 @@ struct kvm_gfn_range {
 	gfn_t end;
 	union kvm_mmu_notifier_arg arg;
 	bool may_block;
+	bool lockless;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..b50e4e629ac9 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -102,6 +102,8 @@ config KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
        depends on KVM_GENERIC_MMU_NOTIFIER
+
+config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
        bool
 
 config KVM_GENERIC_MEMORY_ATTRIBUTES
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b234a9acdb3..218edf037917 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -517,6 +517,7 @@ struct kvm_mmu_notifier_range {
 	on_lock_fn_t on_lock;
 	bool flush_on_ret;
 	bool may_block;
+	bool lockless;
 };
 
 /*
@@ -571,6 +572,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			 IS_KVM_NULL_FN(range->handler)))
 		return r;
 
+	/* on_lock will never be called for lockless walks */
+	if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
+		return r;
+
 	idx = srcu_read_lock(&kvm->srcu);
 
 	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
@@ -602,15 +607,18 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
+			gfn_range.lockless = range->lockless;
 
 			if (!r.found_memslot) {
 				r.found_memslot = true;
-				KVM_MMU_LOCK(kvm);
-				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm);
-
-				if (IS_KVM_NULL_FN(range->handler))
-					goto mmu_unlock;
+				if (!range->lockless) {
+					KVM_MMU_LOCK(kvm);
+					if (!IS_KVM_NULL_FN(range->on_lock))
+						range->on_lock(kvm);
+
+					if (IS_KVM_NULL_FN(range->handler))
+						goto mmu_unlock;
+				}
 			}
 			r.ret |= range->handler(kvm, &gfn_range);
 		}
@@ -620,7 +628,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 
 mmu_unlock:
-	if (r.found_memslot)
+	if (r.found_memslot && !range->lockless)
 		KVM_MMU_UNLOCK(kvm);
 
 	srcu_read_unlock(&kvm->srcu, idx);
@@ -797,6 +805,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 		.flush_on_ret	=
 			!IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG),
 		.may_block	= false,
+		.lockless	=
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
 	};
 
 	trace_kvm_age_hva(start, end);
@@ -817,6 +827,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= false,
+		.lockless	=
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
 	};
 
 	trace_kvm_age_hva(start, end);
@@ -849,6 +861,8 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= false,
+		.lockless	=
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
 	};
 
 	trace_kvm_test_age_hva(address);
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 02/11] KVM: Add lockless memslot walk to KVM
  2024-11-05 18:43 ` [PATCH v8 02/11] KVM: Add lockless memslot walk to KVM James Houghton
@ 2025-01-10 22:26   ` Sean Christopherson
  2025-01-27 19:51     ` James Houghton
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2025-01-10 22:26 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Nov 05, 2024, James Houghton wrote:
> Provide flexibility to the architecture to synchronize as optimally as

Similar to my nit on "locking" below, "synchronize" is somewhat misleading.  There's
no requirement for architectures to synchronize during aging.  There is all but
guaranteed to be some form of "synchronization", e.g. to prevent walking freed
page tables, but the aging itself never synchronizes, and protecting a walk by
disabling IRQs (as x86's shadow MMU does in some flows) only "synchronizes" in a
loose sense of the word.

> they can instead of always taking the MMU lock for writing.
> 
> Architectures that do their own locking must select
> CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.

This is backwards and could be misleading, and for the TDP MMU outright wrong.
If some hypothetical architecture took _additional_ locks, then it can do so
without needing to select CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.

What you want to say is that architectures that select
CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS are responsible for ensuring correctness.
And it's specifically all about correctness.  Common KVM doesn't care if the
arch does it's own locking, e.g. taking mmu_lock for read, or has some completely
lock-free scheme for aging.

> The immediate application is to allow architectures to implement the

"immediate application" is pretty redundant.  There's really only one reason to
not take mmu_lock in this path.

> test/clear_young MMU notifiers more cheaply.

IMO, "more cheaply" is vague, and doesn't add much beyond the opening sentence
about "synchronize as optimally as they can".  I would just delete this last
sentence.

> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> Reviewed-by: David Matlack <dmatlack@google.com>
> ---
> @@ -797,6 +805,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
>  		.flush_on_ret	=
>  			!IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG),
>  		.may_block	= false,
> +		.lockless	=
> +			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),

		.lockess	= IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),

Random thought, maybe CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS or
CONFIG_KVM_MMU_NOTIFIER_AGE_LOCKLESS instead of "YOUNG"?

>  	};
>  
>  	trace_kvm_age_hva(start, end);
> @@ -817,6 +827,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>  		.on_lock	= (void *)kvm_null_fn,
>  		.flush_on_ret	= false,
>  		.may_block	= false,
> +		.lockless	=
> +			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
>  	};
>  
>  	trace_kvm_age_hva(start, end);
> @@ -849,6 +861,8 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
>  		.on_lock	= (void *)kvm_null_fn,
>  		.flush_on_ret	= false,
>  		.may_block	= false,
> +		.lockless	=
> +			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
>  	};
>  
>  	trace_kvm_test_age_hva(address);
> -- 
> 2.47.0.199.ga7371fff76-goog
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 02/11] KVM: Add lockless memslot walk to KVM
  2025-01-10 22:26   ` Sean Christopherson
@ 2025-01-27 19:51     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 19:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Jan 10, 2025 at 2:26 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > Provide flexibility to the architecture to synchronize as optimally as
>
> Similar to my nit on "locking" below, "synchronize" is somewhat misleading.  There's
> no requirement for architectures to synchronize during aging.  There is all but
> guaranteed to be some form of "synchronization", e.g. to prevent walking freed
> page tables, but the aging itself never synchronizes, and protecting a walk by
> disabling IRQs (as x86's shadow MMU does in some flows) only "synchronizes" in a
> loose sense of the word.
>
> > they can instead of always taking the MMU lock for writing.
> >
> > Architectures that do their own locking must select
> > CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.
>
> This is backwards and could be misleading, and for the TDP MMU outright wrong.
> If some hypothetical architecture took _additional_ locks, then it can do so
> without needing to select CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.
>
> What you want to say is that architectures that select
> CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS are responsible for ensuring correctness.
> And it's specifically all about correctness.  Common KVM doesn't care if the
> arch does it's own locking, e.g. taking mmu_lock for read, or has some completely
> lock-free scheme for aging.
>
> > The immediate application is to allow architectures to implement the
>
> "immediate application" is pretty redundant.  There's really only one reason to
> not take mmu_lock in this path.
>
> > test/clear_young MMU notifiers more cheaply.
>
> IMO, "more cheaply" is vague, and doesn't add much beyond the opening sentence
> about "synchronize as optimally as they can".  I would just delete this last
> sentence.

Thanks Sean. I've reworded the changelog like this:

    It is possible to correctly do aging without taking the KVM MMU lock;
    this option allows such architectures to do so. Architectures that
    select CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS are responsible for
    ensuring correctness.

>
> > Suggested-by: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > Reviewed-by: David Matlack <dmatlack@google.com>
> > ---
> > @@ -797,6 +805,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> >               .flush_on_ret   =
> >                       !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG),
> >               .may_block      = false,
> > +             .lockless       =
> > +                     IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
>
>                 .lockess        = IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
>
> Random thought, maybe CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS or
> CONFIG_KVM_MMU_NOTIFIER_AGE_LOCKLESS instead of "YOUNG"?

CONFIG_KVM_MMU_NOTIFIER_AGING_LOCKLESS sounds good. Changed.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
  2024-11-05 18:43 ` [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions James Houghton
  2024-11-05 18:43 ` [PATCH v8 02/11] KVM: Add lockless memslot walk to KVM James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2024-11-05 18:45   ` Yu Zhao
  2025-01-10 22:34   ` Sean Christopherson
  2024-11-05 18:43 ` [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
follow-up patch to enable lockless Accessed and R/W/X bit clearing.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.h | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 2880fd392e0c..a24fca3f9e7f 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
 	return xchg(rcu_dereference(sptep), new_spte);
 }
 
+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+	return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
 static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 {
 	KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
@@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
 static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
 					  u64 mask, int level)
 {
-	atomic64_t *sptep_atomic;
-
-	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
-		sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
-		return (u64)atomic64_fetch_and(~mask, sptep_atomic);
-	}
+	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+		return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
 
 	__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
 	return old_spte;
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
  2024-11-05 18:43 ` [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
@ 2024-11-05 18:45   ` Yu Zhao
  2025-01-10 22:34   ` Sean Christopherson
  1 sibling, 0 replies; 41+ messages in thread
From: Yu Zhao @ 2024-11-05 18:45 UTC (permalink / raw)
  To: James Houghton
  Cc: Sean Christopherson, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
>
> This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
> follow-up patch to enable lockless Accessed and R/W/X bit clearing.
>
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Yu Zhao <yuzhao@google.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
  2024-11-05 18:43 ` [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
  2024-11-05 18:45   ` Yu Zhao
@ 2025-01-10 22:34   ` Sean Christopherson
  2025-01-27 19:51     ` James Houghton
  1 sibling, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2025-01-10 22:34 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Nov 05, 2024, James Houghton wrote:
> This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
> follow-up patch to enable lockless Accessed and R/W/X bit clearing.

This is a lie.  tdp_mmu_clear_spte_bits_atomic() can only be used to clear the
Accessed bit, clearing RWX bits for access-tracked SPTEs *must* be done with a
CMPXCHG so that the original RWX protections are preserved.

> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_iter.h | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 2880fd392e0c..a24fca3f9e7f 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
>  	return xchg(rcu_dereference(sptep), new_spte);
>  }
>  
> +static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
> +{
> +	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> +
> +	return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> +}
> +
>  static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  {
>  	KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
> @@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
>  static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
>  					  u64 mask, int level)
>  {
> -	atomic64_t *sptep_atomic;
> -
> -	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
> -		sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> -		return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> -	}
> +	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
> +		return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
>  
>  	__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
>  	return old_spte;
> -- 
> 2.47.0.199.ga7371fff76-goog
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
  2025-01-10 22:34   ` Sean Christopherson
@ 2025-01-27 19:51     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 19:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Jan 10, 2025 at 2:34 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
> > follow-up patch to enable lockless Accessed and R/W/X bit clearing.
>
> This is a lie.  tdp_mmu_clear_spte_bits_atomic() can only be used to clear the
> Accessed bit, clearing RWX bits for access-tracked SPTEs *must* be done with a
> CMPXCHG so that the original RWX protections are preserved.

I'm not sure why I wrote it like that. I've dropped the "and R/W/X" piece:

    This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
    follow-up patch to enable lockless Accessed bit clearing.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (2 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2024-11-06  8:22   ` kernel test robot
                     ` (2 more replies)
  2024-11-05 18:43 ` [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn James Houghton
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Walk the TDP MMU in an RCU read-side critical section without holding
mmu_lock when harvesting and potentially updating age information on
sptes. This requires a way to do RCU-safe walking of the tdp_mmu_roots;
do this with a new macro. The PTE modifications are now done atomically,
and kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for
the fact that kvm_age_gfn can now locklessly update the accessed bit and
the W/R/X bits).

If the cmpxchg for marking the spte for access tracking fails, leave it
as is and treat it as if it were young, as if the spte is being actively
modified, it is most likely young.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 10 ++++++++--
 arch/x86/kvm/mmu/tdp_iter.h     | 12 ++++++------
 arch/x86/kvm/mmu/tdp_mmu.c      | 23 ++++++++++++++++-------
 5 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 70c7ed0ef184..84ee08078686 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1455,6 +1455,7 @@ struct kvm_arch {
 	 * tdp_mmu_page set.
 	 *
 	 * For reads, this list is protected by:
+	 *	RCU alone or
 	 *	the MMU lock in read mode + RCU or
 	 *	the MMU lock in write mode
 	 *
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 1ed1e4f5d51c..97f747d60fe9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -23,6 +23,7 @@ config KVM_X86
 	select KVM_COMMON
 	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_ELIDE_TLB_FLUSH_IF_YOUNG
+	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 443845bb2e01..26797ccd34d8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1586,8 +1586,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, false);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1599,8 +1602,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, true);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index a24fca3f9e7f..f26d0b60d2dd 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 }
 
 /*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
+ * cleared by age_gfn_range().
  *
  * Note, non-leaf SPTEs do have Accessed bits and those bits are
  * technically volatile, but KVM doesn't consume the Accessed bit of
@@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
 {
 	return is_shadow_present_pte(old_spte) &&
-	       is_last_spte(old_spte, level) &&
-	       spte_has_volatile_bits(old_spte);
+	       is_last_spte(old_spte, level);
 }
 
 static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4508d868f1cd..f5b4f1060fff 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		     ((_only_valid) && (_root)->role.invalid))) {		\
 		} else
 
+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ */
+#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)			\
+	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)		\
+		if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||	\
+		    (_root)->role.invalid) {					\
+		} else
+
 #define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
 	__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
 
@@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
 	u64 new_spte;
 
 	if (spte_ad_enabled(iter->old_spte)) {
-		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
-							 iter->old_spte,
-							 shadow_accessed_mask,
-							 iter->level);
+		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+						shadow_accessed_mask);
 		new_spte = iter->old_spte & ~shadow_accessed_mask;
 	} else {
 		new_spte = mark_spte_for_access_track(iter->old_spte);
-		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
-							iter->old_spte, new_spte,
-							iter->level);
+		/*
+		 * It is safe for the following cmpxchg to fail. Leave the
+		 * Accessed bit set, as the spte is most likely young anyway.
+		 */
+		(void)__tdp_mmu_set_spte_atomic(iter, new_spte);
 	}
 
 	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-11-05 18:43 ` [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
@ 2024-11-06  8:22   ` kernel test robot
  2024-11-08  3:00     ` James Houghton
  2025-01-10 22:47   ` Sean Christopherson
  2025-01-27 19:57   ` James Houghton
  2 siblings, 1 reply; 41+ messages in thread
From: kernel test robot @ 2024-11-06  8:22 UTC (permalink / raw)
  To: James Houghton, Sean Christopherson, Paolo Bonzini
  Cc: oe-kbuild-all, David Matlack, David Rientjes, James Houghton,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

Hi James,

kernel test robot noticed the following build warnings:

[auto build test WARNING on a27e0515592ec9ca28e0d027f42568c47b314784]

url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/KVM-Remove-kvm_handle_hva_range-helper-functions/20241106-025133
base:   a27e0515592ec9ca28e0d027f42568c47b314784
patch link:    https://lore.kernel.org/r/20241105184333.2305744-5-jthoughton%40google.com
patch subject: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20241106/202411061526.RAuCXKJh-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241106/202411061526.RAuCXKJh-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411061526.RAuCXKJh-lkp@intel.com/

All warnings (new ones prefixed by >>):

   arch/x86/kvm/mmu/tdp_mmu.c: In function 'kvm_tdp_mmu_age_spte':
>> arch/x86/kvm/mmu/tdp_mmu.c:1189:23: warning: ignoring return value of '__tdp_mmu_set_spte_atomic' declared with attribute 'warn_unused_result' [-Wunused-result]
    1189 |                 (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
         |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +1189 arch/x86/kvm/mmu/tdp_mmu.c

  1166	
  1167	/*
  1168	 * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
  1169	 * if any of the GFNs in the range have been accessed.
  1170	 *
  1171	 * No need to mark the corresponding PFN as accessed as this call is coming
  1172	 * from the clear_young() or clear_flush_young() notifier, which uses the
  1173	 * return value to determine if the page has been accessed.
  1174	 */
  1175	static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
  1176	{
  1177		u64 new_spte;
  1178	
  1179		if (spte_ad_enabled(iter->old_spte)) {
  1180			iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
  1181							shadow_accessed_mask);
  1182			new_spte = iter->old_spte & ~shadow_accessed_mask;
  1183		} else {
  1184			new_spte = mark_spte_for_access_track(iter->old_spte);
  1185			/*
  1186			 * It is safe for the following cmpxchg to fail. Leave the
  1187			 * Accessed bit set, as the spte is most likely young anyway.
  1188			 */
> 1189			(void)__tdp_mmu_set_spte_atomic(iter, new_spte);
  1190		}
  1191	
  1192		trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
  1193					       iter->old_spte, new_spte);
  1194	}
  1195	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-11-06  8:22   ` kernel test robot
@ 2024-11-08  3:00     ` James Houghton
  2024-11-08 22:45       ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: James Houghton @ 2024-11-08  3:00 UTC (permalink / raw)
  To: kernel test robot
  Cc: Sean Christopherson, Paolo Bonzini, oe-kbuild-all, David Matlack,
	David Rientjes, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao,
	Axel Rasmussen, kvm, linux-kernel

On Wed, Nov 6, 2024 at 3:22 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi James,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on a27e0515592ec9ca28e0d027f42568c47b314784]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/KVM-Remove-kvm_handle_hva_range-helper-functions/20241106-025133
> base:   a27e0515592ec9ca28e0d027f42568c47b314784
> patch link:    https://lore.kernel.org/r/20241105184333.2305744-5-jthoughton%40google.com
> patch subject: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
> config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20241106/202411061526.RAuCXKJh-lkp@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241106/202411061526.RAuCXKJh-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202411061526.RAuCXKJh-lkp@intel.com/
>
> All warnings (new ones prefixed by >>):
>
>    arch/x86/kvm/mmu/tdp_mmu.c: In function 'kvm_tdp_mmu_age_spte':
> >> arch/x86/kvm/mmu/tdp_mmu.c:1189:23: warning: ignoring return value of '__tdp_mmu_set_spte_atomic' declared with attribute 'warn_unused_result' [-Wunused-result]
>     1189 |                 (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
>          |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>

Well, I saw this compiler warning in my latest rebase and thought the
`(void)` would fix it. I guess the next best way to fix it would be to
assign to an `int __maybe_unused`. I'll do for a v9, or Sean if you're
going to take the series (maybe? :)), go ahead and apply whatever fix
you like.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-11-08  3:00     ` James Houghton
@ 2024-11-08 22:45       ` Sean Christopherson
  2024-11-11 14:45         ` James Houghton
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2024-11-08 22:45 UTC (permalink / raw)
  To: James Houghton
  Cc: kernel test robot, Paolo Bonzini, oe-kbuild-all, David Matlack,
	David Rientjes, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao,
	Axel Rasmussen, kvm, linux-kernel

On Thu, Nov 07, 2024, James Houghton wrote:
> On Wed, Nov 6, 2024 at 3:22 AM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi James,
> >
> > kernel test robot noticed the following build warnings:
> >
> > [auto build test WARNING on a27e0515592ec9ca28e0d027f42568c47b314784]
> >
> > url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/KVM-Remove-kvm_handle_hva_range-helper-functions/20241106-025133
> > base:   a27e0515592ec9ca28e0d027f42568c47b314784
> > patch link:    https://lore.kernel.org/r/20241105184333.2305744-5-jthoughton%40google.com
> > patch subject: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
> > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20241106/202411061526.RAuCXKJh-lkp@intel.com/config)
> > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241106/202411061526.RAuCXKJh-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202411061526.RAuCXKJh-lkp@intel.com/
> >
> > All warnings (new ones prefixed by >>):
> >
> >    arch/x86/kvm/mmu/tdp_mmu.c: In function 'kvm_tdp_mmu_age_spte':
> > >> arch/x86/kvm/mmu/tdp_mmu.c:1189:23: warning: ignoring return value of '__tdp_mmu_set_spte_atomic' declared with attribute 'warn_unused_result' [-Wunused-result]
> >     1189 |                 (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
> >          |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> 
> Well, I saw this compiler warning in my latest rebase and thought the
> `(void)` would fix it. I guess the next best way to fix it would be to
> assign to an `int __maybe_unused`. I'll do for a v9, or Sean if you're
> going to take the series (maybe? :)), go ahead and apply whatever fix
> you like.

Heh, actually, the compiler is correct.  Ignoring the return value is a bug.
KVM should instead return immediately, as falling through to the tracepoint will
log bogus information.  E.g. will show a !PRESENT SPTE, instead of whatever the
current SPTE actually is (iter->old_spte will have been updating to the current
value of the SPTE).

	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
				       iter->old_spte, new_spte);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f5b4f1060fff..cc8ae998b7c8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1186,7 +1186,8 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
                 * It is safe for the following cmpxchg to fail. Leave the
                 * Accessed bit set, as the spte is most likely young anyway.
                 */
-               (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
+               if (__tdp_mmu_set_spte_atomic(iter, new_spte))
+                       return;
        }
 
        trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-11-08 22:45       ` Sean Christopherson
@ 2024-11-11 14:45         ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2024-11-11 14:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kernel test robot, Paolo Bonzini, oe-kbuild-all, David Matlack,
	David Rientjes, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao,
	Axel Rasmussen, kvm, linux-kernel

On Fri, Nov 8, 2024 at 5:45 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Nov 07, 2024, James Houghton wrote:
> > On Wed, Nov 6, 2024 at 3:22 AM kernel test robot <lkp@intel.com> wrote:
> > >
> > > Hi James,
> > >
> > > kernel test robot noticed the following build warnings:
> > >
> > > [auto build test WARNING on a27e0515592ec9ca28e0d027f42568c47b314784]
> > >
> > > url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/KVM-Remove-kvm_handle_hva_range-helper-functions/20241106-025133
> > > base:   a27e0515592ec9ca28e0d027f42568c47b314784
> > > patch link:    https://lore.kernel.org/r/20241105184333.2305744-5-jthoughton%40google.com
> > > patch subject: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
> > > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20241106/202411061526.RAuCXKJh-lkp@intel.com/config)
> > > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> > > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241106/202411061526.RAuCXKJh-lkp@intel.com/reproduce)
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <lkp@intel.com>
> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202411061526.RAuCXKJh-lkp@intel.com/
> > >
> > > All warnings (new ones prefixed by >>):
> > >
> > >    arch/x86/kvm/mmu/tdp_mmu.c: In function 'kvm_tdp_mmu_age_spte':
> > > >> arch/x86/kvm/mmu/tdp_mmu.c:1189:23: warning: ignoring return value of '__tdp_mmu_set_spte_atomic' declared with attribute 'warn_unused_result' [-Wunused-result]
> > >     1189 |                 (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
> > >          |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > >
> >
> > Well, I saw this compiler warning in my latest rebase and thought the
> > `(void)` would fix it. I guess the next best way to fix it would be to
> > assign to an `int __maybe_unused`. I'll do for a v9, or Sean if you're
> > going to take the series (maybe? :)), go ahead and apply whatever fix
> > you like.
>
> Heh, actually, the compiler is correct.  Ignoring the return value is a bug.
> KVM should instead return immediately, as falling through to the tracepoint will
> log bogus information.  E.g. will show a !PRESENT SPTE, instead of whatever the
> current SPTE actually is (iter->old_spte will have been updating to the current
> value of the SPTE).
>
>         trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
>                                        iter->old_spte, new_spte);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f5b4f1060fff..cc8ae998b7c8 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1186,7 +1186,8 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
>                  * It is safe for the following cmpxchg to fail. Leave the
>                  * Accessed bit set, as the spte is most likely young anyway.
>                  */
> -               (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
> +               if (__tdp_mmu_set_spte_atomic(iter, new_spte))
> +                       return;
>         }
>
>         trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
>

Oh yes, you're right. Thanks Sean! The diff LGTM.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-11-05 18:43 ` [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
  2024-11-06  8:22   ` kernel test robot
@ 2025-01-10 22:47   ` Sean Christopherson
  2025-01-27 19:52     ` James Houghton
  2025-01-27 19:57   ` James Houghton
  2 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2025-01-10 22:47 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Nov 05, 2024, James Houghton wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index a24fca3f9e7f..f26d0b60d2dd 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  }
>  
>  /*
> - * SPTEs must be modified atomically if they are shadow-present, leaf
> - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> - * handler, and Accessed and Dirty bits can be set by the CPU.
> + * SPTEs must be modified atomically if they have bits that can be set outside
> + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> + * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
> + * cleared by age_gfn_range().
>   *
>   * Note, non-leaf SPTEs do have Accessed bits and those bits are
>   * technically volatile, but KVM doesn't consume the Accessed bit of
> @@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
>  {
>  	return is_shadow_present_pte(old_spte) &&
> -	       is_last_spte(old_spte, level) &&
> -	       spte_has_volatile_bits(old_spte);
> +	       is_last_spte(old_spte, level);

I don't like this change on multiple fronts.  First and foremost, it results in
spte_has_volatile_bits() being wrong for the TDP MMU.  Second, the same logic
applies to the shadow MMU; the rmap lock prevents a use-after-free of the page
that owns the SPTE, but the zapping of the SPTE happens before the writer grabs
the rmap lock.

Lastly, I'm very, very tempted to say we should omit Accessed state from
spte_has_volatile_bits() and rename it to something like spte_needs_atomic_write().
KVM x86 no longer flushes TLBs on aging, so we're already committed to incorrectly
thinking a page is old in rare cases, for the sake of performance.  The odds of
KVM clobbering the Accessed bit are probably smaller than the odds of missing an
Accessed update due to a stale TLB entry.

Note, only the shadow_accessed_mask check can be removed.  KVM needs to ensure
access-tracked SPTEs are zapped properly, and dirty logging can't have false
negatives.

>  }
>  
>  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4508d868f1cd..f5b4f1060fff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>  		     ((_only_valid) && (_root)->role.invalid))) {		\
>  		} else
>  
> +/*
> + * Iterate over all TDP MMU roots in an RCU read-side critical section.

Heh, that's pretty darn obvious.  It would be far more helpful if the comment
explained the usage rules, e.g. what is safe (at a high level).

> + */
> +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)			\
> +	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)		\
> +		if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||	\
> +		    (_root)->role.invalid) {					\
> +		} else
> +
>  #define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
>  	__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
>  
> @@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
>  	u64 new_spte;
>  
>  	if (spte_ad_enabled(iter->old_spte)) {
> -		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> -							 iter->old_spte,
> -							 shadow_accessed_mask,
> -							 iter->level);
> +		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> +						shadow_accessed_mask);

Align, and let this poke past 80:

		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
								shadow_accessed_mask);

>  		new_spte = iter->old_spte & ~shadow_accessed_mask;
>  	} else {
>  		new_spte = mark_spte_for_access_track(iter->old_spte);
> -		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
> -							iter->old_spte, new_spte,
> -							iter->level);
> +		/*
> +		 * It is safe for the following cmpxchg to fail. Leave the
> +		 * Accessed bit set, as the spte is most likely young anyway.
> +		 */
> +		(void)__tdp_mmu_set_spte_atomic(iter, new_spte);

Just a reminder that this needs to be:

		if (__tdp_mmu_set_spte_atomic(iter, new_spte))
			return;

>  	}
>  
>  	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
> -- 
> 2.47.0.199.ga7371fff76-goog
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2025-01-10 22:47   ` Sean Christopherson
@ 2025-01-27 19:52     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 19:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Jan 10, 2025 at 2:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index a24fca3f9e7f..f26d0b60d2dd 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  }
> >
> >  /*
> > - * SPTEs must be modified atomically if they are shadow-present, leaf
> > - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> > - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> > - * handler, and Accessed and Dirty bits can be set by the CPU.
> > + * SPTEs must be modified atomically if they have bits that can be set outside
> > + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> > + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> > + * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
> > + * cleared by age_gfn_range().
> >   *
> >   * Note, non-leaf SPTEs do have Accessed bits and those bits are
> >   * technically volatile, but KVM doesn't consume the Accessed bit of
> > @@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
> >  {
> >       return is_shadow_present_pte(old_spte) &&
> > -            is_last_spte(old_spte, level) &&
> > -            spte_has_volatile_bits(old_spte);
> > +            is_last_spte(old_spte, level);
>
> I don't like this change on multiple fronts.  First and foremost, it results in
> spte_has_volatile_bits() being wrong for the TDP MMU.  Second, the same logic
> applies to the shadow MMU; the rmap lock prevents a use-after-free of the page
> that owns the SPTE, but the zapping of the SPTE happens before the writer grabs
> the rmap lock.

Thanks Sean, yes I forgot about the shadow MMU case.

> Lastly, I'm very, very tempted to say we should omit Accessed state from
> spte_has_volatile_bits() and rename it to something like spte_needs_atomic_write().
> KVM x86 no longer flushes TLBs on aging, so we're already committed to incorrectly
> thinking a page is old in rare cases, for the sake of performance.  The odds of
> KVM clobbering the Accessed bit are probably smaller than the odds of missing an
> Accessed update due to a stale TLB entry.
>
> Note, only the shadow_accessed_mask check can be removed.  KVM needs to ensure
> access-tracked SPTEs are zapped properly, and dirty logging can't have false
> negatives.

I've dropped the change to kvm_tdp_mmu_spte_need_atomic_write() and
instead applied this diff.

--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
                return true;

        if (spte_ad_enabled(spte)) {
-               if (!(spte & shadow_accessed_mask) ||
-                   (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
+               /*
+                * Do not check the Accessed bit. It can be set (by the CPU)
+                * and cleared (by kvm_tdp_mmu_age_spte()) without holding
+                * the mmu_lock, but when clearing the Accessed bit, we do
+                * not invalidate the TLB, so we can already miss Accessed bit
+                * updates.
+                */
+               if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
                        return true;
        }

I've also included a new patch to rename spte_has_volatile_bits() to
spte_needs_atomic_write() like you suggested. I merely renamed it in
all locations, including documentation; I haven't reworded the
documentation's use of the word "volatile."

>
> >  }
> >
> >  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4508d868f1cd..f5b4f1060fff 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> >                    ((_only_valid) && (_root)->role.invalid))) {               \
> >               } else
> >
> > +/*
> > + * Iterate over all TDP MMU roots in an RCU read-side critical section.
>
> Heh, that's pretty darn obvious.  It would be far more helpful if the comment
> explained the usage rules, e.g. what is safe (at a high level).

How's this?

+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ * It is safe to iterate over the SPTEs under the root, but their values will
+ * be unstable, so all writes must be atomic. As this routine is meant to be
+ * used without holding the mmu_lock at all, any bits that are flipped must
+ * be reflected in kvm_tdp_mmu_spte_need_atomic_write().
+ */

> > + */
> > +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)                 \
> > +     list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)         \
> > +             if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||     \
> > +                 (_root)->role.invalid) {                                    \
> > +             } else
> > +
> >  #define for_each_tdp_mmu_root(_kvm, _root, _as_id)                   \
> >       __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
> >
> > @@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
> >       u64 new_spte;
> >
> >       if (spte_ad_enabled(iter->old_spte)) {
> > -             iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> > -                                                      iter->old_spte,
> > -                                                      shadow_accessed_mask,
> > -                                                      iter->level);
> > +             iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> > +                                             shadow_accessed_mask);
>
> Align, and let this poke past 80:
>
>                 iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
>                                                                 shadow_accessed_mask);

Done.

> >               new_spte = iter->old_spte & ~shadow_accessed_mask;
> >       } else {
> >               new_spte = mark_spte_for_access_track(iter->old_spte);
> > -             iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
> > -                                                     iter->old_spte, new_spte,
> > -                                                     iter->level);
> > +             /*
> > +              * It is safe for the following cmpxchg to fail. Leave the
> > +              * Accessed bit set, as the spte is most likely young anyway.
> > +              */
> > +             (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
>
> Just a reminder that this needs to be:
>
>                 if (__tdp_mmu_set_spte_atomic(iter, new_spte))
>                         return;
>

Already applied, thanks!


> >       }
> >
> >       trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
> > --
> > 2.47.0.199.ga7371fff76-goog
> >

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-11-05 18:43 ` [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
  2024-11-06  8:22   ` kernel test robot
  2025-01-10 22:47   ` Sean Christopherson
@ 2025-01-27 19:57   ` James Houghton
  2025-01-27 20:09     ` James Houghton
  2 siblings, 1 reply; 41+ messages in thread
From: James Houghton @ 2025-01-27 19:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, Marc Zyngier, Oliver Upton, Wei Xu,
	Yu Zhao, Axel Rasmussen, kvm, linux-kernel, Venkatesh Srinivas

On Tue, Nov 5, 2024 at 10:43 AM James Houghton <jthoughton@google.com> wrote:
>
>  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4508d868f1cd..f5b4f1060fff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>                      ((_only_valid) && (_root)->role.invalid))) {               \
>                 } else
>
> +/*
> + * Iterate over all TDP MMU roots in an RCU read-side critical section.
> + */
> +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)                   \
> +       list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)         \
> +               if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||     \
> +                   (_root)->role.invalid) {                                    \
> +               } else
> +

Venkatesh noticed that this function is unused in this patch. This was
a mistake in the latest rebase. The diff should have been applied:

@@ -1192,15 +1206,15 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
        struct tdp_iter iter;
        bool ret = false;

+       guard(rcu)();
+
        /*
         * Don't support rescheduling, none of the MMU notifiers that funnel
         * into this helper allow blocking; it'd be dead, wasteful code.  Note,
         * this helper must NOT be used to unmap GFNs, as it processes only
         * valid roots!
         */
-       for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
-               guard(rcu)();
-
+       for_each_valid_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
                tdp_root_for_each_leaf_pte(iter, root, range->start,
range->end) {
                        if (!is_accessed_spte(iter.old_spte))
                                continue;

This bug will show up as a LOCKDEP warning on CONFIG_PROVE_RCU_LIST=y
kernels, as list_for_each_entry_rcu() is called outside of an RCU
read-side critical section.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2025-01-27 19:57   ` James Houghton
@ 2025-01-27 20:09     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 20:09 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, Marc Zyngier, Oliver Upton, Wei Xu,
	Yu Zhao, Axel Rasmussen, kvm, linux-kernel, Venkatesh Srinivas

On Mon, Jan 27, 2025 at 11:57 AM James Houghton <jthoughton@google.com> wrote:
>
> On Tue, Nov 5, 2024 at 10:43 AM James Houghton <jthoughton@google.com> wrote:
> >
> >  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4508d868f1cd..f5b4f1060fff 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> >                      ((_only_valid) && (_root)->role.invalid))) {               \
> >                 } else
> >
> > +/*
> > + * Iterate over all TDP MMU roots in an RCU read-side critical section.
> > + */
> > +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)                   \
> > +       list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)         \
> > +               if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||     \
> > +                   (_root)->role.invalid) {                                    \
> > +               } else
> > +
>
> Venkatesh noticed that this function is unused in this patch. This was
> a mistake in the latest rebase. The diff should have been applied:
>
> @@ -1192,15 +1206,15 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
>         struct tdp_iter iter;
>         bool ret = false;
>
> +       guard(rcu)();
> +
>         /*
>          * Don't support rescheduling, none of the MMU notifiers that funnel
>          * into this helper allow blocking; it'd be dead, wasteful code.  Note,
>          * this helper must NOT be used to unmap GFNs, as it processes only
>          * valid roots!
>          */
> -       for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
> -               guard(rcu)();
> -
> +       for_each_valid_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
>                 tdp_root_for_each_leaf_pte(iter, root, range->start,
> range->end) {
>                         if (!is_accessed_spte(iter.old_spte))
>                                 continue;
>
> This bug will show up as a LOCKDEP warning on CONFIG_PROVE_RCU_LIST=y
> kernels, as list_for_each_entry_rcu() is called outside of an RCU
> read-side critical section.

There I go lying again. The LOCKDEP warning that will show up is the
lockdep_assert_held_write(mmu_lock) in
kvm_lockdep_assert_mmu_lock_held().

The RCU lockdep WARN would hit if I had used
for_each_valid_tdp_mmu_root_rcu() without moving the RCU lock
acquisition.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (3 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2024-11-05 18:46   ` Yu Zhao
  2025-01-10 22:59   ` Sean Christopherson
  2024-11-05 18:43 ` [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Reorder the TDP MMU check to be first for both kvm_test_age_gfn and
kvm_age_gfn. For kvm_test_age_gfn, this allows us to completely avoid
needing to grab the MMU lock when the TDP MMU reports that the page is
young. Do the same for kvm_age_gfn merely for consistency.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 26797ccd34d8..793565a3a573 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1586,15 +1586,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
+	if (tdp_mmu_enabled)
+		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
+
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		young = kvm_rmap_age_gfn_range(kvm, range, false);
+		young |= kvm_rmap_age_gfn_range(kvm, range, false);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-	if (tdp_mmu_enabled)
-		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
-
 	return young;
 }
 
@@ -1602,15 +1602,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm)) {
+	if (tdp_mmu_enabled)
+		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
+
+	if (!young && kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		young = kvm_rmap_age_gfn_range(kvm, range, true);
+		young |= kvm_rmap_age_gfn_range(kvm, range, true);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-	if (tdp_mmu_enabled)
-		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
-
 	return young;
 }
 
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
  2024-11-05 18:43 ` [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn James Houghton
@ 2024-11-05 18:46   ` Yu Zhao
  2025-01-10 22:59   ` Sean Christopherson
  1 sibling, 0 replies; 41+ messages in thread
From: Yu Zhao @ 2024-11-05 18:46 UTC (permalink / raw)
  To: James Houghton
  Cc: Sean Christopherson, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
>
> Reorder the TDP MMU check to be first for both kvm_test_age_gfn and
> kvm_age_gfn. For kvm_test_age_gfn, this allows us to completely avoid
> needing to grab the MMU lock when the TDP MMU reports that the page is
> young. Do the same for kvm_age_gfn merely for consistency.
>
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Yu Zhao <yuzhao@google.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
  2024-11-05 18:43 ` [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn James Houghton
  2024-11-05 18:46   ` Yu Zhao
@ 2025-01-10 22:59   ` Sean Christopherson
  2025-01-27 19:58     ` James Houghton
  1 sibling, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2025-01-10 22:59 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Nov 05, 2024, James Houghton wrote:
> Reorder the TDP MMU check to be first for both kvm_test_age_gfn and

() on functions, i.e. kvm_test_age_gfn().  That said, even better would be to
avoid using the function names.  Let the patch itself communicate which functions
are affected, and instead write the changelog as you would verbally communicate
the change.

> kvm_age_gfn. For kvm_test_age_gfn, this allows us to completely avoid

No "us" or "we".

> needing to grab the MMU lock when the TDP MMU reports that the page is
> young.

The changelog should make it clear that the patch actually does this, i.e. that
there is a functional change beyond just changing the ordering.  Ooh, and that
definitely needs to be captured in the shortlog.  I would even go so far as to
say it should be the focal point of the shortlog.

E.g. something like:

KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young

Reorder the processing of the TDP MMU versus the shadow MMU when aging
SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP
MMU reports that the page is young, i.e. completely avoid taking mmu_lock
if the TDP MMU SPTE is young.  Swap the order for the test-and-age helper
as well for consistency.

> Do the same for kvm_age_gfn merely for consistency.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
  2025-01-10 22:59   ` Sean Christopherson
@ 2025-01-27 19:58     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 19:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Jan 10, 2025 at 2:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > Reorder the TDP MMU check to be first for both kvm_test_age_gfn and
>
> () on functions, i.e. kvm_test_age_gfn().  That said, even better would be to
> avoid using the function names.  Let the patch itself communicate which functions
> are affected, and instead write the changelog as you would verbally communicate
> the change.
>
> > kvm_age_gfn. For kvm_test_age_gfn, this allows us to completely avoid
>
> No "us" or "we".
>
>
> > needing to grab the MMU lock when the TDP MMU reports that the page is
> > young.
>
> The changelog should make it clear that the patch actually does this, i.e. that
> there is a functional change beyond just changing the ordering.  Ooh, and that
> definitely needs to be captured in the shortlog.  I would even go so far as to
> say it should be the focal point of the shortlog.
>
> E.g. something like:
>
> KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young
>
> Reorder the processing of the TDP MMU versus the shadow MMU when aging
> SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP
> MMU reports that the page is young, i.e. completely avoid taking mmu_lock
> if the TDP MMU SPTE is young.  Swap the order for the test-and-age helper
> as well for consistency.

Thanks, I think this is worded very clearly. Applied verbatim.

Noted your tips for future changelogs.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (4 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2024-11-05 18:49   ` Yu Zhao
  2025-01-10 23:05   ` Sean Christopherson
  2024-11-05 18:43 ` [PATCH v8 07/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock James Houghton
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

Optimize both kvm_age_gfn and kvm_test_age_gfn's interaction with the
shadow MMU by, rather than checking if our memslot has rmaps, check if
there are any indirect_shadow_pages at all.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 793565a3a573..125d4c3ccceb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1582,6 +1582,11 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
 	return young;
 }
 
+static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
+{
+	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
@@ -1589,7 +1594,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
 
-	if (kvm_memslots_have_rmaps(kvm)) {
+	if (kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young |= kvm_rmap_age_gfn_range(kvm, range, false);
 		write_unlock(&kvm->mmu_lock);
@@ -1605,7 +1610,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
 
-	if (!young && kvm_memslots_have_rmaps(kvm)) {
+	if (!young && kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young |= kvm_rmap_age_gfn_range(kvm, range, true);
 		write_unlock(&kvm->mmu_lock);
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
  2024-11-05 18:43 ` [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
@ 2024-11-05 18:49   ` Yu Zhao
  2025-01-10 23:05   ` Sean Christopherson
  1 sibling, 0 replies; 41+ messages in thread
From: Yu Zhao @ 2024-11-05 18:49 UTC (permalink / raw)
  To: James Houghton
  Cc: Sean Christopherson, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
>
> Optimize both kvm_age_gfn and kvm_test_age_gfn's interaction with the
> shadow MMU by, rather than checking if our memslot has rmaps, check if
> there are any indirect_shadow_pages at all.
>
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Yu Zhao <yuzhao@google.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
  2024-11-05 18:43 ` [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
  2024-11-05 18:49   ` Yu Zhao
@ 2025-01-10 23:05   ` Sean Christopherson
  2025-01-27 19:58     ` James Houghton
  1 sibling, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2025-01-10 23:05 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Nov 05, 2024, James Houghton wrote:
> Optimize both kvm_age_gfn and kvm_test_age_gfn's interaction with the
> shadow MMU by, rather than checking if our memslot has rmaps, check if

No "our" (pronouns bad).

> there are any indirect_shadow_pages at all.

Again, use wording that is more conversational.  I also think it's worthwhile to
call out when this optimization is helpful.  E.g.

When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and
only if the VM has at least one shadow page, as opposed to checking if the
VM has rmaps.  Checking for rmaps will effectively yield a false positive
if the VM ran nested TDP VMs in the past, but is not currently doing so.

> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 793565a3a573..125d4c3ccceb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1582,6 +1582,11 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
>  	return young;
>  }
>  
> +static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)

I think this should be kvm_may_have_shadow_mmu_sptes(), or something along those
lines that makes it clear the check is imprecise.  E.g. to avoid someone thinking
that KVM is guaranteed to have shadow MMU SPTEs if it returns true.

> +{
> +	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> +}
> +
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	bool young = false;
> @@ -1589,7 +1594,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  	if (tdp_mmu_enabled)
>  		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
>  
> -	if (kvm_memslots_have_rmaps(kvm)) {
> +	if (kvm_has_shadow_mmu_sptes(kvm)) {
>  		write_lock(&kvm->mmu_lock);
>  		young |= kvm_rmap_age_gfn_range(kvm, range, false);
>  		write_unlock(&kvm->mmu_lock);
> @@ -1605,7 +1610,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  	if (tdp_mmu_enabled)
>  		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
>  
> -	if (!young && kvm_memslots_have_rmaps(kvm)) {
> +	if (!young && kvm_has_shadow_mmu_sptes(kvm)) {
>  		write_lock(&kvm->mmu_lock);
>  		young |= kvm_rmap_age_gfn_range(kvm, range, true);
>  		write_unlock(&kvm->mmu_lock);
> -- 
> 2.47.0.199.ga7371fff76-goog
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
  2025-01-10 23:05   ` Sean Christopherson
@ 2025-01-27 19:58     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 19:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Jan 10, 2025 at 3:05 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > Optimize both kvm_age_gfn and kvm_test_age_gfn's interaction with the
> > shadow MMU by, rather than checking if our memslot has rmaps, check if
>
> No "our" (pronouns bad).
>
> > there are any indirect_shadow_pages at all.
>
> Again, use wording that is more conversational.  I also think it's worthwhile to
> call out when this optimization is helpful.  E.g.
>
> When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and
> only if the VM has at least one shadow page, as opposed to checking if the
> VM has rmaps.  Checking for rmaps will effectively yield a false positive
> if the VM ran nested TDP VMs in the past, but is not currently doing so.

Applied verbatim. Thanks.

> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 793565a3a573..125d4c3ccceb 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1582,6 +1582,11 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> >       return young;
> >  }
> >
> > +static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
>
> I think this should be kvm_may_have_shadow_mmu_sptes(), or something along those
> lines that makes it clear the check is imprecise.  E.g. to avoid someone thinking
> that KVM is guaranteed to have shadow MMU SPTEs if it returns true.

Sounds good to me. Renamed.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v8 07/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (5 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2024-11-05 18:43 ` [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

From: Sean Christopherson <seanjc@google.com>

Refactor the pte_list and rmap code to always read and write rmap_head->val
exactly once, e.g. by collecting changes in a local variable and then
propagating those changes back to rmap_head->val as appropriate.  This will
allow implementing a per-rmap rwlock (of sorts) by adding a LOCKED bit into
the rmap value alongside the MANY bit.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 83 +++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 125d4c3ccceb..145ea180963e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -858,21 +858,24 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
 static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
+	unsigned long old_val, new_val;
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	if (!rmap_head->val) {
-		rmap_head->val = (unsigned long)spte;
-	} else if (!(rmap_head->val & KVM_RMAP_MANY)) {
+	old_val = rmap_head->val;
+
+	if (!old_val) {
+		new_val = (unsigned long)spte;
+	} else if (!(old_val & KVM_RMAP_MANY)) {
 		desc = kvm_mmu_memory_cache_alloc(cache);
-		desc->sptes[0] = (u64 *)rmap_head->val;
+		desc->sptes[0] = (u64 *)old_val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
 		desc->tail_count = 0;
-		rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
+		new_val = (unsigned long)desc | KVM_RMAP_MANY;
 		++count;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+		desc = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 		count = desc->tail_count + desc->spte_count;
 
 		/*
@@ -881,21 +884,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		 */
 		if (desc->spte_count == PTE_LIST_EXT) {
 			desc = kvm_mmu_memory_cache_alloc(cache);
-			desc->more = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+			desc->more = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 			desc->spte_count = 0;
 			desc->tail_count = count;
-			rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
+			new_val = (unsigned long)desc | KVM_RMAP_MANY;
+		} else {
+			new_val = old_val;
 		}
 		desc->sptes[desc->spte_count++] = spte;
 	}
+
+	rmap_head->val = new_val;
+
 	return count;
 }
 
-static void pte_list_desc_remove_entry(struct kvm *kvm,
-				       struct kvm_rmap_head *rmap_head,
+static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rmap_val,
 				       struct pte_list_desc *desc, int i)
 {
-	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(*rmap_val & ~KVM_RMAP_MANY);
 	int j = head_desc->spte_count - 1;
 
 	/*
@@ -922,9 +929,9 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 	 * head at the next descriptor, i.e. the new head.
 	 */
 	if (!head_desc->more)
-		rmap_head->val = 0;
+		*rmap_val = 0;
 	else
-		rmap_head->val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
+		*rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -932,24 +939,26 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 			    struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
+	unsigned long rmap_val;
 	int i;
 
-	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
-		return;
+	rmap_val = rmap_head->val;
+	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
+		goto out;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
-		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
-			return;
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm))
+			goto out;
 
-		rmap_head->val = 0;
+		rmap_val = 0;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+		desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(kvm, rmap_head,
+					pte_list_desc_remove_entry(kvm, &rmap_val,
 								   desc, i);
-					return;
+					goto out;
 				}
 			}
 			desc = desc->more;
@@ -957,6 +966,9 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 
 		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
 	}
+
+out:
+	rmap_head->val = rmap_val;
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -971,17 +983,19 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 				   struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc, *next;
+	unsigned long rmap_val;
 	int i;
 
-	if (!rmap_head->val)
+	rmap_val = rmap_head->val;
+	if (!rmap_val)
 		return false;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
-		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val);
 		goto out;
 	}
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 
 	for (; desc; desc = next) {
 		for (i = 0; i < desc->spte_count; i++)
@@ -997,14 +1011,15 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
+	unsigned long rmap_val = rmap_head->val;
 	struct pte_list_desc *desc;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return 0;
-	else if (!(rmap_head->val & KVM_RMAP_MANY))
+	else if (!(rmap_val & KVM_RMAP_MANY))
 		return 1;
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	return desc->tail_count + desc->spte_count;
 }
 
@@ -1047,6 +1062,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
  */
 struct rmap_iterator {
 	/* private fields */
+	struct rmap_head *head;
 	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
 	int pos;			/* index of the sptep */
 };
@@ -1061,18 +1077,19 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
+	unsigned long rmap_val = rmap_head->val;
 	u64 *sptep;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return NULL;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
+	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_head->val;
+		sptep = (u64 *)rmap_val;
 		goto out;
 	}
 
-	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (6 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 07/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2025-01-10 23:18   ` Sean Christopherson
  2025-01-27 21:52   ` James Houghton
  2024-11-05 18:43 ` [PATCH v8 09/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs James Houghton
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

From: Sean Christopherson <seanjc@google.com>

Steal another bit from rmap entries (which are word aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
the bit to implement a *very* rudimentary per-rmap spinlock.  The only
anticipated usage of the lock outside of mmu_lock is for aging gfns, and
collisions between aging and other MMU rmap operations are quite rare,
e.g. unless userspace is being silly and aging a tiny range over and over
in a tight loop, time between contention when aging an actively running VM
is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
necessary.

Note, the lock only protects the rmap structure itself, SPTEs that are
pointed at by a locked rmap can still be modified and zapped by another
task (KVM drops/zaps SPTEs before deleting the rmap entries)

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +-
 arch/x86/kvm/mmu/mmu.c          | 129 +++++++++++++++++++++++++++++---
 2 files changed, 120 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 84ee08078686..378b87ff5b1f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -26,6 +26,7 @@
 #include <linux/irqbypass.h>
 #include <linux/hyperv.h>
 #include <linux/kfifo.h>
+#include <linux/atomic.h>
 
 #include <asm/apic.h>
 #include <asm/pvclock-abi.h>
@@ -402,7 +403,7 @@ union kvm_cpu_role {
 };
 
 struct kvm_rmap_head {
-	unsigned long val;
+	atomic_long_t val;
 };
 
 struct kvm_pio_request {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 145ea180963e..1cdb77df0a4d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -847,11 +847,117 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * About rmap_head encoding:
  *
  * If the bit zero of rmap_head->val is clear, then it points to the only spte
- * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
+ * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
  * pte_list_desc containing more mappings.
  */
 #define KVM_RMAP_MANY	BIT(0)
 
+/*
+ * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
+ * operates with mmu_lock held for write), but rmaps can be walked without
+ * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
+ * being zapped/dropped _while the rmap is locked_.
+ *
+ * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
+ * done while holding mmu_lock for write.  This allows a task walking rmaps
+ * without holding mmu_lock to concurrently walk the same entries as a task
+ * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
+ * the rmaps, thus the walks are stable.
+ *
+ * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
+ * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
+ * ensures all "struct pte_list_desc" fields are stable.
+ */
+#define KVM_RMAP_LOCKED	BIT(1)
+
+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	unsigned long old_val, new_val;
+
+	/*
+	 * Elide the lock if the rmap is empty, as lockless walkers (read-only
+	 * mode) don't need to (and can't) walk an empty rmap, nor can they add
+	 * entries to the rmap.  I.e. the only paths that process empty rmaps
+	 * do so while holding mmu_lock for write, and are mutually exclusive.
+	 */
+	old_val = atomic_long_read(&rmap_head->val);
+	if (!old_val)
+		return 0;
+
+	do {
+		/*
+		 * If the rmap is locked, wait for it to be unlocked before
+		 * trying acquire the lock, e.g. to bounce the cache line.
+		 */
+		while (old_val & KVM_RMAP_LOCKED) {
+			old_val = atomic_long_read(&rmap_head->val);
+			cpu_relax();
+		}
+
+		/*
+		 * Recheck for an empty rmap, it may have been purged by the
+		 * task that held the lock.
+		 */
+		if (!old_val)
+			return 0;
+
+		new_val = old_val | KVM_RMAP_LOCKED;
+	/*
+	 * Use try_cmpxchg_acquire to prevent reads and writes to the rmap
+	 * from being reordered outside of the critical section created by
+	 * __kvm_rmap_lock.
+	 *
+	 * Pairs with smp_store_release in kvm_rmap_unlock.
+	 *
+	 * For the !old_val case, no ordering is needed, as there is no rmap
+	 * to walk.
+	 */
+	} while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));
+
+	/* Return the old value, i.e. _without_ the LOCKED bit set. */
+	return old_val;
+}
+
+static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
+			    unsigned long new_val)
+{
+	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	/*
+	 * Ensure that all accesses to the rmap have completed
+	 * before we actually unlock the rmap.
+	 *
+	 * Pairs with the atomic_long_try_cmpxchg_acquire in __kvm_rmap_lock.
+	 */
+	atomic_long_set_release(&rmap_head->val, new_val);
+}
+
+static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
+{
+	return atomic_long_read(&rmap_head->val) & ~KVM_RMAP_LOCKED;
+}
+
+/*
+ * If mmu_lock isn't held, rmaps can only locked in read-only mode.  The actual
+ * locking is the same, but the caller is disallowed from modifying the rmap,
+ * and so the unlock flow is a nop if the rmap is/was empty.
+ */
+__maybe_unused
+static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
+{
+	return __kvm_rmap_lock(rmap_head);
+}
+
+__maybe_unused
+static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
+				     unsigned long old_val)
+{
+	if (!old_val)
+		return;
+
+	KVM_MMU_WARN_ON(old_val != kvm_rmap_get(rmap_head));
+	atomic_long_set(&rmap_head->val, old_val);
+}
+
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
@@ -862,7 +968,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	old_val = rmap_head->val;
+	old_val = kvm_rmap_lock(rmap_head);
 
 	if (!old_val) {
 		new_val = (unsigned long)spte;
@@ -894,7 +1000,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		desc->sptes[desc->spte_count++] = spte;
 	}
 
-	rmap_head->val = new_val;
+	kvm_rmap_unlock(rmap_head, new_val);
 
 	return count;
 }
@@ -942,7 +1048,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
 		goto out;
 
@@ -968,7 +1074,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	}
 
 out:
-	rmap_head->val = rmap_val;
+	kvm_rmap_unlock(rmap_head, rmap_val);
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -986,7 +1092,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (!rmap_val)
 		return false;
 
@@ -1005,13 +1111,13 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	kvm_rmap_unlock(rmap_head, 0);
 	return true;
 }
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	struct pte_list_desc *desc;
 
 	if (!rmap_val)
@@ -1077,7 +1183,7 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	u64 *sptep;
 
 	if (!rmap_val)
@@ -1412,7 +1518,7 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 	while (++iterator->rmap <= iterator->end_rmap) {
 		iterator->gfn += KVM_PAGES_PER_HPAGE(iterator->level);
 
-		if (iterator->rmap->val)
+		if (atomic_long_read(&iterator->rmap->val))
 			return;
 	}
 
@@ -2450,7 +2556,8 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			 * avoids retaining a large number of stale nested SPs.
 			 */
 			if (tdp_enabled && invalid_list &&
-			    child->role.guest_mode && !child->parent_ptes.val)
+			    child->role.guest_mode &&
+			    !atomic_long_read(&child->parent_ptes.val))
 				return kvm_mmu_prepare_zap_page(kvm, child,
 								invalid_list);
 		}
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-11-05 18:43 ` [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
@ 2025-01-10 23:18   ` Sean Christopherson
  2025-01-27 21:42     ` James Houghton
  2025-01-27 21:52   ` James Houghton
  1 sibling, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2025-01-10 23:18 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Nov 05, 2024, James Houghton wrote:
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +	unsigned long old_val, new_val;
> +
> +	/*
> +	 * Elide the lock if the rmap is empty, as lockless walkers (read-only
> +	 * mode) don't need to (and can't) walk an empty rmap, nor can they add
> +	 * entries to the rmap.  I.e. the only paths that process empty rmaps
> +	 * do so while holding mmu_lock for write, and are mutually exclusive.
> +	 */
> +	old_val = atomic_long_read(&rmap_head->val);
> +	if (!old_val)
> +		return 0;
> +
> +	do {
> +		/*
> +		 * If the rmap is locked, wait for it to be unlocked before
> +		 * trying acquire the lock, e.g. to bounce the cache line.
> +		 */
> +		while (old_val & KVM_RMAP_LOCKED) {
> +			old_val = atomic_long_read(&rmap_head->val);
> +			cpu_relax();
> +		}

As Lai Jiangshan pointed out[1][2], this should PAUSE first, then re-read the SPTE,
and KVM needs to disable preemption while holding the lock, because this is nothing
more than a rudimentary spinlock.

[1] https://lore.kernel.org/all/ZrooozABEWSnwzxh@google.com
[2] https://lore.kernel.org/all/Zrt5eNArfQA7x1qj@google.com

I think this?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1a0950b77126..9dac1bbb77d4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -873,6 +873,8 @@ static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
 {
        unsigned long old_val, new_val;
 
+       lockdep_assert_preemption_disabled();
+
        /*
         * Elide the lock if the rmap is empty, as lockless walkers (read-only
         * mode) don't need to (and can't) walk an empty rmap, nor can they add
@@ -889,8 +891,8 @@ static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
                 * trying acquire the lock, e.g. to bounce the cache line.
                 */
                while (old_val & KVM_RMAP_LOCKED) {
-                       old_val = atomic_long_read(&rmap_head->val);
                        cpu_relax();
+                       old_val = atomic_long_read(&rmap_head->val);
                }
 
                /*
@@ -931,6 +933,8 @@ static unsigned long kvm_rmap_lock(struct kvm *kvm,
 static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
                            unsigned long new_val)
 {
+       lockdep_assert_held_write(&kvm->mmu_lock);
+
        KVM_MMU_WARN_ON(new_val & KVM_RMAP_LOCKED);
        /*
         * Ensure that all accesses to the rmap have completed
@@ -948,12 +952,21 @@ static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
 
 /*
  * If mmu_lock isn't held, rmaps can only locked in read-only mode.  The actual
- * locking is the same, but the caller is disallowed from modifying the rmap,
- * and so the unlock flow is a nop if the rmap is/was empty.
+ * locking is the same, but preemption needs to be manually disabled (because
+ * a spinlock isn't already held) and the caller is disallowed from modifying
+ * the rmap, and so the unlock flow is a nop if the rmap is/was empty.  Note,
+ * preemption must be disable *before* acquiring the bitlock.  If the rmap is
+ * empty, i.e. isn't truly locked, immediately re-enable preemption.
  */
 static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
 {
-       return __kvm_rmap_lock(rmap_head);
+       unsigned rmap_val;
+       preempt_disable();
+
+       rmap_val = __kvm_rmap_lock(rmap_head);
+       if (!rmap_val)
+               preempt_enable();
+       return rmap_val;
 }
 
 static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
@@ -964,6 +977,7 @@ static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
 
        KVM_MMU_WARN_ON(old_val != kvm_rmap_get(rmap_head));
        atomic_long_set(&rmap_head->val, old_val);
+       preempt_enable();
 }
 
 /*

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2025-01-10 23:18   ` Sean Christopherson
@ 2025-01-27 21:42     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 21:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Jan 10, 2025 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +     unsigned long old_val, new_val;
> > +
> > +     /*
> > +      * Elide the lock if the rmap is empty, as lockless walkers (read-only
> > +      * mode) don't need to (and can't) walk an empty rmap, nor can they add
> > +      * entries to the rmap.  I.e. the only paths that process empty rmaps
> > +      * do so while holding mmu_lock for write, and are mutually exclusive.
> > +      */
> > +     old_val = atomic_long_read(&rmap_head->val);
> > +     if (!old_val)
> > +             return 0;
> > +
> > +     do {
> > +             /*
> > +              * If the rmap is locked, wait for it to be unlocked before
> > +              * trying acquire the lock, e.g. to bounce the cache line.
> > +              */
> > +             while (old_val & KVM_RMAP_LOCKED) {
> > +                     old_val = atomic_long_read(&rmap_head->val);
> > +                     cpu_relax();
> > +             }
>
> As Lai Jiangshan pointed out[1][2], this should PAUSE first, then re-read the SPTE,
> and KVM needs to disable preemption while holding the lock, because this is nothing
> more than a rudimentary spinlock.

Ah! Sorry for missing this.

>
> [1] https://lore.kernel.org/all/ZrooozABEWSnwzxh@google.com
> [2] https://lore.kernel.org/all/Zrt5eNArfQA7x1qj@google.com
>
> I think this?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1a0950b77126..9dac1bbb77d4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -873,6 +873,8 @@ static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
>  {
>         unsigned long old_val, new_val;
>
> +       lockdep_assert_preemption_disabled();
> +
>         /*
>          * Elide the lock if the rmap is empty, as lockless walkers (read-only
>          * mode) don't need to (and can't) walk an empty rmap, nor can they add
> @@ -889,8 +891,8 @@ static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
>                  * trying acquire the lock, e.g. to bounce the cache line.
>                  */
>                 while (old_val & KVM_RMAP_LOCKED) {
> -                       old_val = atomic_long_read(&rmap_head->val);
>                         cpu_relax();
> +                       old_val = atomic_long_read(&rmap_head->val);
>                 }
>
>                 /*
> @@ -931,6 +933,8 @@ static unsigned long kvm_rmap_lock(struct kvm *kvm,
>  static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
>                             unsigned long new_val)
>  {
> +       lockdep_assert_held_write(&kvm->mmu_lock);
> +
>         KVM_MMU_WARN_ON(new_val & KVM_RMAP_LOCKED);
>         /*
>          * Ensure that all accesses to the rmap have completed
> @@ -948,12 +952,21 @@ static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
>
>  /*
>   * If mmu_lock isn't held, rmaps can only locked in read-only mode.  The actual
> - * locking is the same, but the caller is disallowed from modifying the rmap,
> - * and so the unlock flow is a nop if the rmap is/was empty.
> + * locking is the same, but preemption needs to be manually disabled (because
> + * a spinlock isn't already held) and the caller is disallowed from modifying
> + * the rmap, and so the unlock flow is a nop if the rmap is/was empty.  Note,
> + * preemption must be disable *before* acquiring the bitlock.  If the rmap is
> + * empty, i.e. isn't truly locked, immediately re-enable preemption.
>   */
>  static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
>  {
> -       return __kvm_rmap_lock(rmap_head);
> +       unsigned rmap_val;
> +       preempt_disable();
> +
> +       rmap_val = __kvm_rmap_lock(rmap_head);
> +       if (!rmap_val)
> +               preempt_enable();
> +       return rmap_val;
>  }
>
>  static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
> @@ -964,6 +977,7 @@ static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
>
>         KVM_MMU_WARN_ON(old_val != kvm_rmap_get(rmap_head));
>         atomic_long_set(&rmap_head->val, old_val);
> +       preempt_enable();
>  }
>
>  /*

I don't see anything wrong with these changes. Thanks! Applied.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-11-05 18:43 ` [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
  2025-01-10 23:18   ` Sean Christopherson
@ 2025-01-27 21:52   ` James Houghton
  1 sibling, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-01-27 21:52 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, Marc Zyngier, Oliver Upton, Wei Xu,
	Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Tue, Nov 5, 2024 at 10:43 AM James Houghton <jthoughton@google.com> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Steal another bit from rmap entries (which are word aligned pointers, i.e.
> have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
> the bit to implement a *very* rudimentary per-rmap spinlock.  The only
> anticipated usage of the lock outside of mmu_lock is for aging gfns, and
> collisions between aging and other MMU rmap operations are quite rare,
> e.g. unless userspace is being silly and aging a tiny range over and over
> in a tight loop, time between contention when aging an actively running VM
> is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
> necessary.
>
> Note, the lock only protects the rmap structure itself, SPTEs that are
> pointed at by a locked rmap can still be modified and zapped by another
> task (KVM drops/zaps SPTEs before deleting the rmap entries)
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: James Houghton <jthoughton@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   3 +-
>  arch/x86/kvm/mmu/mmu.c          | 129 +++++++++++++++++++++++++++++---
>  2 files changed, 120 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 84ee08078686..378b87ff5b1f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -26,6 +26,7 @@
>  #include <linux/irqbypass.h>
>  #include <linux/hyperv.h>
>  #include <linux/kfifo.h>
> +#include <linux/atomic.h>
>
>  #include <asm/apic.h>
>  #include <asm/pvclock-abi.h>
> @@ -402,7 +403,7 @@ union kvm_cpu_role {
>  };
>
>  struct kvm_rmap_head {
> -       unsigned long val;
> +       atomic_long_t val;
>  };
>
>  struct kvm_pio_request {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 145ea180963e..1cdb77df0a4d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -847,11 +847,117 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
>   * About rmap_head encoding:
>   *
>   * If the bit zero of rmap_head->val is clear, then it points to the only spte
> - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
> + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
>   * pte_list_desc containing more mappings.
>   */
>  #define KVM_RMAP_MANY  BIT(0)
>
> +/*
> + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> + * operates with mmu_lock held for write), but rmaps can be walked without
> + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> + * being zapped/dropped _while the rmap is locked_.
> + *
> + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> + * done while holding mmu_lock for write.  This allows a task walking rmaps
> + * without holding mmu_lock to concurrently walk the same entries as a task
> + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> + * the rmaps, thus the walks are stable.
> + *
> + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> + * ensures all "struct pte_list_desc" fields are stable.
> + */
> +#define KVM_RMAP_LOCKED        BIT(1)
> +
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       unsigned long old_val, new_val;
> +
> +       /*
> +        * Elide the lock if the rmap is empty, as lockless walkers (read-only
> +        * mode) don't need to (and can't) walk an empty rmap, nor can they add
> +        * entries to the rmap.  I.e. the only paths that process empty rmaps
> +        * do so while holding mmu_lock for write, and are mutually exclusive.
> +        */
> +       old_val = atomic_long_read(&rmap_head->val);
> +       if (!old_val)
> +               return 0;
> +
> +       do {
> +               /*
> +                * If the rmap is locked, wait for it to be unlocked before
> +                * trying acquire the lock, e.g. to bounce the cache line.
> +                */
> +               while (old_val & KVM_RMAP_LOCKED) {
> +                       old_val = atomic_long_read(&rmap_head->val);
> +                       cpu_relax();
> +               }
> +
> +               /*
> +                * Recheck for an empty rmap, it may have been purged by the
> +                * task that held the lock.
> +                */
> +               if (!old_val)
> +                       return 0;
> +
> +               new_val = old_val | KVM_RMAP_LOCKED;
> +       /*
> +        * Use try_cmpxchg_acquire to prevent reads and writes to the rmap
> +        * from being reordered outside of the critical section created by
> +        * __kvm_rmap_lock.
> +        *
> +        * Pairs with smp_store_release in kvm_rmap_unlock.
> +        *
> +        * For the !old_val case, no ordering is needed, as there is no rmap
> +        * to walk.
> +        */
> +       } while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));
> +
> +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> +       return old_val;
> +}
> +
> +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> +                           unsigned long new_val)
> +{
> +       WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
> +       /*
> +        * Ensure that all accesses to the rmap have completed
> +        * before we actually unlock the rmap.
> +        *
> +        * Pairs with the atomic_long_try_cmpxchg_acquire in __kvm_rmap_lock.
> +        */
> +       atomic_long_set_release(&rmap_head->val, new_val);
> +}
> +
> +static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
> +{
> +       return atomic_long_read(&rmap_head->val) & ~KVM_RMAP_LOCKED;
> +}
> +
> +/*
> + * If mmu_lock isn't held, rmaps can only locked in read-only mode.  The actual
> + * locking is the same, but the caller is disallowed from modifying the rmap,
> + * and so the unlock flow is a nop if the rmap is/was empty.
> + */
> +__maybe_unused
> +static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
> +{
> +       return __kvm_rmap_lock(rmap_head);
> +}
> +
> +__maybe_unused
> +static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
> +                                    unsigned long old_val)
> +{
> +       if (!old_val)
> +               return;
> +
> +       KVM_MMU_WARN_ON(old_val != kvm_rmap_get(rmap_head));
> +       atomic_long_set(&rmap_head->val, old_val);

Trying not to unnecessarily extend the conversion we already had about
memory ordering here[1]....

I'm pretty sure this should actually be atomic_long_set_release(),
just like kvm_rmap_unlock(), as we cannot permit (at least) the
compiler to reorder rmap reads past this atomic store.

I *think* I mistakenly thought it was okay to leave it as
atomic_long_set() because this routine is only reading, but of course,
those reads must stay within the critical section.

Anyway, I've refactored it like this:

static void __kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
                              unsigned long val)
{
        KVM_MMU_WARN_ON(val & KVM_RMAP_LOCKED);
        /*
         * Ensure that all accesses to the rmap have completed
         * before we actually unlock the rmap.
         *
         * Pairs with the atomic_long_try_cmpxchg_acquire in __kvm_rmap_lock.
         */
        atomic_long_set_release(&rmap_head->val, val);
}

static void kvm_rmap_unlock(struct kvm *kvm,
                            struct kvm_rmap_head *rmap_head,
                            unsigned long new_val)
{
        lockdep_assert_held_write(&kvm->mmu_lock);

        __kvm_rmap_unlock(rmap_head, new_val);
}

static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
                                     unsigned long old_val)
{
        if (!old_val)
                return;

        KVM_MMU_WARN_ON(old_val != kvm_rmap_get(rmap_head));

        __kvm_rmap_unlock(rmap_head, old_val);
        preempt_enable();
}

It's still true that the !old_val case needs no such ordering, as
!old_val means there is nothing to walk.

[1]: https://lore.kernel.org/all/ZuG4YYzozOddPRCm@google.com/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v8 09/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (7 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2024-11-05 18:43 ` [PATCH v8 10/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns James Houghton
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

From: Sean Christopherson <seanjc@google.com>

Add a lockless version of for_each_rmap_spte(), which is pretty much the
same as the normal version, except that it doesn't BUG() the host if a
non-present SPTE is encountered.  When mmu_lock is held, it should be
impossible for a different task to zap a SPTE, _and_ zapped SPTEs must
be removed from their rmap chain prior to dropping mmu_lock.  Thus, the
normal walker BUG()s if a non-present SPTE is encountered as something is
wildly broken.

When walking rmaps without holding mmu_lock, the SPTEs pointed at by the
rmap chain can be zapped/dropped, and so a lockless walk can observe a
non-present SPTE if it runs concurrently with a different operation that
is zapping SPTEs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 75 +++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1cdb77df0a4d..71019762a28a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -870,7 +870,7 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  */
 #define KVM_RMAP_LOCKED	BIT(1)
 
-static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
 {
 	unsigned long old_val, new_val;
 
@@ -914,14 +914,25 @@ static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
 	 */
 	} while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));
 
-	/* Return the old value, i.e. _without_ the LOCKED bit set. */
+	/*
+	 * Return the old value, i.e. _without_ the LOCKED bit set.  It's
+	 * impossible for the return value to be 0 (see above), i.e. the read-
+	 * only unlock flow can't get a false positive and fail to unlock.
+	 */
 	return old_val;
 }
 
+static unsigned long kvm_rmap_lock(struct kvm *kvm,
+				   struct kvm_rmap_head *rmap_head)
+{
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	return __kvm_rmap_lock(rmap_head);
+}
+
 static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
 			    unsigned long new_val)
 {
-	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	KVM_MMU_WARN_ON(new_val & KVM_RMAP_LOCKED);
 	/*
 	 * Ensure that all accesses to the rmap have completed
 	 * before we actually unlock the rmap.
@@ -961,14 +972,14 @@ static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
-static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
-			struct kvm_rmap_head *rmap_head)
+static int pte_list_add(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
+			u64 *spte, struct kvm_rmap_head *rmap_head)
 {
 	unsigned long old_val, new_val;
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	old_val = kvm_rmap_lock(rmap_head);
+	old_val = kvm_rmap_lock(kvm, rmap_head);
 
 	if (!old_val) {
 		new_val = (unsigned long)spte;
@@ -1048,7 +1059,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = kvm_rmap_lock(rmap_head);
+	rmap_val = kvm_rmap_lock(kvm, rmap_head);
 	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
 		goto out;
 
@@ -1092,7 +1103,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = kvm_rmap_lock(rmap_head);
+	rmap_val = kvm_rmap_lock(kvm, rmap_head);
 	if (!rmap_val)
 		return false;
 
@@ -1184,23 +1195,18 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
 	unsigned long rmap_val = kvm_rmap_get(rmap_head);
-	u64 *sptep;
 
 	if (!rmap_val)
 		return NULL;
 
 	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_val;
-		goto out;
+		return (u64 *)rmap_val;
 	}
 
 	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
-	sptep = iter->desc->sptes[iter->pos];
-out:
-	BUG_ON(!is_shadow_present_pte(*sptep));
-	return sptep;
+	return iter->desc->sptes[iter->pos];
 }
 
 /*
@@ -1210,14 +1216,11 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
  */
 static u64 *rmap_get_next(struct rmap_iterator *iter)
 {
-	u64 *sptep;
-
 	if (iter->desc) {
 		if (iter->pos < PTE_LIST_EXT - 1) {
 			++iter->pos;
-			sptep = iter->desc->sptes[iter->pos];
-			if (sptep)
-				goto out;
+			if (iter->desc->sptes[iter->pos])
+				return iter->desc->sptes[iter->pos];
 		}
 
 		iter->desc = iter->desc->more;
@@ -1225,20 +1228,24 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
 		if (iter->desc) {
 			iter->pos = 0;
 			/* desc->sptes[0] cannot be NULL */
-			sptep = iter->desc->sptes[iter->pos];
-			goto out;
+			return iter->desc->sptes[iter->pos];
 		}
 	}
 
 	return NULL;
-out:
-	BUG_ON(!is_shadow_present_pte(*sptep));
-	return sptep;
 }
 
-#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_)			\
-	for (_spte_ = rmap_get_first(_rmap_head_, _iter_);		\
-	     _spte_; _spte_ = rmap_get_next(_iter_))
+#define __for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)	\
+	for (_sptep_ = rmap_get_first(_rmap_head_, _iter_);	\
+	     _sptep_; _sptep_ = rmap_get_next(_iter_))
+
+#define for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)			\
+	__for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)			\
+		if (!WARN_ON_ONCE(!is_shadow_present_pte(*(_sptep_))))	\
+
+#define for_each_rmap_spte_lockless(_rmap_head_, _iter_, _sptep_, _spte_)	\
+	__for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)			\
+		if (is_shadow_present_pte(_spte_ = mmu_spte_get_lockless(sptep)))
 
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
@@ -1324,12 +1331,13 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	struct rmap_iterator iter;
 	bool flush = false;
 
-	for_each_rmap_spte(rmap_head, &iter, sptep)
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
 		if (spte_ad_need_write_protect(*sptep))
 			flush |= test_and_clear_bit(PT_WRITABLE_SHIFT,
 						    (unsigned long *)sptep);
 		else
 			flush |= spte_clear_dirty(sptep);
+	}
 
 	return flush;
 }
@@ -1650,7 +1658,7 @@ static void __rmap_add(struct kvm *kvm,
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-	rmap_count = pte_list_add(cache, spte, rmap_head);
+	rmap_count = pte_list_add(kvm, cache, spte, rmap_head);
 
 	if (rmap_count > kvm->stat.max_mmu_rmap_size)
 		kvm->stat.max_mmu_rmap_size = rmap_count;
@@ -1796,13 +1804,14 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
-static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
+static void mmu_page_add_parent_pte(struct kvm *kvm,
+				    struct kvm_mmu_memory_cache *cache,
 				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
 	if (!parent_pte)
 		return;
 
-	pte_list_add(cache, parent_pte, &sp->parent_ptes);
+	pte_list_add(kvm, cache, parent_pte, &sp->parent_ptes);
 }
 
 static void mmu_page_remove_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
@@ -2492,7 +2501,7 @@ static void __link_shadow_page(struct kvm *kvm,
 
 	mmu_spte_set(sptep, spte);
 
-	mmu_page_add_parent_pte(cache, sp, sptep);
+	mmu_page_add_parent_pte(kvm, cache, sp, sptep);
 
 	/*
 	 * The non-direct sub-pagetable must be updated before linking.  For
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v8 10/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (8 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 09/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2024-11-05 18:43 ` [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
  2024-11-05 19:21 ` [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly Yu Zhao
  11 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

From: Sean Christopherson <seanjc@google.com>

When A/D bits are supported on sptes, it is safe to simply clear the
Accessed bits.

The less obvious case is marking sptes for access tracking in the
non-A/D case (for EPT only). In this case, we have to be sure that it is
okay for TLB entries to exist for non-present sptes. For example, when
doing dirty tracking, if we come across a non-present SPTE, we need to
know that we need to do a TLB invalidation.

This case is already supported today (as we already support *not* doing
TLBIs for clear_young(); there is a separate notifier for clearing *and*
flushing, clear_flush_young()). This works today because GET_DIRTY_LOG
flushes the TLB before returning to userspace.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 72 +++++++++++++++++++++++-------------------
 1 file changed, 39 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 71019762a28a..bdd6abf9b44e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -952,13 +952,11 @@ static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
  * locking is the same, but the caller is disallowed from modifying the rmap,
  * and so the unlock flow is a nop if the rmap is/was empty.
  */
-__maybe_unused
 static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
 {
 	return __kvm_rmap_lock(rmap_head);
 }
 
-__maybe_unused
 static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
 				     unsigned long old_val)
 {
@@ -1677,37 +1675,48 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 }
 
 static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
-				   struct kvm_gfn_range *range, bool test_only)
+				   struct kvm_gfn_range *range,
+				   bool test_only)
 {
-	struct slot_rmap_walk_iterator iterator;
+	struct kvm_rmap_head *rmap_head;
 	struct rmap_iterator iter;
+	unsigned long rmap_val;
 	bool young = false;
 	u64 *sptep;
+	gfn_t gfn;
+	int level;
+	u64 spte;
 
-	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
-				 range->start, range->end - 1, &iterator) {
-		for_each_rmap_spte(iterator.rmap, &iter, sptep) {
-			u64 spte = *sptep;
+	for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		for (gfn = range->start; gfn < range->end;
+		     gfn += KVM_PAGES_PER_HPAGE(level)) {
+			rmap_head = gfn_to_rmap(gfn, level, range->slot);
+			rmap_val = kvm_rmap_lock_readonly(rmap_head);
 
-			if (!is_accessed_spte(spte))
-				continue;
+			for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
+				if (!is_accessed_spte(spte))
+					continue;
+
+				if (test_only) {
+					kvm_rmap_unlock_readonly(rmap_head, rmap_val);
+					return true;
+				}
 
-			if (test_only)
-				return true;
-
-			if (spte_ad_enabled(spte)) {
-				clear_bit((ffs(shadow_accessed_mask) - 1),
-					(unsigned long *)sptep);
-			} else {
-				/*
-				 * WARN if mmu_spte_update() signals the need
-				 * for a TLB flush, as Access tracking a SPTE
-				 * should never trigger an _immediate_ flush.
-				 */
-				spte = mark_spte_for_access_track(spte);
-				WARN_ON_ONCE(mmu_spte_update(sptep, spte));
+				if (spte_ad_enabled(spte))
+					clear_bit((ffs(shadow_accessed_mask) - 1),
+						  (unsigned long *)sptep);
+				else
+					/*
+					 * If the following cmpxchg fails, the
+					 * spte is being concurrently modified
+					 * and should most likely stay young.
+					 */
+					cmpxchg64(sptep, spte,
+					      mark_spte_for_access_track(spte));
+				young = true;
 			}
-			young = true;
+
+			kvm_rmap_unlock_readonly(rmap_head, rmap_val);
 		}
 	}
 	return young;
@@ -1725,11 +1734,8 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
 
-	if (kvm_has_shadow_mmu_sptes(kvm)) {
-		write_lock(&kvm->mmu_lock);
+	if (kvm_has_shadow_mmu_sptes(kvm))
 		young |= kvm_rmap_age_gfn_range(kvm, range, false);
-		write_unlock(&kvm->mmu_lock);
-	}
 
 	return young;
 }
@@ -1741,11 +1747,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
 
-	if (!young && kvm_has_shadow_mmu_sptes(kvm)) {
-		write_lock(&kvm->mmu_lock);
+	if (young)
+		return young;
+
+	if (kvm_has_shadow_mmu_sptes(kvm))
 		young |= kvm_rmap_age_gfn_range(kvm, range, true);
-		write_unlock(&kvm->mmu_lock);
-	}
 
 	return young;
 }
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (9 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 10/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns James Houghton
@ 2024-11-05 18:43 ` James Houghton
  2025-01-11  0:12   ` Sean Christopherson
  2025-01-11  0:21   ` Yosry Ahmed
  2024-11-05 19:21 ` [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly Yu Zhao
  11 siblings, 2 replies; 41+ messages in thread
From: James Houghton @ 2024-11-05 18:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

This test now has two modes of operation:
1. (default) To check how much vCPU performance was affected by access
             tracking (previously existed, now supports MGLRU aging).
2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
        faulting in memory.

Mode (1) also serves as a way to verify that aging is working properly
for pages only accessed by KVM.  It will fail if one does not have the
0x8 lru_gen feature bit.

To support MGLRU, the test creates a memory cgroup, moves itself into
it, then uses the lru_gen debugfs output to track memory in that cgroup.
The logic to parse the lru_gen debugfs output has been put into
selftests/kvm/lib/lru_gen_util.c.

Co-developed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 366 ++++++++++++++--
 .../selftests/kvm/include/lru_gen_util.h      |  55 +++
 .../testing/selftests/kvm/lib/lru_gen_util.c  | 391 ++++++++++++++++++
 4 files changed, 783 insertions(+), 30 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index f186888f0e00..542548e6e8ba 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -22,6 +22,7 @@ LIBKVM += lib/elf.c
 LIBKVM += lib/guest_modes.c
 LIBKVM += lib/io.c
 LIBKVM += lib/kvm_util.c
+LIBKVM += lib/lru_gen_util.c
 LIBKVM += lib/memstress.c
 LIBKVM += lib/guest_sprintf.c
 LIBKVM += lib/rbtree.c
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..8d6c2ce4b98a 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -38,6 +38,7 @@
 #include <inttypes.h>
 #include <limits.h>
 #include <pthread.h>
+#include <stdio.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -47,6 +48,19 @@
 #include "memstress.h"
 #include "guest_modes.h"
 #include "processor.h"
+#include "lru_gen_util.h"
+
+static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
+static const int LRU_GEN_ENABLED = 0x1;
+static const int LRU_GEN_MM_WALK = 0x2;
+static const char *CGROUP_PROCS = "cgroup.procs";
+/*
+ * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy
+ * is mounted at cgroup_root.
+ *
+ * Can be changed with -r.
+ */
+static const char *cgroup_root = "/sys/fs/cgroup";
 
 /* Global variable used to synchronize all of the vCPU threads. */
 static int iteration;
@@ -62,6 +76,9 @@ static enum {
 /* The iteration that was last completed by each vCPU. */
 static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
 
+/* The time at which the last iteration was completed */
+static struct timespec vcpu_last_completed_time[KVM_MAX_VCPUS];
+
 /* Whether to overlap the regions of memory vCPUs access. */
 static bool overlap_memory_access;
 
@@ -74,6 +91,12 @@ struct test_params {
 
 	/* The number of vCPUs to create in the VM. */
 	int nr_vcpus;
+
+	/* Whether to use lru_gen aging instead of idle page tracking. */
+	bool lru_gen;
+
+	/* Whether to test the performance of aging itself. */
+	bool benchmark_lru_gen;
 };
 
 static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
@@ -89,6 +112,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
 
 }
 
+static void write_file_long(const char *path, long v)
+{
+	FILE *f;
+
+	f = fopen(path, "w");
+	TEST_ASSERT(f, "fopen(%s) failed", path);
+	TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
+		    "fprintf to %s failed", path);
+	TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
+}
+
+static char *path_join(const char *parent, const char *child)
+{
+	char *out = NULL;
+
+	return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
+}
+
+static char *memcg_path(const char *memcg)
+{
+	return path_join(cgroup_root, memcg);
+}
+
+static char *memcg_file_path(const char *memcg, const char *file)
+{
+	char *mp = memcg_path(memcg);
+	char *fp;
+
+	if (!mp)
+		return NULL;
+	fp = path_join(mp, file);
+	free(mp);
+	return fp;
+}
+
+static void move_to_memcg(const char *memcg, pid_t pid)
+{
+	char *procs = memcg_file_path(memcg, CGROUP_PROCS);
+
+	TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
+	write_file_long(procs, pid);
+	free(procs);
+}
+
 #define PAGEMAP_PRESENT (1ULL << 63)
 #define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)
 
@@ -242,6 +309,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
 		};
 
 		vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
+		clock_gettime(CLOCK_MONOTONIC,
+			      &vcpu_last_completed_time[vcpu_idx]);
 	}
 }
 
@@ -253,38 +322,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int target_iteration)
 	}
 }
 
+static bool all_vcpus_done(int target_iteration, int nr_vcpus)
+{
+	for (int i = 0; i < nr_vcpus; ++i)
+		if (READ_ONCE(vcpu_last_completed_iteration[i]) !=
+		    target_iteration)
+			return false;
+
+	return true;
+}
+
 /* The type of memory accesses to perform in the VM. */
 enum access_type {
 	ACCESS_READ,
 	ACCESS_WRITE,
 };
 
-static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description)
+static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description,
+			  bool wait)
 {
-	struct timespec ts_start;
-	struct timespec ts_elapsed;
 	int next_iteration, i;
 
 	/* Kick off the vCPUs by incrementing iteration. */
 	next_iteration = ++iteration;
 
-	clock_gettime(CLOCK_MONOTONIC, &ts_start);
-
 	/* Wait for all vCPUs to finish the iteration. */
-	for (i = 0; i < nr_vcpus; i++)
-		spin_wait_for_vcpu(i, next_iteration);
+	if (wait) {
+		struct timespec ts_start;
+		struct timespec ts_elapsed;
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_start);
 
-	ts_elapsed = timespec_elapsed(ts_start);
-	pr_info("%-30s: %ld.%09lds\n",
-		description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+		for (i = 0; i < nr_vcpus; i++)
+			spin_wait_for_vcpu(i, next_iteration);
+
+		ts_elapsed = timespec_elapsed(ts_start);
+
+		pr_info("%-30s: %ld.%09lds\n",
+			description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+	} else
+		pr_info("%-30s\n", description);
 }
 
-static void access_memory(struct kvm_vm *vm, int nr_vcpus,
-			  enum access_type access, const char *description)
+static void _access_memory(struct kvm_vm *vm, int nr_vcpus,
+			   enum access_type access, const char *description,
+			   bool wait)
 {
 	memstress_set_write_percent(vm, (access == ACCESS_READ) ? 0 : 100);
 	iteration_work = ITERATION_ACCESS_MEMORY;
-	run_iteration(vm, nr_vcpus, description);
+	run_iteration(vm, nr_vcpus, description, wait);
+}
+
+static void access_memory(struct kvm_vm *vm, int nr_vcpus,
+			  enum access_type access, const char *description)
+{
+	return _access_memory(vm, nr_vcpus, access, description, true);
+}
+
+static void access_memory_async(struct kvm_vm *vm, int nr_vcpus,
+				enum access_type access,
+				const char *description)
+{
+	return _access_memory(vm, nr_vcpus, access, description, false);
 }
 
 static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
@@ -297,19 +396,115 @@ static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
 	 */
 	pr_debug("Marking VM memory idle (slow)...\n");
 	iteration_work = ITERATION_MARK_IDLE;
-	run_iteration(vm, nr_vcpus, "Mark memory idle");
+	run_iteration(vm, nr_vcpus, "Mark memory idle", true);
 }
 
-static void run_test(enum vm_guest_mode mode, void *arg)
+static void create_memcg(const char *memcg)
+{
+	const char *full_memcg_path = memcg_path(memcg);
+	int ret;
+
+	TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path");
+retry:
+	ret = mkdir(full_memcg_path, 0755);
+	if (ret && errno == EEXIST) {
+		TEST_ASSERT(!rmdir(full_memcg_path),
+			    "Found existing memcg at %s, but rmdir failed",
+			    full_memcg_path);
+		goto retry;
+	}
+	TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed",
+		    full_memcg_path);
+
+	pr_info("Created memcg at %s\n", full_memcg_path);
+}
+
+/*
+ * Test lru_gen aging speed while vCPUs are faulting memory in.
+ *
+ * This test will run lru_gen aging until the vCPUs have finished all of
+ * the faulting work, reporting:
+ *  - vcpu wall time (wall time for slowest vCPU)
+ *  - average aging pass duration
+ *  - total number of aging passes
+ *  - total time spent aging
+ *
+ * This test produces the most useful results when the vcpu wall time and the
+ * total time spent aging are similar (i.e., we want to avoid timing aging
+ * while the vCPUs aren't doing any work).
+ */
+static void run_benchmark(enum vm_guest_mode mode, struct kvm_vm *vm,
+			  struct test_params *params)
 {
-	struct test_params *params = arg;
-	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;
+	struct memcg_stats stats;
+	struct timespec ts_start, ts_max, ts_vcpus_elapsed,
+			ts_aging_elapsed, ts_aging_elapsed_avg;
+	int num_passes = 0;
 
-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
-				 params->backing_src, !overlap_memory_access);
+	printf("Running lru_gen benchmark...\n");
 
-	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+	access_memory_async(vm, nr_vcpus, ACCESS_WRITE,
+			    "Populating memory (async)");
+	while (!all_vcpus_done(iteration, nr_vcpus)) {
+		lru_gen_do_aging_quiet(&stats, TEST_MEMCG_NAME);
+		++num_passes;
+	}
+
+	ts_aging_elapsed = timespec_elapsed(ts_start);
+	ts_aging_elapsed_avg = timespec_div(ts_aging_elapsed, num_passes);
+
+	/* Find out when the slowest vCPU finished. */
+	ts_max = ts_start;
+	for (int i = 0; i < nr_vcpus; ++i) {
+		struct timespec *vcpu_ts = &vcpu_last_completed_time[i];
+
+		if (ts_max.tv_sec < vcpu_ts->tv_sec ||
+		    (ts_max.tv_sec == vcpu_ts->tv_sec  &&
+		     ts_max.tv_nsec < vcpu_ts->tv_nsec))
+			ts_max = *vcpu_ts;
+	}
+
+	ts_vcpus_elapsed = timespec_sub(ts_max, ts_start);
+
+	pr_info("%-30s: %ld.%09lds\n", "vcpu wall time",
+		ts_vcpus_elapsed.tv_sec, ts_vcpus_elapsed.tv_nsec);
+
+	pr_info("%-30s: %ld.%09lds, (passes:%d, total:%ld.%09lds)\n",
+		"lru_gen avg pass duration",
+		ts_aging_elapsed_avg.tv_sec,
+		ts_aging_elapsed_avg.tv_nsec,
+		num_passes,
+		ts_aging_elapsed.tv_sec,
+		ts_aging_elapsed.tv_nsec);
+}
+
+/*
+ * Test how much access tracking affects vCPU performance.
+ *
+ * Supports two modes of access tracking:
+ * - idle page tracking
+ * - lru_gen aging
+ *
+ * When using lru_gen, this test additionally verifies that the pages are in
+ * fact getting younger and older, otherwise the performance data would be
+ * invalid.
+ *
+ * The forced lru_gen aging can race with aging that occurs naturally.
+ */
+static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm,
+		     struct test_params *params)
+{
+	int nr_vcpus = params->nr_vcpus;
+	bool lru_gen = params->lru_gen;
+	struct memcg_stats stats;
+	// If guest_page_size is larger than the host's page size, the
+	// guest (memstress) will only fault in a subset of the host's pages.
+	long total_pages = nr_vcpus * params->vcpu_memory_bytes /
+			   max(memstress_args.guest_page_size,
+			       (uint64_t)getpagesize());
+	int found_gens[5];
 
 	pr_info("\n");
 	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
@@ -319,11 +514,78 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
 
 	/* Repeat on memory that has been marked as idle. */
-	mark_memory_idle(vm, nr_vcpus);
+	if (lru_gen) {
+		/* Do an initial page table scan */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+		TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
+		  "Not all pages tracked in lru_gen stats.\n"
+		  "Is lru_gen enabled? Did the memcg get created properly?");
+
+		/* Find the generation we're currently in (probably youngest) */
+		found_gens[0] = lru_gen_find_generation(&stats, total_pages);
+
+		/* Do an aging pass now */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+		/* Same generation, but a newer generation has been made */
+		found_gens[1] = lru_gen_find_generation(&stats, total_pages);
+		TEST_ASSERT(found_gens[1] == found_gens[0],
+			    "unexpected gen change: %d vs. %d",
+			    found_gens[1], found_gens[0]);
+	} else
+		mark_memory_idle(vm, nr_vcpus);
+
 	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
-	mark_memory_idle(vm, nr_vcpus);
+
+	if (lru_gen) {
+		/* Scan the page tables again */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+		/* The pages should now be young again, so in a newer generation */
+		found_gens[2] = lru_gen_find_generation(&stats, total_pages);
+		TEST_ASSERT(found_gens[2] > found_gens[1],
+			    "pages did not get younger");
+
+		/* Do another aging pass */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+		/* Same generation; new generation has been made */
+		found_gens[3] = lru_gen_find_generation(&stats, total_pages);
+		TEST_ASSERT(found_gens[3] == found_gens[2],
+			    "unexpected gen change: %d vs. %d",
+			    found_gens[3], found_gens[2]);
+	} else
+		mark_memory_idle(vm, nr_vcpus);
+
 	access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
 
+	if (lru_gen) {
+		/* Scan the pages tables again */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+		/* The pages should now be young again, so in a newer generation */
+		found_gens[4] = lru_gen_find_generation(&stats, total_pages);
+		TEST_ASSERT(found_gens[4] > found_gens[3],
+			    "pages did not get younger");
+	}
+}
+
+static void setup_vm_and_run(enum vm_guest_mode mode, void *arg)
+{
+	struct test_params *params = arg;
+	int nr_vcpus = params->nr_vcpus;
+	struct kvm_vm *vm;
+
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+				 params->backing_src, !overlap_memory_access);
+
+	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+
+	if (params->benchmark_lru_gen)
+		run_benchmark(mode, vm, params);
+	else
+		run_test(mode, vm, params);
+
 	memstress_join_vcpu_threads(nr_vcpus);
 	memstress_destroy_vm(vm);
 }
@@ -331,8 +593,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
-	       name);
+	printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]"
+	       " [-s mem_type] [-l] [-r memcg_root]\n", name);
 	puts("");
 	printf(" -h: Display this help message.");
 	guest_modes_help();
@@ -342,6 +604,9 @@ static void help(char *name)
 	printf(" -v: specify the number of vCPUs to run.\n");
 	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
 	       "     them into a separate region of memory for each vCPU.\n");
+	printf(" -l: Use MGLRU aging instead of idle page tracking\n");
+	printf(" -p: Benchmark MGLRU aging while faulting memory in\n");
+	printf(" -r: The memory cgroup hierarchy root to use (when -l is given)\n");
 	backing_src_help("-s");
 	puts("");
 	exit(0);
@@ -353,13 +618,15 @@ int main(int argc, char *argv[])
 		.backing_src = DEFAULT_VM_MEM_SRC,
 		.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
 		.nr_vcpus = 1,
+		.lru_gen = false,
+		.benchmark_lru_gen = false,
 	};
 	int page_idle_fd;
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+	while ((opt = getopt(argc, argv, "hm:b:v:os:lr:p")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -376,6 +643,15 @@ int main(int argc, char *argv[])
 		case 's':
 			params.backing_src = parse_backing_src_type(optarg);
 			break;
+		case 'l':
+			params.lru_gen = true;
+			break;
+		case 'p':
+			params.benchmark_lru_gen = true;
+			break;
+		case 'r':
+			cgroup_root = strdup(optarg);
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -383,12 +659,42 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
-	__TEST_REQUIRE(page_idle_fd >= 0,
-		       "CONFIG_IDLE_PAGE_TRACKING is not enabled");
-	close(page_idle_fd);
+	if (!params.lru_gen) {
+		page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+		__TEST_REQUIRE(page_idle_fd >= 0,
+			       "CONFIG_IDLE_PAGE_TRACKING is not enabled");
+		close(page_idle_fd);
+	} else {
+		int lru_gen_fd, lru_gen_debug_fd;
+		long mglru_features;
+		char mglru_feature_str[8] = {};
+
+		lru_gen_fd = open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY);
+		__TEST_REQUIRE(lru_gen_fd >= 0,
+			       "CONFIG_LRU_GEN is not enabled");
+		TEST_ASSERT(read(lru_gen_fd, &mglru_feature_str, 7) > 0,
+				 "couldn't read lru_gen features");
+		mglru_features = strtol(mglru_feature_str, NULL, 16);
+		__TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED,
+			       "lru_gen is not enabled");
+		__TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK,
+			       "lru_gen does not support MM_WALK");
+
+		lru_gen_debug_fd = open(DEBUGFS_LRU_GEN, O_RDWR);
+		__TEST_REQUIRE(lru_gen_debug_fd >= 0,
+				"Cannot access %s", DEBUGFS_LRU_GEN);
+		close(lru_gen_debug_fd);
+	}
+
+	TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen,
+		    "-p specified without -l");
+
+	if (params.lru_gen) {
+		create_memcg(TEST_MEMCG_NAME);
+		move_to_memcg(TEST_MEMCG_NAME, getpid());
+	}
 
-	for_each_guest_mode(run_test, &params);
+	for_each_guest_mode(setup_vm_and_run, &params);
 
 	return 0;
 }
diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/testing/selftests/kvm/include/lru_gen_util.h
new file mode 100644
index 000000000000..4eef8085a3cb
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/lru_gen_util.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Tools for integrating with lru_gen, like parsing the lru_gen debugfs output.
+ *
+ * Copyright (C) 2024, Google LLC.
+ */
+#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H
+#define SELFTEST_KVM_LRU_GEN_UTIL_H
+
+#include <inttypes.h>
+#include <limits.h>
+#include <stdlib.h>
+
+#include "test_util.h"
+
+#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */
+#define MAX_NR_NODES 4 /* Maximum number of nodes we support */
+
+static const char *DEBUGFS_LRU_GEN = "/sys/kernel/debug/lru_gen";
+
+struct generation_stats {
+	int gen;
+	long age_ms;
+	long nr_anon;
+	long nr_file;
+};
+
+struct node_stats {
+	int node;
+	int nr_gens; /* Number of populated gens entries. */
+	struct generation_stats gens[MAX_NR_GENS];
+};
+
+struct memcg_stats {
+	unsigned long memcg_id;
+	int nr_nodes; /* Number of populated nodes entries. */
+	struct node_stats nodes[MAX_NR_NODES];
+};
+
+void print_memcg_stats(const struct memcg_stats *stats, const char *name);
+
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+long sum_memcg_stats(const struct memcg_stats *stats);
+
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg);
+
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg);
+
+int lru_gen_find_generation(const struct memcg_stats *stats,
+			    unsigned long total_pages);
+
+#endif /* SELFTEST_KVM_LRU_GEN_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/lru_gen_util.c b/tools/testing/selftests/kvm/lib/lru_gen_util.c
new file mode 100644
index 000000000000..3c02a635a9f7
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/lru_gen_util.c
@@ -0,0 +1,391 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024, Google LLC.
+ */
+
+#include <time.h>
+
+#include "lru_gen_util.h"
+
+/*
+ * Tracks state while we parse memcg lru_gen stats. The file we're parsing is
+ * structured like this (some extra whitespace elided):
+ *
+ * memcg (id) (path)
+ * node (id)
+ * (gen_nr) (age_in_ms) (nr_anon_pages) (nr_file_pages)
+ */
+struct memcg_stats_parse_context {
+	bool consumed; /* Whether or not this line was consumed */
+	/* Next parse handler to invoke */
+	void (*next_handler)(struct memcg_stats *,
+			     struct memcg_stats_parse_context *, char *);
+	int current_node_idx; /* Current index in nodes array */
+	const char *name; /* The name of the memcg we're looking for */
+};
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+					 struct memcg_stats_parse_context *ctx,
+					 char *line);
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+					struct memcg_stats_parse_context *ctx,
+					char *line);
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+				       struct memcg_stats_parse_context *ctx,
+				       char *line);
+
+struct split_iterator {
+	char *str;
+	char *save;
+};
+
+static char *split_next(struct split_iterator *it)
+{
+	char *ret = strtok_r(it->str, " \t\n\r", &it->save);
+
+	it->str = NULL;
+	return ret;
+}
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+					 struct memcg_stats_parse_context *ctx,
+					 char *line)
+{
+	struct split_iterator it = { .str = line };
+	char *prefix = split_next(&it);
+	char *memcg_id = split_next(&it);
+	char *memcg_name = split_next(&it);
+	char *end;
+
+	ctx->consumed = true;
+
+	if (!prefix || strcmp("memcg", prefix))
+		return; /* Not a memcg line (maybe empty), skip */
+
+	TEST_ASSERT(memcg_id && memcg_name,
+		    "malformed memcg line; no memcg id or memcg_name");
+
+	if (strcmp(memcg_name + 1, ctx->name))
+		return; /* Wrong memcg, skip */
+
+	/* Found it! */
+
+	stats->memcg_id = strtoul(memcg_id, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed memcg id '%s'", memcg_id);
+	if (!stats->memcg_id)
+		return; /* Removed memcg? */
+
+	ctx->next_handler = memcg_stats_handle_in_memcg;
+}
+
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+					struct memcg_stats_parse_context *ctx,
+					char *line)
+{
+	struct split_iterator it = { .str = line };
+	char *prefix = split_next(&it);
+	char *id = split_next(&it);
+	long found_node_id;
+	char *end;
+
+	ctx->consumed = true;
+	ctx->current_node_idx = -1;
+
+	if (!prefix)
+		return; /* Skip empty lines */
+
+	if (!strcmp("memcg", prefix)) {
+		/* Memcg done, found next one; stop. */
+		ctx->next_handler = NULL;
+		return;
+	} else if (strcmp("node", prefix))
+		TEST_ASSERT(false, "found malformed line after 'memcg ...',"
+				   "token: '%s'", prefix);
+
+	/* At this point we know we have a node line. Parse the ID. */
+
+	TEST_ASSERT(id, "malformed node line; no node id");
+
+	found_node_id = strtol(id, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed node id '%s'", id);
+
+	ctx->current_node_idx = stats->nr_nodes++;
+	TEST_ASSERT(ctx->current_node_idx < MAX_NR_NODES,
+		    "memcg has stats for too many nodes, max is %d",
+		    MAX_NR_NODES);
+	stats->nodes[ctx->current_node_idx].node = found_node_id;
+
+	ctx->next_handler = memcg_stats_handle_in_node;
+}
+
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+				       struct memcg_stats_parse_context *ctx,
+				       char *line)
+{
+	/* Have to copy since we might not consume */
+	char *my_line = strdup(line);
+	struct split_iterator it = { .str = my_line };
+	char *gen, *age, *nr_anon, *nr_file;
+	struct node_stats *node_stats;
+	struct generation_stats *gen_stats;
+	char *end;
+
+	TEST_ASSERT(it.str, "failed to copy input line");
+
+	gen = split_next(&it);
+
+	/* Skip empty lines */
+	if (!gen)
+		goto out_consume; /* Skip empty lines */
+
+	if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
+		/*
+		 * Reached next memcg or node section. Don't consume, let the
+		 * other handler deal with this.
+		 */
+		ctx->next_handler = memcg_stats_handle_in_memcg;
+		goto out;
+	}
+
+	node_stats = &stats->nodes[ctx->current_node_idx];
+	TEST_ASSERT(node_stats->nr_gens < MAX_NR_GENS,
+		    "found too many generation lines; max is %d",
+		    MAX_NR_GENS);
+	gen_stats = &node_stats->gens[node_stats->nr_gens++];
+
+	age = split_next(&it);
+	nr_anon = split_next(&it);
+	nr_file = split_next(&it);
+
+	TEST_ASSERT(age && nr_anon && nr_file,
+		    "malformed generation line; not enough tokens");
+
+	gen_stats->gen = (int)strtol(gen, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed generation number '%s'", gen);
+
+	gen_stats->age_ms = strtol(age, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed generation age '%s'", age);
+
+	gen_stats->nr_anon = strtol(nr_anon, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed anonymous page count '%s'",
+		    nr_anon);
+
+	gen_stats->nr_file = strtol(nr_file, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed file page count '%s'", nr_file);
+
+out_consume:
+	ctx->consumed = true;
+out:
+	free(my_line);
+}
+
+/* Pretty-print lru_gen @stats. */
+void print_memcg_stats(const struct memcg_stats *stats, const char *name)
+{
+	int node, gen;
+
+	fprintf(stderr, "stats for memcg %s (id %lu):\n",
+			name, stats->memcg_id);
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
+		for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+			const struct generation_stats *gstats =
+				&stats->nodes[node].gens[gen];
+
+			fprintf(stderr,
+				"\t\tgen %d\tage_ms %ld"
+				"\tnr_anon %ld\tnr_file %ld\n",
+				gstats->gen, gstats->age_ms, gstats->nr_anon,
+				gstats->nr_file);
+		}
+	}
+}
+
+/* Re-read lru_gen debugfs information for @memcg into @stats. */
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+	FILE *f;
+	ssize_t read = 0;
+	char *line = NULL;
+	size_t bufsz;
+	struct memcg_stats_parse_context ctx = {
+		.next_handler = memcg_stats_handle_searching,
+		.name = memcg,
+	};
+
+	memset(stats, 0, sizeof(struct memcg_stats));
+
+	f = fopen(DEBUGFS_LRU_GEN, "r");
+	TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+
+	while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
+		ctx.consumed = false;
+
+		do {
+			ctx.next_handler(stats, &ctx, line);
+			if (!ctx.next_handler)
+				break;
+		} while (!ctx.consumed);
+	}
+
+	if (read < 0 && !feof(f))
+		TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
+
+	TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
+		    "Did the memcg get created in the proper mount?",
+		    memcg);
+	if (line)
+		free(line);
+	TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+/*
+ * Find all pages tracked by lru_gen for this memcg in generation @target_gen.
+ *
+ * If @target_gen is negative, look for all generations.
+ */
+static long sum_memcg_stats_for_gen(int target_gen,
+				    const struct memcg_stats *stats)
+{
+	int node, gen;
+	long total_nr = 0;
+
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		const struct node_stats *node_stats = &stats->nodes[node];
+
+		for (gen = 0; gen < node_stats->nr_gens; ++gen) {
+			const struct generation_stats *gen_stats =
+				&node_stats->gens[gen];
+
+			if (target_gen >= 0 && gen_stats->gen != target_gen)
+				continue;
+
+			total_nr += gen_stats->nr_anon + gen_stats->nr_file;
+		}
+	}
+
+	return total_nr;
+}
+
+/* Find all pages tracked by lru_gen for this memcg. */
+long sum_memcg_stats(const struct memcg_stats *stats)
+{
+	return sum_memcg_stats_for_gen(-1, stats);
+}
+
+/* Read the memcg stats and optionally print if this is a debug build. */
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+	read_memcg_stats(stats, memcg);
+#ifdef DEBUG
+	print_memcg_stats(stats, memcg);
+#endif
+}
+
+/*
+ * If lru_gen aging should force page table scanning.
+ *
+ * If you want to set this to false, you will need to do eviction
+ * before doing extra aging passes.
+ */
+static const bool force_scan = true;
+
+static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
+{
+	FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
+	char *command;
+	size_t sz;
+
+	TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+	sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
+		      memcg_id, node_id, max_gen, force_scan);
+	TEST_ASSERT(sz > 0, "creating aging command failed");
+
+	pr_debug("Running aging command: %s", command);
+	if (fwrite(command, sizeof(char), sz, f) < sz) {
+		TEST_ASSERT(false, "writing aging command %s to %s failed",
+			    command, DEBUGFS_LRU_GEN);
+	}
+
+	TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,
+			      bool verbose)
+{
+	int node, gen;
+	struct timespec ts_start;
+	struct timespec ts_elapsed;
+
+	pr_debug("lru_gen: invoking aging...\n");
+
+	/* Must read memcg stats to construct the proper aging command. */
+	read_print_memcg_stats(stats, memcg);
+
+	if (verbose)
+		clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		int max_gen = 0;
+
+		for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+			int this_gen = stats->nodes[node].gens[gen].gen;
+
+			max_gen = max_gen > this_gen ? max_gen : this_gen;
+		}
+
+		run_aging_impl(stats->memcg_id, stats->nodes[node].node,
+			       max_gen);
+	}
+
+	if (verbose) {
+		ts_elapsed = timespec_elapsed(ts_start);
+		pr_info("%-30s: %ld.%09lds\n", "lru_gen: Aging",
+			ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+	}
+
+	/* Re-read so callers get updated information */
+	read_print_memcg_stats(stats, memcg);
+}
+
+/* Do aging, and print how long it took. */
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
+{
+	return _lru_gen_do_aging(stats, memcg, true);
+}
+
+/* Do aging, don't print anything. */
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
+{
+	return _lru_gen_do_aging(stats, memcg, false);
+}
+
+/*
+ * Find which generation contains more than half of @total_pages, assuming that
+ * such a generation exists.
+ */
+int lru_gen_find_generation(const struct memcg_stats *stats,
+			    unsigned long total_pages)
+{
+	int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
+
+	for (node = 0; node < stats->nr_nodes; ++node)
+		for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
+		     ++gen_idx) {
+			gen = stats->nodes[node].gens[gen_idx].gen;
+			max_gen = gen > max_gen ? gen : max_gen;
+			min_gen = gen < min_gen ? gen : min_gen;
+		}
+
+	for (gen = min_gen; gen < max_gen; ++gen)
+		/* See if the most pages are in this generation. */
+		if (sum_memcg_stats_for_gen(gen, stats) >
+				total_pages / 2)
+			return gen;
+
+	TEST_ASSERT(false, "No generation includes majority of %lu pages.",
+		    total_pages);
+
+	/* unreachable, but make the compiler happy */
+	return -1;
+}
-- 
2.47.0.199.ga7371fff76-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test
  2024-11-05 18:43 ` [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
@ 2025-01-11  0:12   ` Sean Christopherson
  2025-02-03 22:46     ` James Houghton
  2025-01-11  0:21   ` Yosry Ahmed
  1 sibling, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2025-01-11  0:12 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

This can/should be posted separately, no?  MGLRU support for secondary MMUs went
through mm/, the KVM changes in this series are all about making KVM x86 faster.

On Tue, Nov 05, 2024, James Houghton wrote:
> This test now has two modes of operation:

Bad Google3, bad!  Write changelogs as commands, i.e. state what the patch is
doing.  The above is also misleading (at least, it was to me), as I assumed one
of the modes would be "legacy" and the other would be MGLRU.  But it looks like
this patch adds MGLRU support *and* benchmarking.

This should be split into at least two patches, possibly three (I can't tell how
much pre-work there is).  I.e. add MGLRU support, and then add the benchmarking
stuff.  And if there's substantial refactoring and/or new utilities, do that first.

> 1. (default) To check how much vCPU performance was affected by access
>              tracking (previously existed, now supports MGLRU aging).
> 2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
>         faulting in memory.
> 
> Mode (1) also serves as a way to verify that aging is working properly
> for pages only accessed by KVM.  It will fail if one does not have the

"one" what?

> 0x8 lru_gen feature bit.
> 
> To support MGLRU, the test creates a memory cgroup, moves itself into
> it, then uses the lru_gen debugfs output to track memory in that cgroup.
> The logic to parse the lru_gen debugfs output has been put into
> selftests/kvm/lib/lru_gen_util.c.
>
> Co-developed-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>

...

> @@ -47,6 +48,19 @@
>  #include "memstress.h"
>  #include "guest_modes.h"
>  #include "processor.h"
> +#include "lru_gen_util.h"
> +
> +static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
> +static const int LRU_GEN_ENABLED = 0x1;
> +static const int LRU_GEN_MM_WALK = 0x2;

Is there really no uAPI #define for these?

> +static const char *CGROUP_PROCS = "cgroup.procs";
> +/*
> + * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy

I would say "requires", not "assumes".

> + * is mounted at cgroup_root.
> + *
> + * Can be changed with -r.

This is amusingly vague.  I vote to omit explicitly referencing the command line
option, we have enough problems maintaining them as it is.  Instead simply say
something like "Default to the kernel's preferred path for mounting cgroups" to
both explain where the default comes from, and to give the reader a hint that the
path can be changed.

Actually, there's zero reason for this to be global.  More below.

Ugh, and if this test is manipulating cgroups, won't it need to root?  I doubt
y'all have tried to run this outside of devrez, i.e. on a system where you're
not logged in as root.

Oh, never mind, this test already effectively requires root.

> +	/* Whether to use lru_gen aging instead of idle page tracking. */
> +	bool lru_gen;

Needs a verb, otherwise this looks like it tracks the LRU generation.  E.g. use_lru_gen.

> +
> +	/* Whether to test the performance of aging itself. */
> +	bool benchmark_lru_gen;
>  };
>  
>  static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
> @@ -89,6 +112,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
>  
>  }
>  
> +static void write_file_long(const char *path, long v)
> +{
> +	FILE *f;
> +
> +	f = fopen(path, "w");
> +	TEST_ASSERT(f, "fopen(%s) failed", path);
> +	TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
> +		    "fprintf to %s failed", path);
> +	TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
> +}
> +
> +static char *path_join(const char *parent, const char *child)
> +{
> +	char *out = NULL;
> +
> +	return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
> +}

These are common utilities, no?  I.e. should be somewhere common, not buried in
this test.

> +
> +static char *memcg_path(const char *memcg)
> +{
> +	return path_join(cgroup_root, memcg);

Eh, do the join when cgroup_root is first defined.  Actually, looking at the
usage more closely, the cgroup path is only used during main().  Just do all of
the joins there, I see no reason to have these one-off helpers.

> +}
> +
> +static char *memcg_file_path(const char *memcg, const char *file)
> +{
> +	char *mp = memcg_path(memcg);
> +	char *fp;
> +
> +	if (!mp)
> +		return NULL;

Returning NULL just so that the one user can assert on !NULL is rather pointless.

> +	fp = path_join(mp, file);
> +	free(mp);
> +	return fp;
> +}
> +
> +static void move_to_memcg(const char *memcg, pid_t pid)
> +{
> +	char *procs = memcg_file_path(memcg, CGROUP_PROCS);
> +
> +	TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
> +	write_file_long(procs, pid);
> +	free(procs);
> +}
> +
>  #define PAGEMAP_PRESENT (1ULL << 63)
>  #define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)
>  
> @@ -242,6 +309,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
>  		};
>  
>  		vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
> +		clock_gettime(CLOCK_MONOTONIC,
> +			      &vcpu_last_completed_time[vcpu_idx]);
>  	}
>  }
>  
> @@ -253,38 +322,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int target_iteration)
>  	}
>  }
>  
> +static bool all_vcpus_done(int target_iteration, int nr_vcpus)
> +{
> +	for (int i = 0; i < nr_vcpus; ++i)

Preferred style is to declare variables outside of loops.

Needs curly braces.

> +		if (READ_ONCE(vcpu_last_completed_iteration[i]) !=
> +		    target_iteration)

Don't wrap.

> +			return false;
> +
> +	return true;
> +}
> +
>  /* The type of memory accesses to perform in the VM. */
>  enum access_type {
>  	ACCESS_READ,
>  	ACCESS_WRITE,
>  };
>  
> -static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description)
> +static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description,
> +			  bool wait)
>  {
> -	struct timespec ts_start;
> -	struct timespec ts_elapsed;
>  	int next_iteration, i;
>  
>  	/* Kick off the vCPUs by incrementing iteration. */
>  	next_iteration = ++iteration;
>  
> -	clock_gettime(CLOCK_MONOTONIC, &ts_start);
> -
>  	/* Wait for all vCPUs to finish the iteration. */
> -	for (i = 0; i < nr_vcpus; i++)
> -		spin_wait_for_vcpu(i, next_iteration);
> +	if (wait) {
> +		struct timespec ts_start;
> +		struct timespec ts_elapsed;
> +
> +		clock_gettime(CLOCK_MONOTONIC, &ts_start);
>  
> -	ts_elapsed = timespec_elapsed(ts_start);
> -	pr_info("%-30s: %ld.%09lds\n",
> -		description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> +		for (i = 0; i < nr_vcpus; i++)
> +			spin_wait_for_vcpu(i, next_iteration);
> +
> +		ts_elapsed = timespec_elapsed(ts_start);
> +
> +		pr_info("%-30s: %ld.%09lds\n",
> +			description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> +	} else

Needs curly braces.

> +		pr_info("%-30s\n", description);
>  }
>  
> -static void access_memory(struct kvm_vm *vm, int nr_vcpus,
> -			  enum access_type access, const char *description)
> +static void _access_memory(struct kvm_vm *vm, int nr_vcpus,

Use double underscores (ignore any precedent in selftests that uses just one,
we're trying to purge that ugliness).

> +			   enum access_type access, const char *description,
> +			   bool wait)
>  {
>  	memstress_set_write_percent(vm, (access == ACCESS_READ) ? 0 : 100);
>  	iteration_work = ITERATION_ACCESS_MEMORY;
> -	run_iteration(vm, nr_vcpus, description);
> +	run_iteration(vm, nr_vcpus, description, wait);
> +}
> +
> +static void access_memory(struct kvm_vm *vm, int nr_vcpus,
> +			  enum access_type access, const char *description)
> +{
> +	return _access_memory(vm, nr_vcpus, access, description, true);
> +}
> +
> +static void access_memory_async(struct kvm_vm *vm, int nr_vcpus,

Maybe "nowait" instead of "async"?  Yeah, something ends up being asynchronous
(presumably), but the call itself is "synchronous", i.e. isn't spun off to a
worker or anything.

> +				enum access_type access,
> +				const char *description)
> +{
> +	return _access_memory(vm, nr_vcpus, access, description, false);
>  }
>  
>  static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
> @@ -297,19 +396,115 @@ static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
>  	 */
>  	pr_debug("Marking VM memory idle (slow)...\n");
>  	iteration_work = ITERATION_MARK_IDLE;
> -	run_iteration(vm, nr_vcpus, "Mark memory idle");
> +	run_iteration(vm, nr_vcpus, "Mark memory idle", true);
>  }
>  
> -static void run_test(enum vm_guest_mode mode, void *arg)
> +static void create_memcg(const char *memcg)
> +{
> +	const char *full_memcg_path = memcg_path(memcg);
> +	int ret;
> +
> +	TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path");
> +retry:
> +	ret = mkdir(full_memcg_path, 0755);
> +	if (ret && errno == EEXIST) {
> +		TEST_ASSERT(!rmdir(full_memcg_path),
> +			    "Found existing memcg at %s, but rmdir failed",
> +			    full_memcg_path);
> +		goto retry;

	while (1) {
		ret = mkdir(full_memcg_path, 0755);
		if (!ret)
			break;

		TEST_ASSERT(errno == EEXIST,
			    Creating the memcg via 'mkdir(%s)' failed",
			    full_memcg_path);

		TEST_ASSERT(!rmdir(full_memcg_path),
			    "Found existing memcg at %s, but rmdir failed",
			    full_memcg_path);
	}

> +	}
> +	TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed",

Heh, so it failed?

> +		    full_memcg_path);
> +
> +	pr_info("Created memcg at %s\n", full_memcg_path);
> +}

...

> +/*
> + * Test how much access tracking affects vCPU performance.
> + *
> + * Supports two modes of access tracking:
> + * - idle page tracking
> + * - lru_gen aging
> + *
> + * When using lru_gen, this test additionally verifies that the pages are in
> + * fact getting younger and older, otherwise the performance data would be
> + * invalid.
> + *
> + * The forced lru_gen aging can race with aging that occurs naturally.
> + */
> +static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm,
> +		     struct test_params *params)
> +{
> +	int nr_vcpus = params->nr_vcpus;
> +	bool lru_gen = params->lru_gen;
> +	struct memcg_stats stats;
> +	// If guest_page_size is larger than the host's page size, the
> +	// guest (memstress) will only fault in a subset of the host's pages.

No C++ comments, please.

> +	long total_pages = nr_vcpus * params->vcpu_memory_bytes /
> +			   max(memstress_args.guest_page_size,
> +			       (uint64_t)getpagesize());

max_t() is probably better.

> +	int found_gens[5];
>  
>  	pr_info("\n");
>  	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
> @@ -319,11 +514,78 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  	access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
>  
>  	/* Repeat on memory that has been marked as idle. */
> -	mark_memory_idle(vm, nr_vcpus);
> +	if (lru_gen) {
> +		/* Do an initial page table scan */
> +		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> +		TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
> +		  "Not all pages tracked in lru_gen stats.\n"
> +		  "Is lru_gen enabled? Did the memcg get created properly?");

Align indentation.

		TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
			    "Not all pages tracked in lru_gen stats.\n"
			    "Is lru_gen enabled? Did the memcg get created properly?");

> +
> +		/* Find the generation we're currently in (probably youngest) */
> +		found_gens[0] = lru_gen_find_generation(&stats, total_pages);
> +
> +		/* Do an aging pass now */
> +		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> +
> +		/* Same generation, but a newer generation has been made */
> +		found_gens[1] = lru_gen_find_generation(&stats, total_pages);
> +		TEST_ASSERT(found_gens[1] == found_gens[0],
> +			    "unexpected gen change: %d vs. %d",
> +			    found_gens[1], found_gens[0]);

I don't have any ideas off the top of my head, but there's gotta be a way to
dedup these blocks.

> +	} else

Needs curly braces.

> +		mark_memory_idle(vm, nr_vcpus);
> +
>  	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
> -	mark_memory_idle(vm, nr_vcpus);
> +
> +	if (lru_gen) {
> +		/* Scan the page tables again */
> +		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> +
> +		/* The pages should now be young again, so in a newer generation */
> +		found_gens[2] = lru_gen_find_generation(&stats, total_pages);
> +		TEST_ASSERT(found_gens[2] > found_gens[1],
> +			    "pages did not get younger");
> +
> +		/* Do another aging pass */
> +		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> +
> +		/* Same generation; new generation has been made */
> +		found_gens[3] = lru_gen_find_generation(&stats, total_pages);
> +		TEST_ASSERT(found_gens[3] == found_gens[2],
> +			    "unexpected gen change: %d vs. %d",
> +			    found_gens[3], found_gens[2]);
> +	} else

Once more, with feeling!

> +		mark_memory_idle(vm, nr_vcpus);
> +
>  	access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
>  
> +	if (lru_gen) {
> +		/* Scan the pages tables again */
> +		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> +
> +		/* The pages should now be young again, so in a newer generation */
> +		found_gens[4] = lru_gen_find_generation(&stats, total_pages);
> +		TEST_ASSERT(found_gens[4] > found_gens[3],
> +			    "pages did not get younger");
> +	}
> +}

...

> @@ -353,13 +618,15 @@ int main(int argc, char *argv[])
>  		.backing_src = DEFAULT_VM_MEM_SRC,
>  		.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
>  		.nr_vcpus = 1,
> +		.lru_gen = false,
> +		.benchmark_lru_gen = false,
>  	};
>  	int page_idle_fd;
>  	int opt;
>  
>  	guest_modes_append_default();
>  
> -	while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
> +	while ((opt = getopt(argc, argv, "hm:b:v:os:lr:p")) != -1) {
>  		switch (opt) {
>  		case 'm':
>  			guest_modes_cmdline(optarg);
> @@ -376,6 +643,15 @@ int main(int argc, char *argv[])
>  		case 's':
>  			params.backing_src = parse_backing_src_type(optarg);
>  			break;
> +		case 'l':
> +			params.lru_gen = true;
> +			break;
> +		case 'p':
> +			params.benchmark_lru_gen = true;
> +			break;
> +		case 'r':
> +			cgroup_root = strdup(optarg);
> +			break;
>  		case 'h':
>  		default:
>  			help(argv[0]);
> @@ -383,12 +659,42 @@ int main(int argc, char *argv[])
>  		}
>  	}
>  
> -	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> -	__TEST_REQUIRE(page_idle_fd >= 0,
> -		       "CONFIG_IDLE_PAGE_TRACKING is not enabled");
> -	close(page_idle_fd);
> +	if (!params.lru_gen) {
> +		page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> +		__TEST_REQUIRE(page_idle_fd >= 0,
> +			       "CONFIG_IDLE_PAGE_TRACKING is not enabled");
> +		close(page_idle_fd);
> +	} else {
> +		int lru_gen_fd, lru_gen_debug_fd;
> +		long mglru_features;
> +		char mglru_feature_str[8] = {};
> +
> +		lru_gen_fd = open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY);
> +		__TEST_REQUIRE(lru_gen_fd >= 0,
> +			       "CONFIG_LRU_GEN is not enabled");

Noooo!  Do not assume opening a path failed because some config option.  That
reminds me, this needs to be rewritten.  I can't count the number of times this
stupid test has oh so helpfully told me CONFIG_IDLE_PAGE_TRACKING is disabled,
when in fact the problem is that I didn't run it as root.

		page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
		__TEST_REQUIRE(page_idle_fd >= 0,
			       "CONFIG_IDLE_PAGE_TRACKING is not enabled");


By all means, give the user a hint, but don't assume anything.  E.g.

		page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
		__TEST_REQUIRE(page_idle_fd >= 0,
			       "Failed to open blah blah blah, is CONFIG_IDLE_PAGE_TRACKING enabled?");

> +		TEST_ASSERT(read(lru_gen_fd, &mglru_feature_str, 7) > 0,
> +				 "couldn't read lru_gen features");
> +		mglru_features = strtol(mglru_feature_str, NULL, 16);
> +		__TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED,
> +			       "lru_gen is not enabled");
> +		__TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK,
> +			       "lru_gen does not support MM_WALK");
> +
> +		lru_gen_debug_fd = open(DEBUGFS_LRU_GEN, O_RDWR);
> +		__TEST_REQUIRE(lru_gen_debug_fd >= 0,
> +				"Cannot access %s", DEBUGFS_LRU_GEN);
> +		close(lru_gen_debug_fd);
> +	}
> +
> +	TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen,
> +		    "-p specified without -l");
> +
> +	if (params.lru_gen) {
> +		create_memcg(TEST_MEMCG_NAME);
> +		move_to_memcg(TEST_MEMCG_NAME, getpid());

After this, cgroup_root is never used.  Hint, hint.

> +	}
>  
> -	for_each_guest_mode(run_test, &params);
> +	for_each_guest_mode(setup_vm_and_run, &params);
>  
>  	return 0;
>  }
> diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/testing/selftests/kvm/include/lru_gen_util.h
> new file mode 100644
> index 000000000000..4eef8085a3cb
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/include/lru_gen_util.h
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Tools for integrating with lru_gen, like parsing the lru_gen debugfs output.
> + *
> + * Copyright (C) 2024, Google LLC.
> + */
> +#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H
> +#define SELFTEST_KVM_LRU_GEN_UTIL_H
> +
> +#include <inttypes.h>
> +#include <limits.h>
> +#include <stdlib.h>
> +
> +#include "test_util.h"
> +
> +#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */
> +#define MAX_NR_NODES 4 /* Maximum number of nodes we support */

Who is "we"?  KVM selftests?  Kernel?
> +
> +static const char *DEBUGFS_LRU_GEN = "/sys/kernel/debug/lru_gen";

So, root again?  

> +
> +struct generation_stats {
> +	int gen;
> +	long age_ms;
> +	long nr_anon;
> +	long nr_file;
> +};
> +
> +struct node_stats {
> +	int node;
> +	int nr_gens; /* Number of populated gens entries. */
> +	struct generation_stats gens[MAX_NR_GENS];
> +};
> +
> +struct memcg_stats {
> +	unsigned long memcg_id;
> +	int nr_nodes; /* Number of populated nodes entries. */
> +	struct node_stats nodes[MAX_NR_NODES];
> +};
> +
> +void print_memcg_stats(const struct memcg_stats *stats, const char *name);
> +
> +void read_memcg_stats(struct memcg_stats *stats, const char *memcg);
> +
> +void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);

These need lru_gen_ namespacing.  As is, they are very, very misleading names.
They don't read core memcg stuff, they read LRU_GEN stuff, which AFAICT happens
to be indexed by the memcg name.

> +static void memcg_stats_handle_in_node(struct memcg_stats *stats,
> +				       struct memcg_stats_parse_context *ctx,
> +				       char *line)
> +{
> +	/* Have to copy since we might not consume */

Huh?

> +	char *my_line = strdup(line);
> +	struct split_iterator it = { .str = my_line };
> +	char *gen, *age, *nr_anon, *nr_file;
> +	struct node_stats *node_stats;
> +	struct generation_stats *gen_stats;
> +	char *end;
> +
> +	TEST_ASSERT(it.str, "failed to copy input line");
> +
> +	gen = split_next(&it);
> +
> +	/* Skip empty lines */
> +	if (!gen)
> +		goto out_consume; /* Skip empty lines */

If you say it three times, it might happen!  (Can you tell it's Friday afternoon?)

> +	if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
> +		/*
> +		 * Reached next memcg or node section. Don't consume, let the
> +		 * other handler deal with this.
> +		 */
> +		ctx->next_handler = memcg_stats_handle_in_memcg;
> +		goto out;
> +	}

...

> +void print_memcg_stats(const struct memcg_stats *stats, const char *name)
> +{
> +	int node, gen;
> +
> +	fprintf(stderr, "stats for memcg %s (id %lu):\n",

Why stderr?  This is effectively wrapped with DEBUG, so why not pr_debug()?

> +			name, stats->memcg_id);
> +	for (node = 0; node < stats->nr_nodes; ++node) {
> +		fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
> +		for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
> +			const struct generation_stats *gstats =
> +				&stats->nodes[node].gens[gen];
> +
> +			fprintf(stderr,
> +				"\t\tgen %d\tage_ms %ld"
> +				"\tnr_anon %ld\tnr_file %ld\n",
> +				gstats->gen, gstats->age_ms, gstats->nr_anon,
> +				gstats->nr_file);
> +		}
> +	}
> +}
> +
> +/* Re-read lru_gen debugfs information for @memcg into @stats. */
> +void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
> +{
> +	FILE *f;
> +	ssize_t read = 0;
> +	char *line = NULL;
> +	size_t bufsz;
> +	struct memcg_stats_parse_context ctx = {
> +		.next_handler = memcg_stats_handle_searching,
> +		.name = memcg,
> +	};
> +
> +	memset(stats, 0, sizeof(struct memcg_stats));
> +
> +	f = fopen(DEBUGFS_LRU_GEN, "r");
> +	TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
> +
> +	while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
> +		ctx.consumed = false;
> +
> +		do {
> +			ctx.next_handler(stats, &ctx, line);
> +			if (!ctx.next_handler)
> +				break;
> +		} while (!ctx.consumed);
> +	}
> +
> +	if (read < 0 && !feof(f))
> +		TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
> +
> +	TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
> +		    "Did the memcg get created in the proper mount?",
> +		    memcg);
> +	if (line)
> +		free(line);
> +	TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
> +}
> +
> +/*
> + * Find all pages tracked by lru_gen for this memcg in generation @target_gen.
> + *
> + * If @target_gen is negative, look for all generations.
> + */
> +static long sum_memcg_stats_for_gen(int target_gen,
> +				    const struct memcg_stats *stats)
> +{
> +	int node, gen;
> +	long total_nr = 0;
> +
> +	for (node = 0; node < stats->nr_nodes; ++node) {
> +		const struct node_stats *node_stats = &stats->nodes[node];
> +
> +		for (gen = 0; gen < node_stats->nr_gens; ++gen) {
> +			const struct generation_stats *gen_stats =
> +				&node_stats->gens[gen];
> +
> +			if (target_gen >= 0 && gen_stats->gen != target_gen)
> +				continue;
> +
> +			total_nr += gen_stats->nr_anon + gen_stats->nr_file;
> +		}
> +	}
> +
> +	return total_nr;
> +}
> +
> +/* Find all pages tracked by lru_gen for this memcg. */
> +long sum_memcg_stats(const struct memcg_stats *stats)
> +{
> +	return sum_memcg_stats_for_gen(-1, stats);
> +}
> +
> +/* Read the memcg stats and optionally print if this is a debug build. */
> +void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
> +{
> +	read_memcg_stats(stats, memcg);
> +#ifdef DEBUG
> +	print_memcg_stats(stats, memcg);

print_memcg_stats() should be static, because this is the only caller.  But I'm
guessing you made it globally visible so that the compiler wouldn't complain
about unused functions.  A better approach would be to wrap the guts with the
#ifdef.

> +#endif
> +}
> +
> +/*
> + * If lru_gen aging should force page table scanning.
> + *
> + * If you want to set this to false, you will need to do eviction
> + * before doing extra aging passes.
> + */
> +static const bool force_scan = true;
> +
> +static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
> +{
> +	FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
> +	char *command;
> +	size_t sz;
> +
> +	TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
> +	sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
> +		      memcg_id, node_id, max_gen, force_scan);
> +	TEST_ASSERT(sz > 0, "creating aging command failed");
> +
> +	pr_debug("Running aging command: %s", command);
> +	if (fwrite(command, sizeof(char), sz, f) < sz) {
> +		TEST_ASSERT(false, "writing aging command %s to %s failed",
> +			    command, DEBUGFS_LRU_GEN);
> +	}
> +
> +	TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
> +}
> +
> +static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,

Two underscores.

> +/* Do aging, and print how long it took. */
> +void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
> +{
> +	return _lru_gen_do_aging(stats, memcg, true);
> +}
> +
> +/* Do aging, don't print anything. */
> +void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
> +{
> +	return _lru_gen_do_aging(stats, memcg, false);

static inline helpers in the header?  Having to come all this way to see that
these are simple wrapper is annoying.

> +}
> +
> +/*
> + * Find which generation contains more than half of @total_pages, assuming that
> + * such a generation exists.
> + */
> +int lru_gen_find_generation(const struct memcg_stats *stats,
> +			    unsigned long total_pages)
> +{
> +	int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
> +
> +	for (node = 0; node < stats->nr_nodes; ++node)

Curly braces.

> +		for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
> +		     ++gen_idx) {
> +			gen = stats->nodes[node].gens[gen_idx].gen;
> +			max_gen = gen > max_gen ? gen : max_gen;
> +			min_gen = gen < min_gen ? gen : min_gen;
> +		}
> +
> +	for (gen = min_gen; gen < max_gen; ++gen)
> +		/* See if the most pages are in this generation. */
> +		if (sum_memcg_stats_for_gen(gen, stats) >
> +				total_pages / 2)
> +			return gen;
> +
> +	TEST_ASSERT(false, "No generation includes majority of %lu pages.",

TEST_FAIL.

> +		    total_pages);
> +
> +	/* unreachable, but make the compiler happy */
> +	return -1;

I _think_ selftests can use unreachable().

> +}
> -- 
> 2.47.0.199.ga7371fff76-goog
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test
  2025-01-11  0:12   ` Sean Christopherson
@ 2025-02-03 22:46     ` James Houghton
  0 siblings, 0 replies; 41+ messages in thread
From: James Houghton @ 2025-02-03 22:46 UTC (permalink / raw)
  To: Sean Christopherson, Yosry Ahmed
  Cc: Paolo Bonzini, David Matlack, David Rientjes, Marc Zyngier,
	Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm, linux-kernel

On Fri, Jan 10, 2025 at 4:12 PM Sean Christopherson <seanjc@google.com> wrote:
>
> This can/should be posted separately, no?  MGLRU support for secondary MMUs went
> through mm/, the KVM changes in this series are all about making KVM x86 faster.

I'll send this patch separately, sure.

>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > This test now has two modes of operation:
>
> Bad Google3, bad!  Write changelogs as commands, i.e. state what the patch is
> doing.  The above is also misleading (at least, it was to me), as I assumed one
> of the modes would be "legacy" and the other would be MGLRU.  But it looks like
> this patch adds MGLRU support *and* benchmarking.
>
> This should be split into at least two patches, possibly three (I can't tell how
> much pre-work there is).  I.e. add MGLRU support, and then add the benchmarking
> stuff.  And if there's substantial refactoring and/or new utilities, do that first.
>
> > 1. (default) To check how much vCPU performance was affected by access
> >              tracking (previously existed, now supports MGLRU aging).
> > 2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
> >         faulting in memory.
> >
> > Mode (1) also serves as a way to verify that aging is working properly
> > for pages only accessed by KVM.  It will fail if one does not have the
>
> "one" what?
>
> > 0x8 lru_gen feature bit.
> >
> > To support MGLRU, the test creates a memory cgroup, moves itself into
> > it, then uses the lru_gen debugfs output to track memory in that cgroup.
> > The logic to parse the lru_gen debugfs output has been put into
> > selftests/kvm/lib/lru_gen_util.c.
> >
> > Co-developed-by: Axel Rasmussen <axelrasmussen@google.com>
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
>
> ...
>
> > @@ -47,6 +48,19 @@
> >  #include "memstress.h"
> >  #include "guest_modes.h"
> >  #include "processor.h"
> > +#include "lru_gen_util.h"
> > +
> > +static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
> > +static const int LRU_GEN_ENABLED = 0x1;
> > +static const int LRU_GEN_MM_WALK = 0x2;
>
> Is there really no uAPI #define for these?
>
> > +static const char *CGROUP_PROCS = "cgroup.procs";
> > +/*
> > + * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy
>
> I would say "requires", not "assumes".
>
> > + * is mounted at cgroup_root.
> > + *
> > + * Can be changed with -r.
>
> This is amusingly vague.  I vote to omit explicitly referencing the command line
> option, we have enough problems maintaining them as it is.  Instead simply say
> something like "Default to the kernel's preferred path for mounting cgroups" to
> both explain where the default comes from, and to give the reader a hint that the
> path can be changed.
>
> Actually, there's zero reason for this to be global.  More below.
>
> Ugh, and if this test is manipulating cgroups, won't it need to root?  I doubt
> y'all have tried to run this outside of devrez, i.e. on a system where you're
> not logged in as root.
>
> Oh, never mind, this test already effectively requires root.
>
> > +     /* Whether to use lru_gen aging instead of idle page tracking. */
> > +     bool lru_gen;
>
> Needs a verb, otherwise this looks like it tracks the LRU generation.  E.g. use_lru_gen.
>
> > +
> > +     /* Whether to test the performance of aging itself. */
> > +     bool benchmark_lru_gen;
> >  };
> >
> >  static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
> > @@ -89,6 +112,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
> >
> >  }
> >
> > +static void write_file_long(const char *path, long v)
> > +{
> > +     FILE *f;
> > +
> > +     f = fopen(path, "w");
> > +     TEST_ASSERT(f, "fopen(%s) failed", path);
> > +     TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
> > +                 "fprintf to %s failed", path);
> > +     TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
> > +}
> > +
> > +static char *path_join(const char *parent, const char *child)
> > +{
> > +     char *out = NULL;
> > +
> > +     return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
> > +}
>
> These are common utilities, no?  I.e. should be somewhere common, not buried in
> this test.
>
> > +
> > +static char *memcg_path(const char *memcg)
> > +{
> > +     return path_join(cgroup_root, memcg);
>
> Eh, do the join when cgroup_root is first defined.  Actually, looking at the
> usage more closely, the cgroup path is only used during main().  Just do all of
> the joins there, I see no reason to have these one-off helpers.
>
> > +}
> > +
> > +static char *memcg_file_path(const char *memcg, const char *file)
> > +{
> > +     char *mp = memcg_path(memcg);
> > +     char *fp;
> > +
> > +     if (!mp)
> > +             return NULL;
>
> Returning NULL just so that the one user can assert on !NULL is rather pointless.
>
> > +     fp = path_join(mp, file);
> > +     free(mp);
> > +     return fp;
> > +}
> > +
> > +static void move_to_memcg(const char *memcg, pid_t pid)
> > +{
> > +     char *procs = memcg_file_path(memcg, CGROUP_PROCS);
> > +
> > +     TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
> > +     write_file_long(procs, pid);
> > +     free(procs);
> > +}
> > +
> >  #define PAGEMAP_PRESENT (1ULL << 63)
> >  #define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)
> >
> > @@ -242,6 +309,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
> >               };
> >
> >               vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
> > +             clock_gettime(CLOCK_MONOTONIC,
> > +                           &vcpu_last_completed_time[vcpu_idx]);
> >       }
> >  }
> >
> > @@ -253,38 +322,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int target_iteration)
> >       }
> >  }
> >
> > +static bool all_vcpus_done(int target_iteration, int nr_vcpus)
> > +{
> > +     for (int i = 0; i < nr_vcpus; ++i)
>
> Preferred style is to declare variables outside of loops.
>
> Needs curly braces.
>
> > +             if (READ_ONCE(vcpu_last_completed_iteration[i]) !=
> > +                 target_iteration)
>
> Don't wrap.
>
> > +                     return false;
> > +
> > +     return true;
> > +}
> > +
> >  /* The type of memory accesses to perform in the VM. */
> >  enum access_type {
> >       ACCESS_READ,
> >       ACCESS_WRITE,
> >  };
> >
> > -static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description)
> > +static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description,
> > +                       bool wait)
> >  {
> > -     struct timespec ts_start;
> > -     struct timespec ts_elapsed;
> >       int next_iteration, i;
> >
> >       /* Kick off the vCPUs by incrementing iteration. */
> >       next_iteration = ++iteration;
> >
> > -     clock_gettime(CLOCK_MONOTONIC, &ts_start);
> > -
> >       /* Wait for all vCPUs to finish the iteration. */
> > -     for (i = 0; i < nr_vcpus; i++)
> > -             spin_wait_for_vcpu(i, next_iteration);
> > +     if (wait) {
> > +             struct timespec ts_start;
> > +             struct timespec ts_elapsed;
> > +
> > +             clock_gettime(CLOCK_MONOTONIC, &ts_start);
> >
> > -     ts_elapsed = timespec_elapsed(ts_start);
> > -     pr_info("%-30s: %ld.%09lds\n",
> > -             description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> > +             for (i = 0; i < nr_vcpus; i++)
> > +                     spin_wait_for_vcpu(i, next_iteration);
> > +
> > +             ts_elapsed = timespec_elapsed(ts_start);
> > +
> > +             pr_info("%-30s: %ld.%09lds\n",
> > +                     description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> > +     } else
>
> Needs curly braces.
>
> > +             pr_info("%-30s\n", description);
> >  }
> >
> > -static void access_memory(struct kvm_vm *vm, int nr_vcpus,
> > -                       enum access_type access, const char *description)
> > +static void _access_memory(struct kvm_vm *vm, int nr_vcpus,
>
> Use double underscores (ignore any precedent in selftests that uses just one,
> we're trying to purge that ugliness).
>
> > +                        enum access_type access, const char *description,
> > +                        bool wait)
> >  {
> >       memstress_set_write_percent(vm, (access == ACCESS_READ) ? 0 : 100);
> >       iteration_work = ITERATION_ACCESS_MEMORY;
> > -     run_iteration(vm, nr_vcpus, description);
> > +     run_iteration(vm, nr_vcpus, description, wait);
> > +}
> > +
> > +static void access_memory(struct kvm_vm *vm, int nr_vcpus,
> > +                       enum access_type access, const char *description)
> > +{
> > +     return _access_memory(vm, nr_vcpus, access, description, true);
> > +}
> > +
> > +static void access_memory_async(struct kvm_vm *vm, int nr_vcpus,
>
> Maybe "nowait" instead of "async"?  Yeah, something ends up being asynchronous
> (presumably), but the call itself is "synchronous", i.e. isn't spun off to a
> worker or anything.
>
> > +                             enum access_type access,
> > +                             const char *description)
> > +{
> > +     return _access_memory(vm, nr_vcpus, access, description, false);
> >  }
> >
> >  static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
> > @@ -297,19 +396,115 @@ static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
> >        */
> >       pr_debug("Marking VM memory idle (slow)...\n");
> >       iteration_work = ITERATION_MARK_IDLE;
> > -     run_iteration(vm, nr_vcpus, "Mark memory idle");
> > +     run_iteration(vm, nr_vcpus, "Mark memory idle", true);
> >  }
> >
> > -static void run_test(enum vm_guest_mode mode, void *arg)
> > +static void create_memcg(const char *memcg)
> > +{
> > +     const char *full_memcg_path = memcg_path(memcg);
> > +     int ret;
> > +
> > +     TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path");
> > +retry:
> > +     ret = mkdir(full_memcg_path, 0755);
> > +     if (ret && errno == EEXIST) {
> > +             TEST_ASSERT(!rmdir(full_memcg_path),
> > +                         "Found existing memcg at %s, but rmdir failed",
> > +                         full_memcg_path);
> > +             goto retry;
>
>         while (1) {
>                 ret = mkdir(full_memcg_path, 0755);
>                 if (!ret)
>                         break;
>
>                 TEST_ASSERT(errno == EEXIST,
>                             Creating the memcg via 'mkdir(%s)' failed",
>                             full_memcg_path);
>
>                 TEST_ASSERT(!rmdir(full_memcg_path),
>                             "Found existing memcg at %s, but rmdir failed",
>                             full_memcg_path);
>         }
>
> > +     }
> > +     TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed",
>
> Heh, so it failed?
>
> > +                 full_memcg_path);
> > +
> > +     pr_info("Created memcg at %s\n", full_memcg_path);
> > +}
>
> ...
>
> > +/*
> > + * Test how much access tracking affects vCPU performance.
> > + *
> > + * Supports two modes of access tracking:
> > + * - idle page tracking
> > + * - lru_gen aging
> > + *
> > + * When using lru_gen, this test additionally verifies that the pages are in
> > + * fact getting younger and older, otherwise the performance data would be
> > + * invalid.
> > + *
> > + * The forced lru_gen aging can race with aging that occurs naturally.
> > + */
> > +static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm,
> > +                  struct test_params *params)
> > +{
> > +     int nr_vcpus = params->nr_vcpus;
> > +     bool lru_gen = params->lru_gen;
> > +     struct memcg_stats stats;
> > +     // If guest_page_size is larger than the host's page size, the
> > +     // guest (memstress) will only fault in a subset of the host's pages.
>
> No C++ comments, please.
>
> > +     long total_pages = nr_vcpus * params->vcpu_memory_bytes /
> > +                        max(memstress_args.guest_page_size,
> > +                            (uint64_t)getpagesize());
>
> max_t() is probably better.
>
> > +     int found_gens[5];
> >
> >       pr_info("\n");
> >       access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
> > @@ -319,11 +514,78 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >       access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
> >
> >       /* Repeat on memory that has been marked as idle. */
> > -     mark_memory_idle(vm, nr_vcpus);
> > +     if (lru_gen) {
> > +             /* Do an initial page table scan */
> > +             lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +             TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
> > +               "Not all pages tracked in lru_gen stats.\n"
> > +               "Is lru_gen enabled? Did the memcg get created properly?");
>
> Align indentation.
>
>                 TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
>                             "Not all pages tracked in lru_gen stats.\n"
>                             "Is lru_gen enabled? Did the memcg get created properly?");
>
> > +
> > +             /* Find the generation we're currently in (probably youngest) */
> > +             found_gens[0] = lru_gen_find_generation(&stats, total_pages);
> > +
> > +             /* Do an aging pass now */
> > +             lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +
> > +             /* Same generation, but a newer generation has been made */
> > +             found_gens[1] = lru_gen_find_generation(&stats, total_pages);
> > +             TEST_ASSERT(found_gens[1] == found_gens[0],
> > +                         "unexpected gen change: %d vs. %d",
> > +                         found_gens[1], found_gens[0]);
>
> I don't have any ideas off the top of my head, but there's gotta be a way to
> dedup these blocks.
>
> > +     } else
>
> Needs curly braces.
>
> > +             mark_memory_idle(vm, nr_vcpus);
> > +
> >       access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
> > -     mark_memory_idle(vm, nr_vcpus);
> > +
> > +     if (lru_gen) {
> > +             /* Scan the page tables again */
> > +             lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +
> > +             /* The pages should now be young again, so in a newer generation */
> > +             found_gens[2] = lru_gen_find_generation(&stats, total_pages);
> > +             TEST_ASSERT(found_gens[2] > found_gens[1],
> > +                         "pages did not get younger");
> > +
> > +             /* Do another aging pass */
> > +             lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +
> > +             /* Same generation; new generation has been made */
> > +             found_gens[3] = lru_gen_find_generation(&stats, total_pages);
> > +             TEST_ASSERT(found_gens[3] == found_gens[2],
> > +                         "unexpected gen change: %d vs. %d",
> > +                         found_gens[3], found_gens[2]);
> > +     } else
>
> Once more, with feeling!
>
> > +             mark_memory_idle(vm, nr_vcpus);
> > +
> >       access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
> >
> > +     if (lru_gen) {
> > +             /* Scan the pages tables again */
> > +             lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
> > +
> > +             /* The pages should now be young again, so in a newer generation */
> > +             found_gens[4] = lru_gen_find_generation(&stats, total_pages);
> > +             TEST_ASSERT(found_gens[4] > found_gens[3],
> > +                         "pages did not get younger");
> > +     }
> > +}
>
> ...
>
> > @@ -353,13 +618,15 @@ int main(int argc, char *argv[])
> >               .backing_src = DEFAULT_VM_MEM_SRC,
> >               .vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
> >               .nr_vcpus = 1,
> > +             .lru_gen = false,
> > +             .benchmark_lru_gen = false,
> >       };
> >       int page_idle_fd;
> >       int opt;
> >
> >       guest_modes_append_default();
> >
> > -     while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
> > +     while ((opt = getopt(argc, argv, "hm:b:v:os:lr:p")) != -1) {
> >               switch (opt) {
> >               case 'm':
> >                       guest_modes_cmdline(optarg);
> > @@ -376,6 +643,15 @@ int main(int argc, char *argv[])
> >               case 's':
> >                       params.backing_src = parse_backing_src_type(optarg);
> >                       break;
> > +             case 'l':
> > +                     params.lru_gen = true;
> > +                     break;
> > +             case 'p':
> > +                     params.benchmark_lru_gen = true;
> > +                     break;
> > +             case 'r':
> > +                     cgroup_root = strdup(optarg);
> > +                     break;
> >               case 'h':
> >               default:
> >                       help(argv[0]);
> > @@ -383,12 +659,42 @@ int main(int argc, char *argv[])
> >               }
> >       }
> >
> > -     page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> > -     __TEST_REQUIRE(page_idle_fd >= 0,
> > -                    "CONFIG_IDLE_PAGE_TRACKING is not enabled");
> > -     close(page_idle_fd);
> > +     if (!params.lru_gen) {
> > +             page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> > +             __TEST_REQUIRE(page_idle_fd >= 0,
> > +                            "CONFIG_IDLE_PAGE_TRACKING is not enabled");
> > +             close(page_idle_fd);
> > +     } else {
> > +             int lru_gen_fd, lru_gen_debug_fd;
> > +             long mglru_features;
> > +             char mglru_feature_str[8] = {};
> > +
> > +             lru_gen_fd = open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY);
> > +             __TEST_REQUIRE(lru_gen_fd >= 0,
> > +                            "CONFIG_LRU_GEN is not enabled");
>
> Noooo!  Do not assume opening a path failed because some config option.  That
> reminds me, this needs to be rewritten.  I can't count the number of times this
> stupid test has oh so helpfully told me CONFIG_IDLE_PAGE_TRACKING is disabled,
> when in fact the problem is that I didn't run it as root.
>
>                 page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
>                 __TEST_REQUIRE(page_idle_fd >= 0,
>                                "CONFIG_IDLE_PAGE_TRACKING is not enabled");
>
>
> By all means, give the user a hint, but don't assume anything.  E.g.
>
>                 page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
>                 __TEST_REQUIRE(page_idle_fd >= 0,
>                                "Failed to open blah blah blah, is CONFIG_IDLE_PAGE_TRACKING enabled?");
>
> > +             TEST_ASSERT(read(lru_gen_fd, &mglru_feature_str, 7) > 0,
> > +                              "couldn't read lru_gen features");
> > +             mglru_features = strtol(mglru_feature_str, NULL, 16);
> > +             __TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED,
> > +                            "lru_gen is not enabled");
> > +             __TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK,
> > +                            "lru_gen does not support MM_WALK");
> > +
> > +             lru_gen_debug_fd = open(DEBUGFS_LRU_GEN, O_RDWR);
> > +             __TEST_REQUIRE(lru_gen_debug_fd >= 0,
> > +                             "Cannot access %s", DEBUGFS_LRU_GEN);
> > +             close(lru_gen_debug_fd);
> > +     }
> > +
> > +     TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen,
> > +                 "-p specified without -l");
> > +
> > +     if (params.lru_gen) {
> > +             create_memcg(TEST_MEMCG_NAME);
> > +             move_to_memcg(TEST_MEMCG_NAME, getpid());
>
> After this, cgroup_root is never used.  Hint, hint.
>
> > +     }
> >
> > -     for_each_guest_mode(run_test, &params);
> > +     for_each_guest_mode(setup_vm_and_run, &params);
> >
> >       return 0;
> >  }
> > diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/testing/selftests/kvm/include/lru_gen_util.h
> > new file mode 100644
> > index 000000000000..4eef8085a3cb
> > --- /dev/null
> > +++ b/tools/testing/selftests/kvm/include/lru_gen_util.h
> > @@ -0,0 +1,55 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Tools for integrating with lru_gen, like parsing the lru_gen debugfs output.
> > + *
> > + * Copyright (C) 2024, Google LLC.
> > + */
> > +#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H
> > +#define SELFTEST_KVM_LRU_GEN_UTIL_H
> > +
> > +#include <inttypes.h>
> > +#include <limits.h>
> > +#include <stdlib.h>
> > +
> > +#include "test_util.h"
> > +
> > +#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */
> > +#define MAX_NR_NODES 4 /* Maximum number of nodes we support */
>
> Who is "we"?  KVM selftests?  Kernel?
> > +
> > +static const char *DEBUGFS_LRU_GEN = "/sys/kernel/debug/lru_gen";
>
> So, root again?
>
> > +
> > +struct generation_stats {
> > +     int gen;
> > +     long age_ms;
> > +     long nr_anon;
> > +     long nr_file;
> > +};
> > +
> > +struct node_stats {
> > +     int node;
> > +     int nr_gens; /* Number of populated gens entries. */
> > +     struct generation_stats gens[MAX_NR_GENS];
> > +};
> > +
> > +struct memcg_stats {
> > +     unsigned long memcg_id;
> > +     int nr_nodes; /* Number of populated nodes entries. */
> > +     struct node_stats nodes[MAX_NR_NODES];
> > +};
> > +
> > +void print_memcg_stats(const struct memcg_stats *stats, const char *name);
> > +
> > +void read_memcg_stats(struct memcg_stats *stats, const char *memcg);
> > +
> > +void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);
>
> These need lru_gen_ namespacing.  As is, they are very, very misleading names.
> They don't read core memcg stuff, they read LRU_GEN stuff, which AFAICT happens
> to be indexed by the memcg name.
>
> > +static void memcg_stats_handle_in_node(struct memcg_stats *stats,
> > +                                    struct memcg_stats_parse_context *ctx,
> > +                                    char *line)
> > +{
> > +     /* Have to copy since we might not consume */
>
> Huh?
>
> > +     char *my_line = strdup(line);
> > +     struct split_iterator it = { .str = my_line };
> > +     char *gen, *age, *nr_anon, *nr_file;
> > +     struct node_stats *node_stats;
> > +     struct generation_stats *gen_stats;
> > +     char *end;
> > +
> > +     TEST_ASSERT(it.str, "failed to copy input line");
> > +
> > +     gen = split_next(&it);
> > +
> > +     /* Skip empty lines */
> > +     if (!gen)
> > +             goto out_consume; /* Skip empty lines */
>
> If you say it three times, it might happen!  (Can you tell it's Friday afternoon?)
>
> > +     if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
> > +             /*
> > +              * Reached next memcg or node section. Don't consume, let the
> > +              * other handler deal with this.
> > +              */
> > +             ctx->next_handler = memcg_stats_handle_in_memcg;
> > +             goto out;
> > +     }
>
> ...
>
> > +void print_memcg_stats(const struct memcg_stats *stats, const char *name)
> > +{
> > +     int node, gen;
> > +
> > +     fprintf(stderr, "stats for memcg %s (id %lu):\n",
>
> Why stderr?  This is effectively wrapped with DEBUG, so why not pr_debug()?
>
> > +                     name, stats->memcg_id);
> > +     for (node = 0; node < stats->nr_nodes; ++node) {
> > +             fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
> > +             for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
> > +                     const struct generation_stats *gstats =
> > +                             &stats->nodes[node].gens[gen];
> > +
> > +                     fprintf(stderr,
> > +                             "\t\tgen %d\tage_ms %ld"
> > +                             "\tnr_anon %ld\tnr_file %ld\n",
> > +                             gstats->gen, gstats->age_ms, gstats->nr_anon,
> > +                             gstats->nr_file);
> > +             }
> > +     }
> > +}
> > +
> > +/* Re-read lru_gen debugfs information for @memcg into @stats. */
> > +void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
> > +{
> > +     FILE *f;
> > +     ssize_t read = 0;
> > +     char *line = NULL;
> > +     size_t bufsz;
> > +     struct memcg_stats_parse_context ctx = {
> > +             .next_handler = memcg_stats_handle_searching,
> > +             .name = memcg,
> > +     };
> > +
> > +     memset(stats, 0, sizeof(struct memcg_stats));
> > +
> > +     f = fopen(DEBUGFS_LRU_GEN, "r");
> > +     TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
> > +
> > +     while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
> > +             ctx.consumed = false;
> > +
> > +             do {
> > +                     ctx.next_handler(stats, &ctx, line);
> > +                     if (!ctx.next_handler)
> > +                             break;
> > +             } while (!ctx.consumed);
> > +     }
> > +
> > +     if (read < 0 && !feof(f))
> > +             TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
> > +
> > +     TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
> > +                 "Did the memcg get created in the proper mount?",
> > +                 memcg);
> > +     if (line)
> > +             free(line);
> > +     TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
> > +}
> > +
> > +/*
> > + * Find all pages tracked by lru_gen for this memcg in generation @target_gen.
> > + *
> > + * If @target_gen is negative, look for all generations.
> > + */
> > +static long sum_memcg_stats_for_gen(int target_gen,
> > +                                 const struct memcg_stats *stats)
> > +{
> > +     int node, gen;
> > +     long total_nr = 0;
> > +
> > +     for (node = 0; node < stats->nr_nodes; ++node) {
> > +             const struct node_stats *node_stats = &stats->nodes[node];
> > +
> > +             for (gen = 0; gen < node_stats->nr_gens; ++gen) {
> > +                     const struct generation_stats *gen_stats =
> > +                             &node_stats->gens[gen];
> > +
> > +                     if (target_gen >= 0 && gen_stats->gen != target_gen)
> > +                             continue;
> > +
> > +                     total_nr += gen_stats->nr_anon + gen_stats->nr_file;
> > +             }
> > +     }
> > +
> > +     return total_nr;
> > +}
> > +
> > +/* Find all pages tracked by lru_gen for this memcg. */
> > +long sum_memcg_stats(const struct memcg_stats *stats)
> > +{
> > +     return sum_memcg_stats_for_gen(-1, stats);
> > +}
> > +
> > +/* Read the memcg stats and optionally print if this is a debug build. */
> > +void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
> > +{
> > +     read_memcg_stats(stats, memcg);
> > +#ifdef DEBUG
> > +     print_memcg_stats(stats, memcg);
>
> print_memcg_stats() should be static, because this is the only caller.  But I'm
> guessing you made it globally visible so that the compiler wouldn't complain
> about unused functions.  A better approach would be to wrap the guts with the
> #ifdef.
>
> > +#endif
> > +}
> > +
> > +/*
> > + * If lru_gen aging should force page table scanning.
> > + *
> > + * If you want to set this to false, you will need to do eviction
> > + * before doing extra aging passes.
> > + */
> > +static const bool force_scan = true;
> > +
> > +static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
> > +{
> > +     FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
> > +     char *command;
> > +     size_t sz;
> > +
> > +     TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
> > +     sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
> > +                   memcg_id, node_id, max_gen, force_scan);
> > +     TEST_ASSERT(sz > 0, "creating aging command failed");
> > +
> > +     pr_debug("Running aging command: %s", command);
> > +     if (fwrite(command, sizeof(char), sz, f) < sz) {
> > +             TEST_ASSERT(false, "writing aging command %s to %s failed",
> > +                         command, DEBUGFS_LRU_GEN);
> > +     }
> > +
> > +     TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
> > +}
> > +
> > +static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,
>
> Two underscores.
>
> > +/* Do aging, and print how long it took. */
> > +void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
> > +{
> > +     return _lru_gen_do_aging(stats, memcg, true);
> > +}
> > +
> > +/* Do aging, don't print anything. */
> > +void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
> > +{
> > +     return _lru_gen_do_aging(stats, memcg, false);
>
> static inline helpers in the header?  Having to come all this way to see that
> these are simple wrapper is annoying.
>
> > +}
> > +
> > +/*
> > + * Find which generation contains more than half of @total_pages, assuming that
> > + * such a generation exists.
> > + */
> > +int lru_gen_find_generation(const struct memcg_stats *stats,
> > +                         unsigned long total_pages)
> > +{
> > +     int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
> > +
> > +     for (node = 0; node < stats->nr_nodes; ++node)
>
> Curly braces.
>
> > +             for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
> > +                  ++gen_idx) {
> > +                     gen = stats->nodes[node].gens[gen_idx].gen;
> > +                     max_gen = gen > max_gen ? gen : max_gen;
> > +                     min_gen = gen < min_gen ? gen : min_gen;
> > +             }
> > +
> > +     for (gen = min_gen; gen < max_gen; ++gen)
> > +             /* See if the most pages are in this generation. */
> > +             if (sum_memcg_stats_for_gen(gen, stats) >
> > +                             total_pages / 2)
> > +                     return gen;
> > +
> > +     TEST_ASSERT(false, "No generation includes majority of %lu pages.",
>
> TEST_FAIL.
>
> > +                 total_pages);
> > +
> > +     /* unreachable, but make the compiler happy */
> > +     return -1;
>
> I _think_ selftests can use unreachable().
>
> > +}
> > --
> > 2.47.0.199.ga7371fff76-goog
> >

Thanks for all the feedback. I'll apply all of it (and Yosry's
suggestion to use cgroup_util.h / otherwise de-duplicate the cgroup
logic, thanks Yosry!) for the next version of this patch. For now I'll
go ahead and send v9 of the main series.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test
  2024-11-05 18:43 ` [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
  2025-01-11  0:12   ` Sean Christopherson
@ 2025-01-11  0:21   ` Yosry Ahmed
  1 sibling, 0 replies; 41+ messages in thread
From: Yosry Ahmed @ 2025-01-11  0:21 UTC (permalink / raw)
  To: James Houghton
  Cc: Sean Christopherson, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Nov 5, 2024 at 10:49 AM James Houghton <jthoughton@google.com> wrote:
>
> This test now has two modes of operation:
> 1. (default) To check how much vCPU performance was affected by access
>              tracking (previously existed, now supports MGLRU aging).
> 2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
>         faulting in memory.
>
> Mode (1) also serves as a way to verify that aging is working properly
> for pages only accessed by KVM.  It will fail if one does not have the
> 0x8 lru_gen feature bit.
>
> To support MGLRU, the test creates a memory cgroup, moves itself into
> it, then uses the lru_gen debugfs output to track memory in that cgroup.
> The logic to parse the lru_gen debugfs output has been put into
> selftests/kvm/lib/lru_gen_util.c.
>
> Co-developed-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../selftests/kvm/access_tracking_perf_test.c | 366 ++++++++++++++--
>  .../selftests/kvm/include/lru_gen_util.h      |  55 +++
>  .../testing/selftests/kvm/lib/lru_gen_util.c  | 391 ++++++++++++++++++
>  4 files changed, 783 insertions(+), 30 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
>  create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c
>
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index f186888f0e00..542548e6e8ba 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -22,6 +22,7 @@ LIBKVM += lib/elf.c
>  LIBKVM += lib/guest_modes.c
>  LIBKVM += lib/io.c
>  LIBKVM += lib/kvm_util.c
> +LIBKVM += lib/lru_gen_util.c
>  LIBKVM += lib/memstress.c
>  LIBKVM += lib/guest_sprintf.c
>  LIBKVM += lib/rbtree.c
> diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> index 3c7defd34f56..8d6c2ce4b98a 100644
> --- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
> +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> @@ -38,6 +38,7 @@
>  #include <inttypes.h>
>  #include <limits.h>
>  #include <pthread.h>
> +#include <stdio.h>
>  #include <sys/mman.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
> @@ -47,6 +48,19 @@
>  #include "memstress.h"
>  #include "guest_modes.h"
>  #include "processor.h"
> +#include "lru_gen_util.h"
> +
> +static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
> +static const int LRU_GEN_ENABLED = 0x1;
> +static const int LRU_GEN_MM_WALK = 0x2;
> +static const char *CGROUP_PROCS = "cgroup.procs";
> +/*
> + * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy
> + * is mounted at cgroup_root.
> + *
> + * Can be changed with -r.
> + */
> +static const char *cgroup_root = "/sys/fs/cgroup";
>
>  /* Global variable used to synchronize all of the vCPU threads. */
>  static int iteration;
> @@ -62,6 +76,9 @@ static enum {
>  /* The iteration that was last completed by each vCPU. */
>  static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
>
> +/* The time at which the last iteration was completed */
> +static struct timespec vcpu_last_completed_time[KVM_MAX_VCPUS];
> +
>  /* Whether to overlap the regions of memory vCPUs access. */
>  static bool overlap_memory_access;
>
> @@ -74,6 +91,12 @@ struct test_params {
>
>         /* The number of vCPUs to create in the VM. */
>         int nr_vcpus;
> +
> +       /* Whether to use lru_gen aging instead of idle page tracking. */
> +       bool lru_gen;
> +
> +       /* Whether to test the performance of aging itself. */
> +       bool benchmark_lru_gen;
>  };
>
>  static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
> @@ -89,6 +112,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
>
>  }
>
> +static void write_file_long(const char *path, long v)
> +{
> +       FILE *f;
> +
> +       f = fopen(path, "w");
> +       TEST_ASSERT(f, "fopen(%s) failed", path);
> +       TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
> +                   "fprintf to %s failed", path);
> +       TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
> +}
> +
> +static char *path_join(const char *parent, const char *child)
> +{
> +       char *out = NULL;
> +
> +       return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
> +}
> +
> +static char *memcg_path(const char *memcg)
> +{
> +       return path_join(cgroup_root, memcg);
> +}
> +
> +static char *memcg_file_path(const char *memcg, const char *file)
> +{
> +       char *mp = memcg_path(memcg);
> +       char *fp;
> +
> +       if (!mp)
> +               return NULL;
> +       fp = path_join(mp, file);
> +       free(mp);
> +       return fp;
> +}
> +
> +static void move_to_memcg(const char *memcg, pid_t pid)
> +{
> +       char *procs = memcg_file_path(memcg, CGROUP_PROCS);
> +
> +       TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
> +       write_file_long(procs, pid);
> +       free(procs);
> +}
> +

From the peanut gallery, I suspect you can make use of the cgroup
helpers in tools/testing/selftests/cgroup/cgroup_util.h. There are
helpers to locate the cgroup root, create cgroups, enter cgroups, etc.
You can find example uses specifically for memory cgroups (memcgs) in
tools/testing/selftests/cgroup/test_memcontrol.c.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly
  2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
                   ` (10 preceding siblings ...)
  2024-11-05 18:43 ` [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
@ 2024-11-05 19:21 ` Yu Zhao
  2024-11-05 19:28   ` Yu Zhao
  11 siblings, 1 reply; 41+ messages in thread
From: Yu Zhao @ 2024-11-05 19:21 UTC (permalink / raw)
  To: James Houghton, Jonathan Corbet
  Cc: Sean Christopherson, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
>
> Andrew has queued patches to make MGLRU consult KVM when doing aging[8].
> Now, make aging lockless for the shadow MMU and the TDP MMU. This allows
> us to reduce the time/CPU it takes to do aging and the performance
> impact on the vCPUs while we are aging.
>
> The final patch in this series modifies access_tracking_stress_test to
> age using MGLRU. There is a mode (-p) where it will age while the vCPUs
> are faulting memory in. Here are some results with that mode:

Additional background in case I didn't provide it before:

At Google we keep track of hotness/coldness of VM memory to identify
opportunities to demote cold memory into slower tiers of storage. This
is done in a controlled manner so that while we benefit from the
improved memory efficiency through improved bin-packing, without
violating customer SLOs.

However, the monitoring/tracking introduced two major overheads [1] for us:
1. the traditional (host) PFN + rmap data structures [2] used to
locate host PTEs (containing the accessed bits).
2. the KVM MMU lock required to clear the accessed bits in
secondary/shadow PTEs.

MGLRU provides the infrastructure for us to reach out into page tables
directly from a list of mm_struct's, and therefore allows us to bypass
the first problem above and reduce the CPU overhead by ~80% for our
workloads (90%+ mmaped memory). This series solves the second problem:
by supporting locklessly clearing the accessed bits in SPTEs, it would
reduce our current KVM MMU lock contention by >80% [3]. All other
existing mechanisms, e.g., Idle Page Tracking, DAMON, etc., can also
seamlessly benefit from this series when monitoring/tracking VM
memory.

[1] https://lwn.net/Articles/787611/
[2] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
[3] https://research.google/pubs/profiling-a-warehouse-scale-computer/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly
  2024-11-05 19:21 ` [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly Yu Zhao
@ 2024-11-05 19:28   ` Yu Zhao
  0 siblings, 0 replies; 41+ messages in thread
From: Yu Zhao @ 2024-11-05 19:28 UTC (permalink / raw)
  To: James Houghton, Jonathan Corbet
  Cc: Sean Christopherson, Paolo Bonzini, David Matlack, David Rientjes,
	Marc Zyngier, Oliver Upton, Wei Xu, Axel Rasmussen, kvm,
	linux-kernel

On Tue, Nov 5, 2024 at 12:21 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
> >
> > Andrew has queued patches to make MGLRU consult KVM when doing aging[8].
> > Now, make aging lockless for the shadow MMU and the TDP MMU. This allows
> > us to reduce the time/CPU it takes to do aging and the performance
> > impact on the vCPUs while we are aging.
> >
> > The final patch in this series modifies access_tracking_stress_test to
> > age using MGLRU. There is a mode (-p) where it will age while the vCPUs
> > are faulting memory in. Here are some results with that mode:
>
> Additional background in case I didn't provide it before:
>
> At Google we keep track of hotness/coldness of VM memory to identify
> opportunities to demote cold memory into slower tiers of storage. This
> is done in a controlled manner so that while we benefit from the
> improved memory efficiency through improved bin-packing, without
> violating customer SLOs.
>
> However, the monitoring/tracking introduced two major overheads [1] for us:
> 1. the traditional (host) PFN + rmap data structures [2] used to
> locate host PTEs (containing the accessed bits).
> 2. the KVM MMU lock required to clear the accessed bits in
> secondary/shadow PTEs.
>
> MGLRU provides the infrastructure for us to reach out into page tables
> directly from a list of mm_struct's, and therefore allows us to bypass
> the first problem above and reduce the CPU overhead by ~80% for our
> workloads (90%+ mmaped memory). This series solves the second problem:
> by supporting locklessly clearing the accessed bits in SPTEs, it would
> reduce our current KVM MMU lock contention by >80% [3]. All other
> existing mechanisms, e.g., Idle Page Tracking, DAMON, etc., can also
> seamlessly benefit from this series when monitoring/tracking VM
> memory.
>
> [1] https://lwn.net/Articles/787611/
> [2] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
> [3] https://research.google/pubs/profiling-a-warehouse-scale-computer/

And we also ran an A/B experiment on quarter million Chromebooks
running Android in VMs last year (with an older version of this
series):

It reduced PSI by 10% at the 99th percentile and "janks" by 8% at the
95th percentile, which resulted in an overall improvement in user
engagement by 16% at the 75th percentile (all statistically
significant).

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2025-02-03 22:46 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-05 18:43 [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
2024-11-05 18:43 ` [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions James Houghton
2025-01-10 22:15   ` Sean Christopherson
2025-01-27 19:50     ` James Houghton
2024-11-05 18:43 ` [PATCH v8 02/11] KVM: Add lockless memslot walk to KVM James Houghton
2025-01-10 22:26   ` Sean Christopherson
2025-01-27 19:51     ` James Houghton
2024-11-05 18:43 ` [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
2024-11-05 18:45   ` Yu Zhao
2025-01-10 22:34   ` Sean Christopherson
2025-01-27 19:51     ` James Houghton
2024-11-05 18:43 ` [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
2024-11-06  8:22   ` kernel test robot
2024-11-08  3:00     ` James Houghton
2024-11-08 22:45       ` Sean Christopherson
2024-11-11 14:45         ` James Houghton
2025-01-10 22:47   ` Sean Christopherson
2025-01-27 19:52     ` James Houghton
2025-01-27 19:57   ` James Houghton
2025-01-27 20:09     ` James Houghton
2024-11-05 18:43 ` [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn James Houghton
2024-11-05 18:46   ` Yu Zhao
2025-01-10 22:59   ` Sean Christopherson
2025-01-27 19:58     ` James Houghton
2024-11-05 18:43 ` [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
2024-11-05 18:49   ` Yu Zhao
2025-01-10 23:05   ` Sean Christopherson
2025-01-27 19:58     ` James Houghton
2024-11-05 18:43 ` [PATCH v8 07/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock James Houghton
2024-11-05 18:43 ` [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
2025-01-10 23:18   ` Sean Christopherson
2025-01-27 21:42     ` James Houghton
2025-01-27 21:52   ` James Houghton
2024-11-05 18:43 ` [PATCH v8 09/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs James Houghton
2024-11-05 18:43 ` [PATCH v8 10/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns James Houghton
2024-11-05 18:43 ` [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
2025-01-11  0:12   ` Sean Christopherson
2025-02-03 22:46     ` James Houghton
2025-01-11  0:21   ` Yosry Ahmed
2024-11-05 19:21 ` [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly Yu Zhao
2024-11-05 19:28   ` Yu Zhao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).