[PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging

* [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging
@ 2024-07-24  1:10 James Houghton
  2024-07-24  1:10 ` [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM James Houghton
                   ` (10 more replies)
  0 siblings, 11 replies; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

This patchset makes it possible for MGLRU to consult secondary MMUs
while doing aging, not just during eviction. This allows for more
accurate reclaim decisions, which is especially important for proactive
reclaim.

This series does the following:
 1. Improve locking for the existing test/clear_young notifiers for x86
    and arm64.
 2. Add a fast_only parameter to test_young() and clear_young(), and
    add helper functions for using the new parameter (e.g.
    mmu_notifier_clear_young_fast_only()). Non-trivially implement the
    fast-only test_young() and clear_young() for x86_64.
 3. Incorporate mmu_notifier_clear_young_fast_only() into MGLRU aging.
 4. Add an MGLRU mode (-l) to access_tracking_perf_test to show that
    aging is working properly.

Please note that mmu_notifier_test_young_fast_only() is added but not
used in this series. I am happy to remove it if that would be
appropriate.

The fast-only notifiers serve a particular purpose: for aging, we
neither want to delay other operations (e.g. unmapping for eviction)
nor do we want to be delayed by these other operations ourselves. By
default, the implementations of test_young() and clear_young() are meant
to be *accurate*, not fast. The fast-only notifiers will only give age
information that can be gathered fast.

The fast-only notifiers are non-trivially implemented for only x86_64
right now (as the KVM/x86 TDP MMU is the only secondary MMU that
supports lockless Accessed bit harvesting).
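
The accurate versus fast-only distinction described above can be
illustrated with a small userspace model. This is only a sketch under
stated assumptions: struct toy_mmu, toy_clear_young() and the
lock_acquisitions counter are hypothetical names invented for this
illustration and are not part of the series or of the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_mmu {
	bool supports_lockless;	/* assumed: Accessed bit readable without a lock */
	bool accessed;		/* stand-in for a single Accessed bit */
	int lock_acquisitions;	/* how often the slow path took the pretend lock */
};

/*
 * Returns true if the page was young and clears the bit. With fast_only
 * set, an MMU that cannot harvest the bit locklessly reports nothing
 * instead of taking the lock, so the bit is left untouched.
 */
static bool toy_clear_young(struct toy_mmu *mmu, bool fast_only)
{
	bool young;

	assert(mmu);

	if (!mmu->supports_lockless) {
		if (fast_only)
			return false;	/* no cheap age information available */
		mmu->lock_acquisitions++;	/* accurate path pays for the lock */
	}

	young = mmu->accessed;
	mmu->accessed = false;
	return young;
}
```

In this model a fast-only call against an MMU without lockless support
returns false and never touches the lock counter, while the accurate
call does both. That mirrors the trade-off stated above: aging may miss
some age information, but it does not delay, and is not delayed by,
other operations.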

To make aging work for more than just x86, the fast-only clear_young()
notifier must be non-trivially implemented by those other architectures
and HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY needs to be set.

access_tracking_perf_test now has a mode (-p) to check performance of
MGLRU aging while the VM is faulting memory in. See the v4 cover
letter[2] for performance data collected with this test.

Previous versions of this series included logic in MGLRU and KVM to
support batching the updates to secondary page tables. This version
removes this logic, as it was complex and not necessary to enable
proactive reclaim. This optimization, as well as enabling aging for
arm64 and powerpc, can be done in a later series.

=== Previous Versions ===

Since v5[1]:
 - Reworked test_clear_young_fast_only() into a new parameter for the
   existing notifiers (thanks Sean).
 - Added mmu_notifier.has_fast_aging to tell mm whether the fast-only
   notifiers should be called.
 - Added mm_has_fast_young_notifiers() to inform users if calling
   fast-only notifier helpers is worthwhile (for look-around to use).
 - Changed MGLRU to invoke a single notifier instead of two when
   aging and doing look-around (thanks Yu).
 - For KVM/x86, check indirect_shadow_pages > 0 instead of
   kvm_memslots_have_rmaps() when collecting age information
   (thanks Sean).
 - For KVM/arm, some fixes from Oliver.
 - Small fixes to access_tracking_perf_test.
 - Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().

Since v4[2]:
 - Removed Kconfig that controlled when aging was enabled. Aging will
   be done whenever the architecture supports it (thanks Yu).
 - Added a new MMU notifier, test_clear_young_fast_only(), specifically
   for MGLRU to use.
 - Add kvm_fast_{test_,}age_gfn, implemented by x86.
 - Fix locking for clear_flush_young().
 - Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
   (thanks Sean).
 - Fix WARN_ON and other cleanup for the arm64 locking changes
   (thanks Oliver).

Since v3[3]:
 - Vastly simplified the series (thanks David). Removed mmu notifier
   batching logic entirely.
 - Cleaned up how locking is done for mmu_notifier_test/clear_young
   (thanks David).
 - Look-around is now only done when there are no secondary MMUs
   subscribed to MMU notifiers.
 - CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
 - Fixed the lockless implementation of kvm_{test,}age_gfn for x86
   (thanks David).
 - Added MGLRU functional and performance tests to
   access_tracking_perf_test (thanks Axel).
 - In v3, an mm would be completely ignored (for aging) if there was a
   secondary MMU but support for secondary MMU walking was missing. Now,
   missing secondary MMU walking support simply skips the notifier
   calls (except for eviction).
 - Added a sanity check that range->lockless and range->on_lock are
   never both provided for the memslot walk.

For the changes since v2[4], see v3.

Based on latest kvm/next.

[1]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
[2]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
[3]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
[4]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/

James Houghton (11):
  KVM: Add lockless memslot walk to KVM
  KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER
  mm: Add fast_only bool to test_young and clear_young MMU notifiers
  mm: Add has_fast_aging to struct mmu_notifier
  KVM: Pass fast_only to kvm_{test_,}age_gfn
  KVM: x86: Optimize kvm_{test_,}age_gfn a little bit
  KVM: x86: Implement fast_only versions of kvm_{test_,}age_gfn
  mm: multi-gen LRU: Have secondary MMUs participate in aging
  KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/hyp/pgtable.c                  |  15 +-
 arch/arm64/kvm/mmu.c                          |  30 +-
 arch/x86/include/asm/kvm_host.h               |   1 +
 arch/x86/kvm/Kconfig                          |   2 +
 arch/x86/kvm/mmu/mmu.c                        |  23 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |  27 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  67 ++-
 include/linux/kvm_host.h                      |   2 +
 include/linux/mmu_notifier.h                  |  67 ++-
 include/linux/mmzone.h                        |   6 +-
 include/trace/events/kvm.h                    |  19 +-
 mm/damon/vaddr.c                              |   2 -
 mm/mmu_notifier.c                             |  38 +-
 mm/rmap.c                                     |   9 +-
 mm/vmscan.c                                   | 148 +++++--
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 369 +++++++++++++++--
 .../selftests/kvm/include/lru_gen_util.h      |  55 +++
 .../testing/selftests/kvm/lib/lru_gen_util.c  | 391 ++++++++++++++++++
 virt/kvm/Kconfig                              |   7 +
 virt/kvm/kvm_main.c                           |  73 ++--
 23 files changed, 1194 insertions(+), 165 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c


base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
-- 
2.46.0.rc1.232.g9752f9e123-goog




* [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-07-25 16:39   ` David Matlack
  2024-07-24  1:10 ` [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

Provide flexibility to the architecture to synchronize as optimally as
they can instead of always taking the MMU lock for writing.

Architectures that do their own locking must select
CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.

The immediate application is to allow architectures to implement the
test/clear_young MMU notifiers more cheaply.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
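For readers skimming the diff below, the locking decision it introduces
can be modelled in userspace roughly as follows. This is a simplified
sketch, not the kernel code: struct toy_kvm, struct toy_range,
toy_handle_range() and toy_age_slot() are hypothetical stand-ins, a NULL
on_lock plays the role of kvm_null_fn, and the pretend MMU lock is just
a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; only the control flow mirrors the patch. */
struct toy_kvm {
	int lock_depth;		/* 1 while the pretend MMU lock is held */
	int lock_acquisitions;	/* total number of times the lock was taken */
};

typedef bool (*toy_handler_t)(struct toy_kvm *kvm, int slot);
typedef void (*toy_on_lock_t)(struct toy_kvm *kvm);

struct toy_range {
	toy_handler_t handler;
	toy_on_lock_t on_lock;	/* NULL plays the role of kvm_null_fn */
	bool lockless;
};

/* Walk nr_slots pretend memslots, OR-ing together the handler results. */
static bool toy_handle_range(struct toy_kvm *kvm, const struct toy_range *range,
			     int nr_slots)
{
	bool found_memslot = false;
	bool ret = false;
	int slot;

	assert(range->handler != NULL);

	/* on_lock would never run for a lockless walk, so reject the pairing. */
	if (range->lockless && range->on_lock)
		return false;

	for (slot = 0; slot < nr_slots; slot++) {
		if (!found_memslot) {
			found_memslot = true;
			/* Take the lock once, on the first slot, unless lockless. */
			if (!range->lockless) {
				kvm->lock_depth++;
				kvm->lock_acquisitions++;
				if (range->on_lock)
					range->on_lock(kvm);
			}
		}
		ret |= range->handler(kvm, slot);
	}

	/* Unlock only if this walk actually took the lock. */
	if (found_memslot && !range->lockless)
		kvm->lock_depth--;

	return ret;
}

/* Example handler: reports "young" only for slot 1. */
static bool toy_age_slot(struct toy_kvm *kvm, int slot)
{
	(void)kvm;
	return slot == 1;
}
```

With lockless set, the handler runs for every slot without the lock
counter ever changing, which is the situation in which an architecture
is expected to do its own synchronization inside the handler.
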
 include/linux/kvm_host.h |  1 +
 virt/kvm/Kconfig         |  3 +++
 virt/kvm/kvm_main.c      | 26 +++++++++++++++++++-------
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 689e8be873a7..8cd80f969cff 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -266,6 +266,7 @@ struct kvm_gfn_range {
 	gfn_t end;
 	union kvm_mmu_notifier_arg arg;
 	bool may_block;
+	bool lockless;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index b14e14cdbfb9..632334861001 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,6 +100,9 @@ config KVM_GENERIC_MMU_NOTIFIER
        select MMU_NOTIFIER
        bool
 
+config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
+       bool
+
 config KVM_GENERIC_MEMORY_ATTRIBUTES
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..33f8997a5c29 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -555,6 +555,7 @@ struct kvm_mmu_notifier_range {
 	on_lock_fn_t on_lock;
 	bool flush_on_ret;
 	bool may_block;
+	bool lockless;
 };
 
 /*
@@ -609,6 +610,10 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 			 IS_KVM_NULL_FN(range->handler)))
 		return r;
 
+	/* on_lock will never be called for lockless walks */
+	if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
+		return r;
+
 	idx = srcu_read_lock(&kvm->srcu);
 
 	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
@@ -640,15 +645,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
+			gfn_range.lockless = range->lockless;
 
 			if (!r.found_memslot) {
 				r.found_memslot = true;
-				KVM_MMU_LOCK(kvm);
-				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm);
-
-				if (IS_KVM_NULL_FN(range->handler))
-					goto mmu_unlock;
+				if (!range->lockless) {
+					KVM_MMU_LOCK(kvm);
+					if (!IS_KVM_NULL_FN(range->on_lock))
+						range->on_lock(kvm);
+
+					if (IS_KVM_NULL_FN(range->handler))
+						goto mmu_unlock;
+				}
 			}
 			r.ret |= range->handler(kvm, &gfn_range);
 		}
@@ -658,7 +666,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 
 mmu_unlock:
-	if (r.found_memslot)
+	if (r.found_memslot && !range->lockless)
 		KVM_MMU_UNLOCK(kvm);
 
 	srcu_read_unlock(&kvm->srcu, idx);
@@ -679,6 +687,8 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= false,
+		.lockless	=
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
 	};
 
 	return __kvm_handle_hva_range(kvm, &range).ret;
@@ -697,6 +707,8 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= false,
+		.lockless	=
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
 	};
 
 	return __kvm_handle_hva_range(kvm, &range).ret;
-- 
2.46.0.rc1.232.g9752f9e123-goog




* Re: [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM
  2024-07-24  1:10 ` [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM James Houghton
@ 2024-07-25 16:39   ` David Matlack
  2024-07-26  0:28     ` James Houghton
  0 siblings, 1 reply; 40+ messages in thread
From: David Matlack @ 2024-07-25 16:39 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Rientjes, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On 2024-07-24 01:10 AM, James Houghton wrote:
> Provide flexibility to the architecture to synchronize as optimally as
> they can instead of always taking the MMU lock for writing.
> 
> Architectures that do their own locking must select
> CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.
> 
> The immediate application is to allow architectures to implement the
> test/clear_young MMU notifiers more cheaply.
> 
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>

Aside from the cleanup suggestion (which should be in separate patches
anyway):

Reviewed-by: David Matlack <dmatlack@google.com>

> ---
>  include/linux/kvm_host.h |  1 +
>  virt/kvm/Kconfig         |  3 +++
>  virt/kvm/kvm_main.c      | 26 +++++++++++++++++++-------
>  3 files changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 689e8be873a7..8cd80f969cff 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -266,6 +266,7 @@ struct kvm_gfn_range {
>  	gfn_t end;
>  	union kvm_mmu_notifier_arg arg;
>  	bool may_block;
> +	bool lockless;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index b14e14cdbfb9..632334861001 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -100,6 +100,9 @@ config KVM_GENERIC_MMU_NOTIFIER
>         select MMU_NOTIFIER
>         bool
>  
> +config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> +       bool
> +
>  config KVM_GENERIC_MEMORY_ATTRIBUTES
>         depends on KVM_GENERIC_MMU_NOTIFIER
>         bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d0788d0a72cc..33f8997a5c29 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -555,6 +555,7 @@ struct kvm_mmu_notifier_range {
>  	on_lock_fn_t on_lock;
>  	bool flush_on_ret;
>  	bool may_block;
> +	bool lockless;
>  };
>  
>  /*
> @@ -609,6 +610,10 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  			 IS_KVM_NULL_FN(range->handler)))
>  		return r;
>  
> +	/* on_lock will never be called for lockless walks */
> +	if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
> +		return r;
> +
>  	idx = srcu_read_lock(&kvm->srcu);
>  
>  	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> @@ -640,15 +645,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
>  			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
>  			gfn_range.slot = slot;
> +			gfn_range.lockless = range->lockless;
>  
>  			if (!r.found_memslot) {
>  				r.found_memslot = true;
> -				KVM_MMU_LOCK(kvm);
> -				if (!IS_KVM_NULL_FN(range->on_lock))
> -					range->on_lock(kvm);
> -
> -				if (IS_KVM_NULL_FN(range->handler))
> -					goto mmu_unlock;
> +				if (!range->lockless) {
> +					KVM_MMU_LOCK(kvm);
> +					if (!IS_KVM_NULL_FN(range->on_lock))
> +						range->on_lock(kvm);
> +
> +					if (IS_KVM_NULL_FN(range->handler))
> +						goto mmu_unlock;
> +				}
>  			}
>  			r.ret |= range->handler(kvm, &gfn_range);
>  		}
> @@ -658,7 +666,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  		kvm_flush_remote_tlbs(kvm);
>  
>  mmu_unlock:
> -	if (r.found_memslot)
> +	if (r.found_memslot && !range->lockless)
>  		KVM_MMU_UNLOCK(kvm);
>  
>  	srcu_read_unlock(&kvm->srcu, idx);
> @@ -679,6 +687,8 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>  		.on_lock	= (void *)kvm_null_fn,
>  		.flush_on_ret	= true,
>  		.may_block	= false,
> +		.lockless	=
> +			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
>  	};
>  
>  	return __kvm_handle_hva_range(kvm, &range).ret;
> @@ -697,6 +707,8 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>  		.on_lock	= (void *)kvm_null_fn,
>  		.flush_on_ret	= false,
>  		.may_block	= false,
> +		.lockless	=
> +			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),

kvm_handle_hva_range{,_no_flush}() have very generic names but
they're intimately tied to the "young" notifiers. Whereas
__kvm_handle_hva_range() is the truly generic handler function.

This is arguably a pre-existing issue, but adding
CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS makes these functions even more
intimately tied to the "young" notifiers.

We could rename kvm_handle_hva_range{,_no_flush}() but I think the
cleanest thing to do might be to just drop them entirely and move their
contents into their callers (there are only 3 callers of these 2
functions). That will create a little duplication but IMO will make the
code easier to read.

And then we can also rename __kvm_handle_hva_range() to
kvm_handle_hva_range().

e.g. Something like this as the end result:


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 86fb2b560d98..0146c83e24bd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -590,8 +590,8 @@ static void kvm_null_fn(void)
 	     node;							     \
 	     node = interval_tree_iter_next(node, start, last))	     \
 
-static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
-							   const struct kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
+							 const struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_mmu_notifier_return r = {
 		.ret = false,
@@ -674,48 +674,6 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	return r;
 }
 
-static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
-						unsigned long start,
-						unsigned long end,
-						gfn_handler_t handler)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_mmu_notifier_range range = {
-		.start		= start,
-		.end		= end,
-		.handler	= handler,
-		.on_lock	= (void *)kvm_null_fn,
-		.flush_on_ret	= true,
-		.may_block	= false,
-		.lockless	=
-			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
-	};
-
-	return __kvm_handle_hva_range(kvm, &range).ret;
-}
-
-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
-							 unsigned long start,
-							 unsigned long end,
-							 gfn_handler_t handler,
-							 bool fast_only)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_mmu_notifier_range range = {
-		.start			= start,
-		.end			= end,
-		.handler		= handler,
-		.on_lock		= (void *)kvm_null_fn,
-		.flush_on_ret		= false,
-		.may_block		= false,
-		.lockless		=
-			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
-		.arg.fast_only		= fast_only,
-	};
-
-	return __kvm_handle_hva_range(kvm, &range).ret;
-}
-
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
 	lockdep_assert_held_write(&kvm->mmu_lock);
@@ -808,7 +766,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * that guest memory has been reclaimed.  This needs to be done *after*
 	 * dropping mmu_lock, as x86's reclaim path is slooooow.
 	 */
-	if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
+	if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
 		kvm_arch_guest_memory_reclaimed(kvm);
 
 	return 0;
@@ -854,7 +812,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	};
 	bool wake;
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	kvm_handle_hva_range(kvm, &hva_range);
 
 	/* Pairs with the increment in range_start(). */
 	spin_lock(&kvm->mn_invalidate_lock);
@@ -876,6 +834,17 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 					      unsigned long start,
 					      unsigned long end)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start		= start,
+		.end		= end,
+		.handler	= kvm_age_gfn,
+		.on_lock	= (void *)kvm_null_fn,
+		.flush_on_ret	= true,
+		.may_block	= false,
+		.lockless	= IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
+	};
+
 	trace_kvm_age_hva(start, end, false);
 
 	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
@@ -887,6 +856,18 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					unsigned long end,
 					bool fast_only)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start		= start,
+		.end		= end,
+		.handler	= kvm_age_gfn,
+		.on_lock	= (void *)kvm_null_fn,
+		.flush_on_ret	= false,
+		.may_block	= false,
+		.lockless	= IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
+		.arg.fast_only	= fast_only,
+	};
+
 	trace_kvm_age_hva(start, end, fast_only);
 
 	/*
@@ -902,8 +883,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
-	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn,
-					     fast_only);
+	return kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -911,6 +891,18 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       unsigned long address,
 				       bool fast_only)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start		= address,
+		.end		= address + 1,
+		.handler	= kvm_test_age_gfn,
+		.on_lock	= (void *)kvm_null_fn,
+		.flush_on_ret	= false,
+		.may_block	= false,
+		.lockless	= IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
+		.arg.fast_only	= fast_only,
+	};
+
 	trace_kvm_test_age_hva(address, fast_only);
 
 	return kvm_handle_hva_range_no_flush(mn, address, address + 1,

>  	};
>  
>  	return __kvm_handle_hva_range(kvm, &range).ret;
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 



* Re: [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM
  2024-07-25 16:39   ` David Matlack
@ 2024-07-26  0:28     ` James Houghton
  0 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-26  0:28 UTC (permalink / raw)
  To: David Matlack
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Rientjes, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On Thu, Jul 25, 2024 at 9:39 AM David Matlack <dmatlack@google.com> wrote:
>
> On 2024-07-24 01:10 AM, James Houghton wrote:
> > Provide flexibility to the architecture to synchronize as optimally as
> > they can instead of always taking the MMU lock for writing.
> >
> > Architectures that do their own locking must select
> > CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.
> >
> > The immediate application is to allow architectures to implement the
> > test/clear_young MMU notifiers more cheaply.
> >
> > Suggested-by: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
>
> Aside from the cleanup suggestion (which should be in separate patches
> anyway):
>
> Reviewed-by: David Matlack <dmatlack@google.com>

Thanks David!

>
> > ---
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/Kconfig         |  3 +++
> >  virt/kvm/kvm_main.c      | 26 +++++++++++++++++++-------
> >  3 files changed, 23 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 689e8be873a7..8cd80f969cff 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -266,6 +266,7 @@ struct kvm_gfn_range {
> >       gfn_t end;
> >       union kvm_mmu_notifier_arg arg;
> >       bool may_block;
> > +     bool lockless;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index b14e14cdbfb9..632334861001 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -100,6 +100,9 @@ config KVM_GENERIC_MMU_NOTIFIER
> >         select MMU_NOTIFIER
> >         bool
> >
> > +config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> > +       bool
> > +
> >  config KVM_GENERIC_MEMORY_ATTRIBUTES
> >         depends on KVM_GENERIC_MMU_NOTIFIER
> >         bool
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d0788d0a72cc..33f8997a5c29 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -555,6 +555,7 @@ struct kvm_mmu_notifier_range {
> >       on_lock_fn_t on_lock;
> >       bool flush_on_ret;
> >       bool may_block;
> > +     bool lockless;
> >  };
> >
> >  /*
> > @@ -609,6 +610,10 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> >                        IS_KVM_NULL_FN(range->handler)))
> >               return r;
> >
> > +     /* on_lock will never be called for lockless walks */
> > +     if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
> > +             return r;
> > +
> >       idx = srcu_read_lock(&kvm->srcu);
> >
> >       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> > @@ -640,15 +645,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> >                       gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
> >                       gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
> >                       gfn_range.slot = slot;
> > +                     gfn_range.lockless = range->lockless;
> >
> >                       if (!r.found_memslot) {
> >                               r.found_memslot = true;
> > -                             KVM_MMU_LOCK(kvm);
> > -                             if (!IS_KVM_NULL_FN(range->on_lock))
> > -                                     range->on_lock(kvm);
> > -
> > -                             if (IS_KVM_NULL_FN(range->handler))
> > -                                     goto mmu_unlock;
> > +                             if (!range->lockless) {
> > +                                     KVM_MMU_LOCK(kvm);
> > +                                     if (!IS_KVM_NULL_FN(range->on_lock))
> > +                                             range->on_lock(kvm);
> > +
> > +                                     if (IS_KVM_NULL_FN(range->handler))
> > +                                             goto mmu_unlock;
> > +                             }
> >                       }
> >                       r.ret |= range->handler(kvm, &gfn_range);
> >               }
> > @@ -658,7 +666,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> >               kvm_flush_remote_tlbs(kvm);
> >
> >  mmu_unlock:
> > -     if (r.found_memslot)
> > +     if (r.found_memslot && !range->lockless)
> >               KVM_MMU_UNLOCK(kvm);
> >
> >       srcu_read_unlock(&kvm->srcu, idx);
> > @@ -679,6 +687,8 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> >               .on_lock        = (void *)kvm_null_fn,
> >               .flush_on_ret   = true,
> >               .may_block      = false,
> > +             .lockless       =
> > +                     IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
> >       };
> >
> >       return __kvm_handle_hva_range(kvm, &range).ret;
> > @@ -697,6 +707,8 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
> >               .on_lock        = (void *)kvm_null_fn,
> >               .flush_on_ret   = false,
> >               .may_block      = false,
> > +             .lockless       =
> > +                     IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
>
> kvm_handle_hva_range{,_no_flush}() have very generic names but
> they're intimately tied to the "young" notifiers. Whereas
> __kvm_handle_hva_range() is the truly generic handler function.
>
> This is arguably a pre-existing issue, but adding
> CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS makes these functions even more
> intimately tied to the "young" notifiers.
>
> We could rename kvm_handle_hva_range{,_no_flush}() but I think the
> cleanest thing to do might be to just drop them entirely and move their
> contents into their callers (there are only 3 callers of these 2
> functions). That will create a little duplication but IMO will make the
> code easier to read.
>
> And then we can also rename __kvm_handle_hva_range() to
> kvm_handle_hva_range().

Thanks for the suggestion, I think this is a good idea. I'm curious
how others feel, as this indeed does duplicate the code some. Perhaps
it is better just to rename kvm_handle_hva_range() to
kvm_handle_hva_range_age() or something like that, and something
similar for _no_flush(). :/

But yeah I think it's fine to just do the manipulation you're
suggesting. I'll include it in v7 unless others say not to.



* [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
  2024-07-24  1:10 ` [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-07-25 18:07   ` David Matlack
  2024-08-17  1:05   ` Sean Christopherson
  2024-07-24  1:10 ` [PATCH v6 03/11] KVM: arm64: " James Houghton
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

Walk the TDP MMU in an RCU read-side critical section. This requires a
way to do RCU-safe walking of the tdp_mmu_roots; do this with a new
macro. The PTE modifications are now done atomically, and
kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the
fact that kvm_age_gfn can now locklessly update the accessed bit and the
R/X bits.

If the cmpxchg for marking the spte for access tracking fails, we simply
retry if the spte is still a leaf PTE. If it isn't, we return false
to continue the walk.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 10 ++++-
 arch/x86/kvm/mmu/tdp_iter.h     | 27 +++++++------
 arch/x86/kvm/mmu/tdp_mmu.c      | 67 +++++++++++++++++++++++++--------
 5 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..096988262005 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1456,6 +1456,7 @@ struct kvm_arch {
 	 * tdp_mmu_page set.
 	 *
 	 * For reads, this list is protected by:
+	 *	RCU alone or
 	 *	the MMU lock in read mode + RCU or
 	 *	the MMU lock in write mode
 	 *
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 4287a8071a3a..6ac43074c5e9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -23,6 +23,7 @@ config KVM
 	depends on X86_LOCAL_APIC
 	select KVM_COMMON
 	select KVM_GENERIC_MMU_NOTIFIER
+	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..7b93ce8f0680 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1633,8 +1633,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1646,8 +1649,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 2880fd392e0c..510936a8455a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
 	return xchg(rcu_dereference(sptep), new_spte);
 }
 
+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+	return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
 static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 {
 	KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
@@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 }
 
 /*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
+ * cleared by age_gfn_range.
  *
  * Note, non-leaf SPTEs do have Accessed bits and those bits are
  * technically volatile, but KVM doesn't consume the Accessed bit of
@@ -46,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
 {
 	return is_shadow_present_pte(old_spte) &&
-	       is_last_spte(old_spte, level) &&
-	       spte_has_volatile_bits(old_spte);
+	       is_last_spte(old_spte, level);
 }
 
 static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
@@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
 static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
 					  u64 mask, int level)
 {
-	atomic64_t *sptep_atomic;
-
-	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
-		sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
-		return (u64)atomic64_fetch_and(~mask, sptep_atomic);
-	}
+	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+		return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
 
 	__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
 	return old_spte;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c7dc49ee7388..3f13b2db53de 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
 
 	return true;
 }
+static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return true;
+}
 
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 {
@@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		     ((_only_valid) && (_root)->role.invalid))) {		\
 		} else
 
+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id)				\
+	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)		\
+		if (kvm_lockdep_assert_rcu_read_lock_held() &&			\
+		    (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) {	\
+		} else
+
 #define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
 	__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
 
@@ -1224,6 +1238,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 	return ret;
 }
 
+static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
+		struct kvm *kvm,
+		struct kvm_gfn_range *range,
+		tdp_handler_t handler)
+{
+	struct kvm_mmu_page *root;
+	struct tdp_iter iter;
+	bool ret = false;
+
+	rcu_read_lock();
+
+	for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
+		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+			ret |= handler(kvm, &iter, range);
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
  * if any of the GFNs in the range have been accessed.
@@ -1237,28 +1272,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 {
 	u64 new_spte;
 
+retry:
 	/* If we have a non-accessed entry we don't need to change the pte. */
 	if (!is_accessed_spte(iter->old_spte))
 		return false;
 
 	if (spte_ad_enabled(iter->old_spte)) {
-		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
-							 iter->old_spte,
-							 shadow_accessed_mask,
-							 iter->level);
+		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+						shadow_accessed_mask);
 		new_spte = iter->old_spte & ~shadow_accessed_mask;
 	} else {
-		/*
-		 * Capture the dirty status of the page, so that it doesn't get
-		 * lost when the SPTE is marked for access tracking.
-		 */
+		new_spte = mark_spte_for_access_track(iter->old_spte);
+		if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
+			/*
+			 * The cmpxchg failed. If the spte is still a
+			 * last-level spte, we can safely retry.
+			 */
+			if (is_shadow_present_pte(iter->old_spte) &&
+			    is_last_spte(iter->old_spte, iter->level))
+				goto retry;
+			/* Otherwise, continue walking. */
+			return false;
+		}
 		if (is_writable_pte(iter->old_spte))
 			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
-
-		new_spte = mark_spte_for_access_track(iter->old_spte);
-		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
-							iter->old_spte, new_spte,
-							iter->level);
 	}
 
 	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
@@ -1268,7 +1305,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 
 bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
+	return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, age_gfn_range);
 }
 
 static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
@@ -1279,7 +1316,7 @@ static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
 
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
+	return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, test_age_gfn);
 }
 
 /*
-- 
2.46.0.rc1.232.g9752f9e123-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-07-24  1:10 ` [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
@ 2024-07-25 18:07   ` David Matlack
  2024-07-26  0:34     ` James Houghton
  2024-08-17  1:05   ` Sean Christopherson
  1 sibling, 1 reply; 40+ messages in thread
From: David Matlack @ 2024-07-25 18:07 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Rientjes, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On 2024-07-24 01:10 AM, James Houghton wrote:
> Walk the TDP MMU in an RCU read-side critical section. This requires a
> way to do RCU-safe walking of the tdp_mmu_roots; do this with a new
> macro. The PTE modifications are now done atomically, and
> kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the
> fact that kvm_age_gfn can now locklessly update the accessed bit and the
> R/X bits.
> 
> If the cmpxchg for marking the spte for access tracking fails, we simply
> retry if the spte is still a leaf PTE. If it isn't, we return false
> to continue the walk.
> 
> Harvesting age information from the shadow MMU is still done while
> holding the MMU write lock.
> 
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>

Aside from the comment fixes below,

Reviewed-by: David Matlack <dmatlack@google.com>

> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/Kconfig            |  1 +
>  arch/x86/kvm/mmu/mmu.c          | 10 ++++-
>  arch/x86/kvm/mmu/tdp_iter.h     | 27 +++++++------
>  arch/x86/kvm/mmu/tdp_mmu.c      | 67 +++++++++++++++++++++++++--------
>  5 files changed, 77 insertions(+), 29 deletions(-)
> 
[...]
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
>  	return xchg(rcu_dereference(sptep), new_spte);
>  }
>  
> +static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
> +{
> +	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> +
> +	return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> +}
> +
>  static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  {
>  	KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
> @@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  }
>  
>  /*
> - * SPTEs must be modified atomically if they are shadow-present, leaf
> - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> - * handler, and Accessed and Dirty bits can be set by the CPU.
> + * SPTEs must be modified atomically if they have bits that can be set outside
> + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> + * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be

"R/X bits" should be "W/R/X bits".

> + * cleared by age_gfn_range.

nit: "age_gfn_range()"



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-07-25 18:07   ` David Matlack
@ 2024-07-26  0:34     ` James Houghton
  0 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-26  0:34 UTC (permalink / raw)
  To: David Matlack
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Rientjes, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On Thu, Jul 25, 2024 at 11:08 AM David Matlack <dmatlack@google.com> wrote:
>
> On 2024-07-24 01:10 AM, James Houghton wrote:
> > Walk the TDP MMU in an RCU read-side critical section. This requires a
> > way to do RCU-safe walking of the tdp_mmu_roots; do this with a new
> > macro. The PTE modifications are now done atomically, and
> > kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the
> > fact that kvm_age_gfn can now locklessly update the accessed bit and the
> > R/X bits.
> >
> > If the cmpxchg for marking the spte for access tracking fails, we simply
> > retry if the spte is still a leaf PTE. If it isn't, we return false
> > to continue the walk.
> >
> > Harvesting age information from the shadow MMU is still done while
> > holding the MMU write lock.
> >
> > Suggested-by: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
>
> Aside from the comment fixes below,
>
> Reviewed-by: David Matlack <dmatlack@google.com>

Thank you!

>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/Kconfig            |  1 +
> >  arch/x86/kvm/mmu/mmu.c          | 10 ++++-
> >  arch/x86/kvm/mmu/tdp_iter.h     | 27 +++++++------
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 67 +++++++++++++++++++++++++--------
> >  5 files changed, 77 insertions(+), 29 deletions(-)
> >
> [...]
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
> >       return xchg(rcu_dereference(sptep), new_spte);
> >  }
> >
> > +static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
> > +{
> > +     atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> > +
> > +     return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> > +}
> > +
> >  static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  {
> >       KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
> > @@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  }
> >
> >  /*
> > - * SPTEs must be modified atomically if they are shadow-present, leaf
> > - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> > - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> > - * handler, and Accessed and Dirty bits can be set by the CPU.
> > + * SPTEs must be modified atomically if they have bits that can be set outside
> > + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> > + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> > + * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
>
> "R/X bits" should be "W/R/X bits".

Thanks. Right, we are clearing all of VMX_EPT_RWX_MASK.

>
> > + * cleared by age_gfn_range.
>
> nit: "age_gfn_range()"

Thanks, will fix this and all the other places where I've left off the ().
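
For readers following the atomic bit-clearing discussed above, here is a
minimal userspace sketch using C11 atomics. It is an illustration only and
not the kernel code: demo_clear_bits_atomic() and DEMO_RWX_MASK are
hypothetical names, and the mask value is just a stand-in for a group of
permission bits. The point it shows is that a fetch-and clears the requested
bits in one atomic step and hands back the value that was there before, so
concurrent setters of other bits are not lost.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a group of R/W/X permission bits. */
#define DEMO_RWX_MASK UINT64_C(0x7)

/*
 * Atomically clear 'mask' in *entry and return the value observed
 * immediately before the clear (hypothetical helper, userspace only).
 */
static uint64_t demo_clear_bits_atomic(_Atomic uint64_t *entry, uint64_t mask)
{
	return atomic_fetch_and(entry, ~mask);
}
```

The caller can derive the new value as (old & ~mask) without re-reading the
entry, which mirrors how the returned old value is used in the patch.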


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-07-24  1:10 ` [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
  2024-07-25 18:07   ` David Matlack
@ 2024-08-17  1:05   ` Sean Christopherson
  2024-08-30  0:35     ` James Houghton
  1 sibling, 1 reply; 40+ messages in thread
From: Sean Christopherson @ 2024-08-17  1:05 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Wed, Jul 24, 2024, James Houghton wrote:
> Walk the TDP MMU in an RCU read-side critical section. 

...without holding mmu_lock, while doing xxx.  There are a lot of TDP MMU walks,
and they all need RCU protection.

> This requires a way to do RCU-safe walking of the tdp_mmu_roots; do this with
> a new macro. The PTE modifications are now done atomically, and
> kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the fact
> that kvm_age_gfn can now locklessly update the accessed bit and the R/X bits.
> 
> If the cmpxchg for marking the spte for access tracking fails, we simply
> retry if the spte is still a leaf PTE. If it isn't, we return false
> to continue the walk.

Please avoid pronouns.  E.g. s/we/KVM (and adjust grammar as needed), so that
it's clear what actor in particular is doing the retry.

> Harvesting age information from the shadow MMU is still done while
> holding the MMU write lock.
> 
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/Kconfig            |  1 +
>  arch/x86/kvm/mmu/mmu.c          | 10 ++++-
>  arch/x86/kvm/mmu/tdp_iter.h     | 27 +++++++------
>  arch/x86/kvm/mmu/tdp_mmu.c      | 67 +++++++++++++++++++++++++--------
>  5 files changed, 77 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 950a03e0181e..096988262005 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1456,6 +1456,7 @@ struct kvm_arch {
>  	 * tdp_mmu_page set.
>  	 *
>  	 * For reads, this list is protected by:
> +	 *	RCU alone or
>  	 *	the MMU lock in read mode + RCU or
>  	 *	the MMU lock in write mode
>  	 *
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 4287a8071a3a..6ac43074c5e9 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -23,6 +23,7 @@ config KVM
>  	depends on X86_LOCAL_APIC
>  	select KVM_COMMON
>  	select KVM_GENERIC_MMU_NOTIFIER
> +	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
>  	select HAVE_KVM_IRQCHIP
>  	select HAVE_KVM_PFNCACHE
>  	select HAVE_KVM_DIRTY_RING_TSO
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 901be9e420a4..7b93ce8f0680 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1633,8 +1633,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	bool young = false;
>  
> -	if (kvm_memslots_have_rmaps(kvm))
> +	if (kvm_memslots_have_rmaps(kvm)) {
> +		write_lock(&kvm->mmu_lock);
>  		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> +		write_unlock(&kvm->mmu_lock);
> +	}
>  
>  	if (tdp_mmu_enabled)
>  		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> @@ -1646,8 +1649,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	bool young = false;
>  
> -	if (kvm_memslots_have_rmaps(kvm))
> +	if (kvm_memslots_have_rmaps(kvm)) {
> +		write_lock(&kvm->mmu_lock);
>  		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> +		write_unlock(&kvm->mmu_lock);
> +	}
>  
>  	if (tdp_mmu_enabled)
>  		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 2880fd392e0c..510936a8455a 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
>  	return xchg(rcu_dereference(sptep), new_spte);
>  }
>  
> +static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
> +{
> +	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> +
> +	return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> +}
> +
>  static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  {
>  	KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
> @@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  }
>  
>  /*
> - * SPTEs must be modified atomically if they are shadow-present, leaf
> - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> - * handler, and Accessed and Dirty bits can be set by the CPU.
> + * SPTEs must be modified atomically if they have bits that can be set outside
> + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> + * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
> + * cleared by age_gfn_range.
>   *
>   * Note, non-leaf SPTEs do have Accessed bits and those bits are
>   * technically volatile, but KVM doesn't consume the Accessed bit of
> @@ -46,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
>  {
>  	return is_shadow_present_pte(old_spte) &&
> -	       is_last_spte(old_spte, level) &&
> -	       spte_has_volatile_bits(old_spte);
> +	       is_last_spte(old_spte, level);
>  }
>  
>  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> @@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
>  static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
>  					  u64 mask, int level)
>  {
> -	atomic64_t *sptep_atomic;
> -
> -	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
> -		sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> -		return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> -	}
> +	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
> +		return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
>  
>  	__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
>  	return old_spte;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c7dc49ee7388..3f13b2db53de 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
>  
>  	return true;
>  }
> +static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void)
> +{
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +	return true;
> +}

I doubt KVM needs a manual WARN, the RCU dereference stuff should yell loudly if
something is missing an rcu_read_lock().

>  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  {
> @@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>  		     ((_only_valid) && (_root)->role.invalid))) {		\
>  		} else
>  
> +/*
> + * Iterate over all TDP MMU roots in an RCU read-side critical section.
> + */
> +#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id)				\
> +	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)		\

This should just process valid roots:

https://lore.kernel.org/all/20240801183453.57199-7-seanjc@google.com

> +		if (kvm_lockdep_assert_rcu_read_lock_held() &&			\
> +		    (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) {	\
> +		} else
> +
>  #define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
>  	__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
>  
> @@ -1224,6 +1238,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
> +		struct kvm *kvm,
> +		struct kvm_gfn_range *range,
> +		tdp_handler_t handler)

Please burn all the Google3 from your brain, and code ;-)

> +	struct kvm_mmu_page *root;
> +	struct tdp_iter iter;
> +	bool ret = false;
> +
> +	rcu_read_lock();
> +
> +	for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
> +		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> +			ret |= handler(kvm, &iter, range);
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  /*
>   * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
>   * if any of the GFNs in the range have been accessed.
> @@ -1237,28 +1272,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
>  {
>  	u64 new_spte;
>  
> +retry:
>  	/* If we have a non-accessed entry we don't need to change the pte. */
>  	if (!is_accessed_spte(iter->old_spte))
>  		return false;
>  
>  	if (spte_ad_enabled(iter->old_spte)) {
> -		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> -							 iter->old_spte,
> -							 shadow_accessed_mask,
> -							 iter->level);
> +		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> +						shadow_accessed_mask);
>  		new_spte = iter->old_spte & ~shadow_accessed_mask;
>  	} else {
> -		/*
> -		 * Capture the dirty status of the page, so that it doesn't get
> -		 * lost when the SPTE is marked for access tracking.
> -		 */
> +		new_spte = mark_spte_for_access_track(iter->old_spte);
> +		if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
> +			/*
> +			 * The cmpxchg failed. If the spte is still a
> +			 * last-level spte, we can safely retry.
> +			 */
> +			if (is_shadow_present_pte(iter->old_spte) &&
> +			    is_last_spte(iter->old_spte, iter->level))
> +				goto retry;

Do we have a feel for how often conflicts actually happen?  I.e. is it worth
retrying and having to worry about infinite loops, however improbable they may
be?
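
One way to sidestep the infinite-loop concern is to bound the retries. The
following is a rough userspace sketch with C11 atomics, not a proposal for
the actual KVM code: demo_clear_accessed_bounded(), DEMO_ACCESSED_BIT and
DEMO_MAX_RETRIES are all hypothetical names and values chosen for
illustration. It retries a compare-exchange a fixed number of times and then
gives up, treating the entry as not aged.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ACCESSED_BIT (UINT64_C(1) << 8)	/* hypothetical bit position */
#define DEMO_MAX_RETRIES  3			/* hypothetical retry bound */

/*
 * Try to clear DEMO_ACCESSED_BIT in *entry with compare-exchange.
 * Returns true if this call cleared the bit, false if the bit was
 * already clear or the retry budget ran out.
 */
static bool demo_clear_accessed_bounded(_Atomic uint64_t *entry)
{
	uint64_t old = atomic_load(entry);

	for (int attempt = 0; attempt < DEMO_MAX_RETRIES; attempt++) {
		if (!(old & DEMO_ACCESSED_BIT))
			return false;	/* nothing to clear */

		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_strong(entry, &old,
						   old & ~DEMO_ACCESSED_BIT))
			return true;
	}

	return false;	/* gave up; caller treats the entry as not aged */
}
```

Giving up only costs accuracy for that one entry on this pass, which may be
an acceptable trade if conflicts turn out to be rare.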


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-17  1:05   ` Sean Christopherson
@ 2024-08-30  0:35     ` James Houghton
  2024-08-30  3:47       ` Sean Christopherson
  0 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-08-30  0:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Fri, Aug 16, 2024 at 6:05 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jul 24, 2024, James Houghton wrote:
> > Walk the TDP MMU in an RCU read-side critical section.
>
> ...without holding mmu_lock, while doing xxx.  There are a lot of TDP MMU walks,
> and they all need RCU protection.

Added "without holding mmu_lock when harvesting and potentially
updating age information on sptes".

> > This requires a way to do RCU-safe walking of the tdp_mmu_roots; do this with
> > a new macro. The PTE modifications are now done atomically, and
> > kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the fact
> > that kvm_age_gfn can now locklessly update the accessed bit and the R/X bits.
> >
> > If the cmpxchg for marking the spte for access tracking fails, we simply
> > retry if the spte is still a leaf PTE. If it isn't, we return false
> > to continue the walk.
>
> Please avoid pronouns.  E.g. s/we/KVM (and adjust grammar as needed), so that
> it's clear what actor in particular is doing the retry.

Fixed. Though, I have also changed this to reflect the change in the
retry logic I've made, given your other comment.

> > Harvesting age information from the shadow MMU is still done while
> > holding the MMU write lock.
> >
> > Suggested-by: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/Kconfig            |  1 +
> >  arch/x86/kvm/mmu/mmu.c          | 10 ++++-
> >  arch/x86/kvm/mmu/tdp_iter.h     | 27 +++++++------
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 67 +++++++++++++++++++++++++--------
> >  5 files changed, 77 insertions(+), 29 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 950a03e0181e..096988262005 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1456,6 +1456,7 @@ struct kvm_arch {
> >        * tdp_mmu_page set.
> >        *
> >        * For reads, this list is protected by:
> > +      *      RCU alone or
> >        *      the MMU lock in read mode + RCU or
> >        *      the MMU lock in write mode
> >        *
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index 4287a8071a3a..6ac43074c5e9 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -23,6 +23,7 @@ config KVM
> >       depends on X86_LOCAL_APIC
> >       select KVM_COMMON
> >       select KVM_GENERIC_MMU_NOTIFIER
> > +     select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> >       select HAVE_KVM_IRQCHIP
> >       select HAVE_KVM_PFNCACHE
> >       select HAVE_KVM_DIRTY_RING_TSO
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 901be9e420a4..7b93ce8f0680 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1633,8 +1633,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >       bool young = false;
> >
> > -     if (kvm_memslots_have_rmaps(kvm))
> > +     if (kvm_memslots_have_rmaps(kvm)) {
> > +             write_lock(&kvm->mmu_lock);
> >               young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > +             write_unlock(&kvm->mmu_lock);
> > +     }
> >
> >       if (tdp_mmu_enabled)
> >               young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> > @@ -1646,8 +1649,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >       bool young = false;
> >
> > -     if (kvm_memslots_have_rmaps(kvm))
> > +     if (kvm_memslots_have_rmaps(kvm)) {
> > +             write_lock(&kvm->mmu_lock);
> >               young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> > +             write_unlock(&kvm->mmu_lock);
> > +     }
> >
> >       if (tdp_mmu_enabled)
> >               young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index 2880fd392e0c..510936a8455a 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
> >       return xchg(rcu_dereference(sptep), new_spte);
> >  }
> >
> > +static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
> > +{
> > +     atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> > +
> > +     return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> > +}
> > +
> >  static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  {
> >       KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
> > @@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  }
> >
> >  /*
> > - * SPTEs must be modified atomically if they are shadow-present, leaf
> > - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> > - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> > - * handler, and Accessed and Dirty bits can be set by the CPU.
> > + * SPTEs must be modified atomically if they have bits that can be set outside
> > + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> > + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> > + * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
> > + * cleared by age_gfn_range.
> >   *
> >   * Note, non-leaf SPTEs do have Accessed bits and those bits are
> >   * technically volatile, but KVM doesn't consume the Accessed bit of
> > @@ -46,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> >  static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
> >  {
> >       return is_shadow_present_pte(old_spte) &&
> > -            is_last_spte(old_spte, level) &&
> > -            spte_has_volatile_bits(old_spte);
> > +            is_last_spte(old_spte, level);
> >  }
> >
> >  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> > @@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> >  static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
> >                                         u64 mask, int level)
> >  {
> > -     atomic64_t *sptep_atomic;
> > -
> > -     if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
> > -             sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
> > -             return (u64)atomic64_fetch_and(~mask, sptep_atomic);
> > -     }
> > +     if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
> > +             return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
> >
> >       __kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
> >       return old_spte;
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index c7dc49ee7388..3f13b2db53de 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
> >
> >       return true;
> >  }
> > +static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void)
> > +{
> > +     WARN_ON_ONCE(!rcu_read_lock_held());
> > +     return true;
> > +}
>
> I doubt KVM needs a manual WARN, the RCU dereference stuff should yell loudly if
> something is missing an rcu_read_lock().

You're right -- removed.

> >  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> >  {
> > @@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> >                    ((_only_valid) && (_root)->role.invalid))) {               \
> >               } else
> >
> > +/*
> > + * Iterate over all TDP MMU roots in an RCU read-side critical section.
> > + */
> > +#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id)                               \
> > +     list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)         \
>
> This should just process valid roots:
>
> https://lore.kernel.org/all/20240801183453.57199-7-seanjc@google.com

Thanks! I've added `|| (_root)->role.invalid)` to the below
conditional expression, and I've renamed the macro to
for_each_valid_tdp_mmu_root_rcu.
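
For anyone unfamiliar with the filtering idiom these iterator macros use,
here is a minimal userspace sketch. It is not the KVM code: `struct root`,
`for_each_valid_root` and `count_valid_roots` are hypothetical names, and a
plain array stands in for the RCU-protected list. The empty
`if (...) { } else` lets the caller's loop body attach to the else branch,
so entries that fail the filter are skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a root; only the fields the filter needs. */
struct root {
	int as_id;
	bool invalid;
};

/*
 * Visit each element of an array, skipping entries that are invalid or
 * whose address-space ID does not match. A negative _as_id matches all IDs.
 */
#define for_each_valid_root(_r, _roots, _n, _as_id)			\
	for ((_r) = (_roots); (_r) < (_roots) + (_n); (_r)++)		\
		if (((_as_id) >= 0 && (_r)->as_id != (_as_id)) ||	\
		    (_r)->invalid) {					\
		} else

/* Count the entries the iterator actually visits. */
static size_t count_valid_roots(struct root *roots, size_t n, int as_id)
{
	struct root *r;
	size_t count = 0;

	assert(roots != NULL || n == 0);

	for_each_valid_root(r, roots, n, as_id)
		count++;

	return count;
}
```

With two roots in address space 0 (one of them invalid) and two valid roots
in address space 1, the count is 1 for as_id 0, 2 for as_id 1, and 3 when
as_id is negative.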

> > +             if (kvm_lockdep_assert_rcu_read_lock_held() &&                  \
> > +                 (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) {     \
> > +             } else
> > +
> >  #define for_each_tdp_mmu_root(_kvm, _root, _as_id)                   \
> >       __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
> >
> > @@ -1224,6 +1238,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
> >       return ret;
> >  }
> >
> > +static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
> > +             struct kvm *kvm,
> > +             struct kvm_gfn_range *range,
> > +             tdp_handler_t handler)
>
> Please burn all the Google3 from your brain, and code ;-)

I indented this way to avoid going past the 80 character limit. I've
adjusted it to be more like the other functions in this file.

Perhaps I should put `static __always_inline bool` on its own line?

>
> > +     struct kvm_mmu_page *root;
> > +     struct tdp_iter iter;
> > +     bool ret = false;
> > +
> > +     rcu_read_lock();
> > +
> > +     for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
> > +             tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> > +                     ret |= handler(kvm, &iter, range);
> > +     }
> > +
> > +     rcu_read_unlock();
> > +
> > +     return ret;
> > +}
> > +
> >  /*
> >   * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
> >   * if any of the GFNs in the range have been accessed.
> > @@ -1237,28 +1272,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
> >  {
> >       u64 new_spte;
> >
> > +retry:
> >       /* If we have a non-accessed entry we don't need to change the pte. */
> >       if (!is_accessed_spte(iter->old_spte))
> >               return false;
> >
> >       if (spte_ad_enabled(iter->old_spte)) {
> > -             iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> > -                                                      iter->old_spte,
> > -                                                      shadow_accessed_mask,
> > -                                                      iter->level);
> > +             iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> > +                                             shadow_accessed_mask);
> >               new_spte = iter->old_spte & ~shadow_accessed_mask;
> >       } else {
> > -             /*
> > -              * Capture the dirty status of the page, so that it doesn't get
> > -              * lost when the SPTE is marked for access tracking.
> > -              */
> > +             new_spte = mark_spte_for_access_track(iter->old_spte);
> > +             if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
> > +                     /*
> > +                      * The cmpxchg failed. If the spte is still a
> > +                      * last-level spte, we can safely retry.
> > +                      */
> > +                     if (is_shadow_present_pte(iter->old_spte) &&
> > +                         is_last_spte(iter->old_spte, iter->level))
> > +                             goto retry;
>
> Do we have a feel for how often conflicts actually happen?  I.e. is it worth
> retrying and having to worry about infinite loops, however improbable they may
> be?

I'm not sure how common this is. I think it's probably better not to
retry actually. If the cmpxchg fails, this spte is probably young
anyway, so I can just `return true` instead of potentially retrying.
This is all best-effort anyway.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-30  0:35     ` James Houghton
@ 2024-08-30  3:47       ` Sean Christopherson
  2024-08-30 12:47         ` Jason Gunthorpe
  0 siblings, 1 reply; 40+ messages in thread
From: Sean Christopherson @ 2024-08-30  3:47 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Thu, Aug 29, 2024, James Houghton wrote:
> On Fri, Aug 16, 2024 at 6:05 PM Sean Christopherson <seanjc@google.com> wrote:
> > > +static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
> > > +             struct kvm *kvm,
> > > +             struct kvm_gfn_range *range,
> > > +             tdp_handler_t handler)
> >
> > Please burn all the Google3 from your brain, and code ;-)
> 
> I indented this way to avoid going past the 80 character limit. I've
> adjusted it to be more like the other functions in this file.
> 
> Perhaps I should put `static __always_inline bool` on its own line?

Noooo. Do not wrap before the function name.  Linus has a nice explanation/rant
on this[1].

In this case, I'm pretty sure you can avoid the helper and simply handle all aging
paths in a single API, e.g. similar to what I proposed for the shadow MMU[2].

[1] https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=GhjSY6nFmuB_wA8fYd5iLBNXjo9Bw@mail.gmail.com
[2] https://lore.kernel.org/all/20240809194335.1726916-16-seanjc@google.com

> > >  /*
> > >   * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
> > >   * if any of the GFNs in the range have been accessed.
> > > @@ -1237,28 +1272,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
> > >  {
> > >       u64 new_spte;
> > >
> > > +retry:
> > >       /* If we have a non-accessed entry we don't need to change the pte. */
> > >       if (!is_accessed_spte(iter->old_spte))
> > >               return false;
> > >
> > >       if (spte_ad_enabled(iter->old_spte)) {
> > > -             iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> > > -                                                      iter->old_spte,
> > > -                                                      shadow_accessed_mask,
> > > -                                                      iter->level);
> > > +             iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> > > +                                             shadow_accessed_mask);
> > >               new_spte = iter->old_spte & ~shadow_accessed_mask;
> > >       } else {
> > > -             /*
> > > -              * Capture the dirty status of the page, so that it doesn't get
> > > -              * lost when the SPTE is marked for access tracking.
> > > -              */
> > > +             new_spte = mark_spte_for_access_track(iter->old_spte);
> > > +             if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
> > > +                     /*
> > > +                      * The cmpxchg failed. If the spte is still a
> > > +                      * last-level spte, we can safely retry.
> > > +                      */
> > > +                     if (is_shadow_present_pte(iter->old_spte) &&
> > > +                         is_last_spte(iter->old_spte, iter->level))
> > > +                             goto retry;
> >
> > Do we have a feel for how often conflicts actually happen?  I.e. is it worth
> > retrying and having to worry about infinite loops, however improbable they may
> > be?
> 
> I'm not sure how common this is. I think it's probably better not to
> retry actually. If the cmpxchg fails, this spte is probably young
> anyway, so I can just `return true` instead of potentially retrying.
> This is all best-effort anyway.

+1


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-30  3:47       ` Sean Christopherson
@ 2024-08-30 12:47         ` Jason Gunthorpe
  2024-08-30 17:09           ` Sean Christopherson
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 12:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Andrew Morton, Paolo Bonzini, Ankit Agrawal,
	Axel Rasmussen, Catalin Marinas, David Matlack, David Rientjes,
	James Morse, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Thu, Aug 29, 2024 at 08:47:59PM -0700, Sean Christopherson wrote:
> On Thu, Aug 29, 2024, James Houghton wrote:
> > On Fri, Aug 16, 2024 at 6:05 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > +static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
> > > > +             struct kvm *kvm,
> > > > +             struct kvm_gfn_range *range,
> > > > +             tdp_handler_t handler)
> > >
> > > Please burn all the Google3 from your brain, and code ;-)
> > 
> > I indented this way to avoid going past the 80 character limit. I've
> > adjusted it to be more like the other functions in this file.
> > 
> > Perhaps I should put `static __always_inline bool` on its own line?
> 
> Noooo. Do not wrap before the function name.  Linus has a nice explanation/rant
> on this[1].

IMHO, run clang-format on your stuff and just be happy with 99% of
what it spits out. Saves *so much time* and usually arguing..

clang-format will occasionally decide to wrap in the GNU way, if it
can put the arguments all on one line. People will never agree on
small details of style, but it would be really nice if we can at least
agree not to nitpick clang-format's decisions :) :)

Jason


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-30 12:47         ` Jason Gunthorpe
@ 2024-08-30 17:09           ` Sean Christopherson
  2024-08-30 20:22             ` Jason Gunthorpe
  0 siblings, 1 reply; 40+ messages in thread
From: Sean Christopherson @ 2024-08-30 17:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: James Houghton, Andrew Morton, Paolo Bonzini, Ankit Agrawal,
	Axel Rasmussen, Catalin Marinas, David Matlack, David Rientjes,
	James Morse, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Fri, Aug 30, 2024, Jason Gunthorpe wrote:
> On Thu, Aug 29, 2024 at 08:47:59PM -0700, Sean Christopherson wrote:
> > On Thu, Aug 29, 2024, James Houghton wrote:
> > > On Fri, Aug 16, 2024 at 6:05 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > +static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
> > > > > +             struct kvm *kvm,
> > > > > +             struct kvm_gfn_range *range,
> > > > > +             tdp_handler_t handler)
> > > >
> > > > Please burn all the Google3 from your brain, and code ;-)
> > > 
> > > I indented this way to avoid going past the 80 character limit. I've
> > > adjusted it to be more like the other functions in this file.
> > > 
> > > Perhaps I should put `static __always_inline bool` on its own line?
> > 
> > Noooo. Do not wrap before the function name.  Linus has a nice explanation/rant
> > on this[1].
> 
> IMHO, run clang-format on your stuff and just be happy with 99% of
> what it spits out. Saves *so much time* and usually arguing..

Heh, nope, not bending on this one.  The time I spend hunting for implementations
because of wraps before the function name far exceeds the time it takes me to
push back on these warts in review.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-30 17:09           ` Sean Christopherson
@ 2024-08-30 20:22             ` Jason Gunthorpe
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 20:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Andrew Morton, Paolo Bonzini, Ankit Agrawal,
	Axel Rasmussen, Catalin Marinas, David Matlack, David Rientjes,
	James Morse, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Fri, Aug 30, 2024 at 10:09:30AM -0700, Sean Christopherson wrote:
> On Fri, Aug 30, 2024, Jason Gunthorpe wrote:
> > On Thu, Aug 29, 2024 at 08:47:59PM -0700, Sean Christopherson wrote:
> > > On Thu, Aug 29, 2024, James Houghton wrote:
> > > > On Fri, Aug 16, 2024 at 6:05 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > +static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
> > > > > > +             struct kvm *kvm,
> > > > > > +             struct kvm_gfn_range *range,
> > > > > > +             tdp_handler_t handler)
> > > > >
> > > > > Please burn all the Google3 from your brain, and code ;-)
> > > > 
> > > > I indented this way to avoid going past the 80 character limit. I've
> > > > adjusted it to be more like the other functions in this file.
> > > > 
> > > > Perhaps I should put `static __always_inline bool` on its own line?
> > > 
> > > Noooo. Do not wrap before the function name.  Linus has a nice explanation/rant
> > > on this[1].
> > 
> > IMHO, run clang-format on your stuff and just be happy with 99% of
> > what it spits out. Saves *so much time* and usually arguing..
> 
> Heh, nope, not bending on this one.  The time I spend hunting for implementations
> because of wraps before the function name far exceeds the time it takes me to
> push back on these warts in review.

clangd solved that problem for me :)

Jason


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
  2024-07-24  1:10 ` [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM James Houghton
  2024-07-24  1:10 ` [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-07-25 21:55   ` James Houghton
  2024-07-24  1:10 ` [PATCH v6 04/11] mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER James Houghton
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

Replace the MMU write locks (taken in the memslot iteration loop) with
read locks.

Grabbing the read lock instead of the write lock is safe because the
only requirement we have is that the stage-2 page tables do not get
deallocated while we are walking them. The stage2_age_walker() callback
is safe to race with itself; update the comment to reflect the
synchronization change.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/arm64/kvm/Kconfig       |  1 +
 arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
 arch/arm64/kvm/mmu.c         | 30 ++++++++++++++++++++++--------
 3 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 58f09370d17e..7a1af8141c0e 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,6 +22,7 @@ menuconfig KVM
 	select KVM_COMMON
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_GENERIC_MMU_NOTIFIER
+	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select KVM_MMIO
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9e2bbee77491..a24a2a857456 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1319,10 +1319,10 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	data->young = true;
 
 	/*
-	 * stage2_age_walker() is always called while holding the MMU lock for
-	 * write, so this will always succeed. Nonetheless, this deliberately
-	 * follows the race detection pattern of the other stage-2 walkers in
-	 * case the locking mechanics of the MMU notifiers is ever changed.
+	 * This walk is not exclusive; the PTE is permitted to change from
+	 * under us. If there is a race to update this PTE, then the GFN is
+	 * most likely young, so failing to clear the AF is likely to be
+	 * inconsequential.
 	 */
 	if (data->mkold && !stage2_try_set_pte(ctx, new))
 		return -EAGAIN;
@@ -1345,10 +1345,13 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
 	struct kvm_pgtable_walker walker = {
 		.cb		= stage2_age_walker,
 		.arg		= &data,
-		.flags		= KVM_PGTABLE_WALK_LEAF,
+		.flags		= KVM_PGTABLE_WALK_LEAF |
+				  KVM_PGTABLE_WALK_SHARED,
 	};
+	int r;
 
-	WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
+	r = kvm_pgtable_walk(pgt, addr, size, &walker);
+	WARN_ON_ONCE(r && r != -EAGAIN);
 	return data.young;
 }
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6981b1bc0946..e37765f6f2a1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1912,29 +1912,43 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
+	bool young = false;
+
+	read_lock(&kvm->mmu_lock);
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		goto out;
 
-	return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
-						   range->start << PAGE_SHIFT,
-						   size, true);
+	young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+						    range->start << PAGE_SHIFT,
+						    size, true);
 	/*
 	 * TODO: Handle nested_mmu structures here using the reverse mapping in
 	 * a later version of patch series.
 	 */
+
+out:
+	read_unlock(&kvm->mmu_lock);
+	return young;
 }
 
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
+	bool young = false;
+
+	read_lock(&kvm->mmu_lock);
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		goto out;
 
-	return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
-						   range->start << PAGE_SHIFT,
-						   size, false);
+	young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+						    range->start << PAGE_SHIFT,
+						    size, false);
+
+out:
+	read_unlock(&kvm->mmu_lock);
+	return young;
 }
 
 phys_addr_t kvm_mmu_get_httbr(void)
-- 
2.46.0.rc1.232.g9752f9e123-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-07-24  1:10 ` [PATCH v6 03/11] KVM: arm64: " James Houghton
@ 2024-07-25 21:55   ` James Houghton
  2024-08-17  0:46     ` Sean Christopherson
  0 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-25 21:55 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta, Ryan Roberts,
	Sean Christopherson, Shaoqin Huang, Suzuki K Poulose, Wei Xu,
	Will Deacon, Yu Zhao, Zenghui Yu, kvmarm, kvm, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm

On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
>
> Replace the MMU write locks (taken in the memslot iteration loop) with
> read locks.
>
> Grabbing the read lock instead of the write lock is safe because the
> only requirement we have is that the stage-2 page tables do not get
> deallocated while we are walking them. The stage2_age_walker() callback
> is safe to race with itself; update the comment to reflect the
> synchronization change.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---

Here is some data to show that this patch at least *can* be helpful:

# arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
# The test is faulting memory in while doing aging as fast as possible.
# taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory -p -v 32 -m 3

# Write lock
vcpu wall time                : 3.039207157s
lru_gen avg pass duration     : 1.660541541s, (passes:2, total:3.321083083s)

# Read lock
vcpu wall time                : 3.010848445s
lru_gen avg pass duration     : 0.306623698s, (passes:11, total:3.372860688s)

Aging is able to run significantly faster, but vCPU runtime isn't
affected much (in this test).
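
To put numbers on that, the ratios can be worked out from the figures
above. This is just a sketch of the arithmetic; `speedup` and
`percent_change` are throwaway helper names, not part of the selftest.

```c
#include <assert.h>

/* How many times faster 'after' is than 'before' (both are durations). */
static double speedup(double before, double after)
{
	assert(after > 0.0);
	return before / after;
}

/* Relative change from 'before' to 'after', in percent. */
static double percent_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

speedup(1.660541541, 0.306623698) comes out at roughly 5.4, i.e. an aging
pass is about 5.4x faster with the read lock, while
percent_change(3.039207157, 3.010848445) is a bit under 1% in magnitude, so
vCPU wall time is essentially unchanged.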

It would be really nice to motivate this patch with a test that didn't
require patching the kernel... Oliver and Marc, please let me know if
you'd like to see more data. I'm also happy to simply drop this patch.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-07-25 21:55   ` James Houghton
@ 2024-08-17  0:46     ` Sean Christopherson
  2024-08-17  1:03       ` Yu Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Sean Christopherson @ 2024-08-17  0:46 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Thu, Jul 25, 2024, James Houghton wrote:
> On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Replace the MMU write locks (taken in the memslot iteration loop) with
> > read locks.
> >
> > Grabbing the read lock instead of the write lock is safe because the
> > only requirement we have is that the stage-2 page tables do not get
> > deallocated while we are walking them. The stage2_age_walker() callback
> > is safe to race with itself; update the comment to reflect the
> > synchronization change.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> 
> Here is some data to show that this patch at least *can* be helpful:
> 
> # arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
> # The test is faulting memory in while doing aging as fast as possible.
> # taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory
> -p -v 32 -m 3
> 
> # Write lock
> vcpu wall time                : 3.039207157s
> lru_gen avg pass duration     : 1.660541541s, (passes:2, total:3.321083083s)
> 
> # Read lock
> vcpu wall time                : 3.010848445s
> lru_gen avg pass duration     : 0.306623698s, (passes:11, total:3.372860688s)
> 
> Aging is able to run significantly faster, but vCPU runtime isn't
> affected much (in this test).

Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
be due to KVM arm64 taking mmap_lock for read when handling faults:

https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-17  0:46     ` Sean Christopherson
@ 2024-08-17  1:03       ` Yu Zhao
  2024-08-19 20:41         ` Oliver Upton
  0 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2024-08-17  1:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Andrew Morton, Paolo Bonzini, Ankit Agrawal,
	Axel Rasmussen, Catalin Marinas, David Matlack, David Rientjes,
	James Morse, Jason Gunthorpe, Jonathan Corbet, Marc Zyngier,
	Oliver Upton, Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Zenghui Yu, kvmarm, kvm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jul 25, 2024, James Houghton wrote:
> > On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
> > >
> > > Replace the MMU write locks (taken in the memslot iteration loop) with
> > > read locks.
> > >
> > > Grabbing the read lock instead of the write lock is safe because the
> > > only requirement we have is that the stage-2 page tables do not get
> > > deallocated while we are walking them. The stage2_age_walker() callback
> > > is safe to race with itself; update the comment to reflect the
> > > synchronization change.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> >
> > Here is some data to show that this patch at least *can* be helpful:
> >
> > # arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
> > # The test is faulting memory in while doing aging as fast as possible.
> > # taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory
> > -p -v 32 -m 3
> >
> > # Write lock
> > vcpu wall time                : 3.039207157s
> > lru_gen avg pass duration     : 1.660541541s, (passes:2, total:3.321083083s)
> >
> > # Read lock
> > vcpu wall time                : 3.010848445s
> > lru_gen avg pass duration     : 0.306623698s, (passes:11, total:3.372860688s)
> >
> > Aging is able to run significantly faster, but vCPU runtime isn't
> > affected much (in this test).
>
> Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
> be due to KVM arm64 taking mmap_lock for read when handling faults:
>
> https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com

For the above test, I don't think it's mmap_lock -- the reclaim path,
e.g., when zswapping guest memory, has two stages: aging (scanning
PTEs) and eviction (unmapping PTEs). Only testing the former isn't
realistic at all. IOW, for a r/w lock use case, only testing the read
lock path would be bad coverage.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-17  1:03       ` Yu Zhao
@ 2024-08-19 20:41         ` Oliver Upton
  2024-08-19 22:47           ` Sean Christopherson
  2024-08-30  0:33           ` James Houghton
  0 siblings, 2 replies; 40+ messages in thread
From: Oliver Upton @ 2024-08-19 20:41 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Sean Christopherson, James Houghton, Andrew Morton, Paolo Bonzini,
	Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Zenghui Yu, kvmarm, kvm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:

[...]

> > Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
> > be due to KVM arm64 taking mmap_lock for read when handling faults:
> >
> > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
> 
> For the above test, I don't think it's mmap_lock

Yeah, I don't think this is related to the mmap_lock.

James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
fault for an Access flag update. Even if he's on a machine w/o it,
Access flag faults are handled outside the mmap_lock.

Forcing SW management of the AF at stage-2 would be the best case for
demonstrating the locking improvement:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a24a2a857456..a640e8a8c6ea 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -669,8 +669,6 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 	 * happen to be running on a design that has unadvertised support for
 	 * HAFDBS. Here be dragons.
 	 */
-	if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
-		vtcr |= VTCR_EL2_HA;
 #endif /* CONFIG_ARM64_HW_AFDBM */

 	if (kvm_lpa2_is_enabled())

Changing the config option would work too, but I wasn't sure if
FEAT_HAFDBS on the primary MMU influenced MGLRU heuristics.

> -- the reclaim path,
> e.g., when zswapping guest memory, has two stages: aging (scanning
> PTEs) and eviction (unmapping PTEs). Only testing the former isn't
> realistic at all.

AIUI, the intention of this test data is to provide some justification
for why Marc + I should consider the locking change *outside* of any
MMU notifier changes. So from that POV, this is meant as a hacked
up microbenchmark and not meant to be realistic.

And really, the arm64 change has nothing to do with this series at
this point, which is disappointing. In the interest of moving this
feature along for both architectures, would you be able to help James
with:

 - Identifying a benchmark that you believe is realistic

 - Suggestions on how to run that benchmark on Google infrastructure

Asking since you had a setup / data earlier on when you were carrying
the series. Hopefully with supportive data we can get arm64 to opt-in
to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.

-- 
Thanks,
Oliver

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-19 20:41         ` Oliver Upton
@ 2024-08-19 22:47           ` Sean Christopherson
  2024-08-30  0:33           ` James Houghton
  1 sibling, 0 replies; 40+ messages in thread
From: Sean Christopherson @ 2024-08-19 22:47 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Yu Zhao, James Houghton, Andrew Morton, Paolo Bonzini,
	Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Zenghui Yu, kvmarm, kvm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Mon, Aug 19, 2024, Oliver Upton wrote:
> On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> > On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
> 
> [...]
> 
> > > Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
> > > be due to KVM arm64 taking mmap_lock for read when handling faults:
> > >
> > > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
> > 
> > For the above test, I don't think it's mmap_lock
> 
> Yeah, I don't think this is related to the mmap_lock.
> 
> James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
> fault for an Access flag update.

Huh, didn't know that was a thing on ARM.  Ooh, that lends even more credence to
my assertion that marking folios accessed in handle_access_fault() can go away[*].
I assume hardware-assisted updates means this code in handle_access_fault() will
no longer be hit, as KVM simply won't ever get access faults?  If so, I'll add
that info to the changelog.

	if (kvm_pte_valid(pte))
		kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));

[*] https://lore.kernel.org/all/20240726235234.228822-83-seanjc@google.com

> Even if he's on a machine w/o it, Access flag faults are handled outside the
> mmap_lock.

Oh, right, they go down handle_access_fault(), not user_mem_abort().

Reviewing late Friday afternoon, never a good idea ;-)


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-19 20:41         ` Oliver Upton
  2024-08-19 22:47           ` Sean Christopherson
@ 2024-08-30  0:33           ` James Houghton
  2024-08-30  0:48             ` Oliver Upton
  1 sibling, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-08-30  0:33 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Yu Zhao, Sean Christopherson, Andrew Morton, Paolo Bonzini,
	Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Zenghui Yu, kvmarm, kvm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> > On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> > > Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
> > > be due to KVM arm64 taking mmap_lock for read when handling faults:

I had no real expectation. I was hoping that maybe there could be a
vCPU runtime improvement, given that user_mem_abort() (being called
because we're continuously faulting memory in during this test) has to
take the KVM MMU lock for reading, and aging takes it for reading
(with this patch) vs. writing (without it). I think that's why aging is
a lot slower when using the write lock: it is waiting for the readers
to drop the lock, but I guess the delay on the *readers* due to the
pending writer seems to be pretty minimal.

> > >
> > > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
> >
> > For the above test, I don't think it's mmap_lock
>
> Yeah, I don't think this is related to the mmap_lock.
>
> James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
> fault for an Access flag update. Even if he's on a machine w/o it,
> Access flag faults are handled outside the mmap_lock.

Yeah I was running on Ampere Altra CPUs.

> Forcing SW management of the AF at stage-2 would be the best case for
> demonstrating the locking improvement:
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index a24a2a857456..a640e8a8c6ea 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -669,8 +669,6 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
>          * happen to be running on a design that has unadvertised support for
>          * HAFDBS. Here be dragons.
>          */
> -       if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
> -               vtcr |= VTCR_EL2_HA;
>  #endif /* CONFIG_ARM64_HW_AFDBM */
>
>         if (kvm_lpa2_is_enabled())

Thanks!

> Changing the config option would work too, but I wasn't sure if
> FEAT_HAFDBS on the primary MMU influenced MGLRU heuristics.

Indeed, disabling CONFIG_ARM64_HW_AFDBM will cause MGLRU not to do aging.

> > -- the reclaim path,
> > e.g., when zswapping guest memory, has two stages: aging (scanning
> > PTEs) and eviction (unmapping PTEs). Only testing the former isn't
> > realistic at all.
>
> AIUI, the intention of this test data is to provide some justification
> for why Marc + I should consider the locking change *outside* of any
> MMU notifier changes. So from that POV, this is meant as a hacked
> up microbenchmark and not meant to be realistic.
>
> And really, the arm64 change has nothing to do with this series at
> this point, which is disappointing. In the interest of moving this
> feature along for both architectures, would you be able to help James
> with:
>
>  - Identifying a benchmark that you believe is realistic
>
>  - Suggestions on how to run that benchmark on Google infrastructure
>
> Asking since you had a setup / data earlier on when you were carrying
> the series. Hopefully with supportive data we can get arm64 to opt-in
> to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.

I'll keep trying other approaches to get testing similar to what Yu
had; it is somewhat difficult for me to reproduce those tests (and it
really shouldn't be... sorry).

I think it makes most sense for me to drop the arm64 patch for now and
re-propose it (or something stronger) alongside enabling aging. Does
that sound ok?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-30  0:33           ` James Houghton
@ 2024-08-30  0:48             ` Oliver Upton
  2024-08-30 15:33               ` David Matlack
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Upton @ 2024-08-30  0:48 UTC (permalink / raw)
  To: James Houghton
  Cc: Yu Zhao, Sean Christopherson, Andrew Morton, Paolo Bonzini,
	Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Zenghui Yu, kvmarm, kvm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > Asking since you had a setup / data earlier on when you were carrying
> > the series. Hopefully with supportive data we can get arm64 to opt-in
> > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
> 
> I'll keep trying some other approaches I can take for getting similar
> testing that Yu had; it is somewhat difficult for me to reproduce
> those tests (and it really shouldn't be.... sorry).

No need to apologize. Getting good test hardware for arm64 is a complete
chore. Sure would love a functional workstation with cores from this
decade...

> I think it makes most sense for me to drop the arm64 patch for now and
> re-propose it (or something stronger) alongside enabling aging. Does
> that sound ok?

I'm a bit disappointed that we haven't gotten forward progress on the
arm64 patches, but I also recognize this is the direction of travel as
the x86 patches are shaping up.

So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
soon while the context is still fresh.

-- 
Thanks,
Oliver


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-30  0:48             ` Oliver Upton
@ 2024-08-30 15:33               ` David Matlack
  2024-08-30 17:38                 ` Oliver Upton
  0 siblings, 1 reply; 40+ messages in thread
From: David Matlack @ 2024-08-30 15:33 UTC (permalink / raw)
  To: Oliver Upton
  Cc: James Houghton, Yu Zhao, Sean Christopherson, Andrew Morton,
	Paolo Bonzini, Ankit Agrawal, Axel Rasmussen, Catalin Marinas,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Zenghui Yu, kvmarm, kvm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Thu, Aug 29, 2024 at 5:48 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> > On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > Asking since you had a setup / data earlier on when you were carrying
> > > the series. Hopefully with supportive data we can get arm64 to opt-in
> > > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
> >
> > I'll keep trying some other approaches I can take for getting similar
> > testing that Yu had; it is somewhat difficult for me to reproduce
> > those tests (and it really shouldn't be.... sorry).
>
> No need to apologize. Getting good test hardware for arm64 is a complete
> chore. Sure would love a functional workstation with cores from this
> decade...
>
> > I think it makes most sense for me to drop the arm64 patch for now and
> > re-propose it (or something stronger) alongside enabling aging. Does
> > that sound ok?
>
> I'm a bit disappointed that we haven't gotten forward progress on the
> arm64 patches, but I also recognize this is the direction of travel as
> the x86 patches are shaping up.
>
> So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
> soon while the context is still fresh.

Converting the aging notifiers to holding mmu_lock for read seems like
a pure win and minimal churn. Can we keep that patch in v7 (which
depends on the lockless notifier refactor, i.e. is not completely
stand-alone)? We can revisit enabling MGLRU on arm64 in a subsequent
series.
>
> --
> Thanks,
> Oliver


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  2024-08-30 15:33               ` David Matlack
@ 2024-08-30 17:38                 ` Oliver Upton
  0 siblings, 0 replies; 40+ messages in thread
From: Oliver Upton @ 2024-08-30 17:38 UTC (permalink / raw)
  To: David Matlack
  Cc: James Houghton, Yu Zhao, Sean Christopherson, Andrew Morton,
	Paolo Bonzini, Ankit Agrawal, Axel Rasmussen, Catalin Marinas,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Zenghui Yu, kvmarm, kvm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm

Hey David,

On Fri, Aug 30, 2024 at 08:33:59AM -0700, David Matlack wrote:
> On Thu, Aug 29, 2024 at 5:48 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> > > On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > Asking since you had a setup / data earlier on when you were carrying
> > > > the series. Hopefully with supportive data we can get arm64 to opt-in
> > > > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
> > >
> > > I'll keep trying some other approaches I can take for getting similar
> > > testing that Yu had; it is somewhat difficult for me to reproduce
> > > those tests (and it really shouldn't be.... sorry).
> >
> > No need to apologize. Getting good test hardware for arm64 is a complete
> > chore. Sure would love a functional workstation with cores from this
> > decade...
> >
> > > I think it makes most sense for me to drop the arm64 patch for now and
> > > re-propose it (or something stronger) alongside enabling aging. Does
> > > that sound ok?
> >
> > I'm a bit disappointed that we haven't gotten forward progress on the
> > arm64 patches, but I also recognize this is the direction of travel as
> > the x86 patches are shaping up.
> >
> > So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
> > soon while the context is still fresh.
> 
> Converting the aging notifiers to holding mmu_lock for read seems like
> a pure win and minimal churn. Can we keep that patch in v7 (which
> depends on the lockless notifier refactor, i.e. is not completely
> stand-alone)? We can revisit enabling MGLRU on arm64 in a subsequent
> series.

Even though the churn is minimal in LOC, locking changes are significant. If
one thing has become clear, it's that there are some strong opinions about
arm64 participating in MGLRU w/ the read lock. So it is almost guaranteed
that these read lock changes will eventually get thrown out in favor of an
RCU-protected walker.

Then we're stuck with potentially 3 flavors of locking in kernels that
people actually use, and dealing with breakage that only affects that
intermediate step is gonna be annoying.

-- 
Thanks,
Oliver


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v6 04/11] mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
                   ` (2 preceding siblings ...)
  2024-07-24  1:10 ` [PATCH v6 03/11] KVM: arm64: " James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-08-01  9:34   ` David Hildenbrand
  2024-07-24  1:10 ` [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers James Houghton
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

Remove the now unnecessary ifdef in mm/damon/vaddr.c as well.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/mmu_notifier.h | 7 +++++++
 mm/damon/vaddr.c             | 2 --
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d39ebb10caeb..e2dd57ca368b 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -606,6 +606,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long end)
+{
+	return 0;
+}
+
 static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 381559e4a1fa..a453d77565e6 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -351,11 +351,9 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 		set_huge_pte_at(mm, addr, pte, entry, psize);
 	}
 
-#ifdef CONFIG_MMU_NOTIFIER
 	if (mmu_notifier_clear_young(mm, addr,
 				     addr + huge_page_size(hstate_vma(vma))))
 		referenced = true;
-#endif /* CONFIG_MMU_NOTIFIER */
 
 	if (referenced)
 		folio_set_young(folio);
-- 
2.46.0.rc1.232.g9752f9e123-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 04/11] mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER
  2024-07-24  1:10 ` [PATCH v6 04/11] mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER James Houghton
@ 2024-08-01  9:34   ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-08-01  9:34 UTC (permalink / raw)
  To: James Houghton, Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta, Ryan Roberts,
	Sean Christopherson, Shaoqin Huang, Suzuki K Poulose, Wei Xu,
	Will Deacon, Yu Zhao, Zenghui Yu, kvmarm, kvm, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm

On 24.07.24 03:10, James Houghton wrote:
> Remove the now unnecessary ifdef in mm/damon/vaddr.c as well.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   include/linux/mmu_notifier.h | 7 +++++++
>   mm/damon/vaddr.c             | 2 --
>   2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d39ebb10caeb..e2dd57ca368b 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -606,6 +606,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
>   	return 0;
>   }
>   
> +static inline int mmu_notifier_clear_young(struct mm_struct *mm,
> +					   unsigned long start,
> +					   unsigned long end)
> +{
> +	return 0;
> +}
> +
>   static inline int mmu_notifier_test_young(struct mm_struct *mm,
>   					  unsigned long address)
>   {
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index 381559e4a1fa..a453d77565e6 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -351,11 +351,9 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
>   		set_huge_pte_at(mm, addr, pte, entry, psize);
>   	}
>   
> -#ifdef CONFIG_MMU_NOTIFIER
>   	if (mmu_notifier_clear_young(mm, addr,
>   				     addr + huge_page_size(hstate_vma(vma))))
>   		referenced = true;
> -#endif /* CONFIG_MMU_NOTIFIER */
>   
>   	if (referenced)
>   		folio_set_young(folio);

Acked-by: David Hildenbrand <david@redhat.com>
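
For reference, the stub pattern this patch relies on can be sketched in
userspace roughly as below. MODEL_HAVE_NOTIFIER and the model_* names
are hypothetical stand-ins (for CONFIG_MMU_NOTIFIER,
mmu_notifier_clear_young() and the DAMON caller); this is an
illustration of the idea, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* MODEL_HAVE_NOTIFIER is a hypothetical stand-in for CONFIG_MMU_NOTIFIER. */
#ifdef MODEL_HAVE_NOTIFIER
int model_notifier_clear_young(unsigned long start, unsigned long end);
#else
/* Stub: with no notifiers there is no secondary MMU, so nothing is young. */
static inline int model_notifier_clear_young(unsigned long start,
					     unsigned long end)
{
	(void)start;
	(void)end;
	return 0;
}
#endif

/*
 * The caller needs no #ifdef of its own: when the stub is in use, the
 * call always returns 0 and the branch is trivially dead.
 */
static inline bool model_mkold(bool primary_young, unsigned long addr,
			       unsigned long size)
{
	bool referenced = primary_young;

	assert(size > 0);
	if (model_notifier_clear_young(addr, addr + size))
		referenced = true;

	return referenced;
}
```

When the stub is compiled in, the result depends only on the primary
MMU state, which matches what the old #ifdef'd code did.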

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
                   ` (3 preceding siblings ...)
  2024-07-24  1:10 ` [PATCH v6 04/11] mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-08-01  9:36   ` David Hildenbrand
  2024-07-24  1:10 ` [PATCH v6 06/11] mm: Add has_fast_aging to struct mmu_notifier James Houghton
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

For implementers, the fast_only bool indicates that the age information
needs to be harvested such that we do not slow down other MMU operations,
and ideally that we are not ourselves slowed down by other MMU
operations.  Usually this means that the implementation should be
lockless.

Also add mmu_notifier_test_young_fast_only() and
mmu_notifier_clear_young_fast_only() helpers to set fast_only for these
notifiers.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/mmu_notifier.h | 46 +++++++++++++++++++++++++++++++-----
 include/trace/events/kvm.h   | 19 +++++++++------
 mm/mmu_notifier.c            | 12 ++++++----
 virt/kvm/kvm_main.c          | 12 ++++++----
 4 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index e2dd57ca368b..45c5995ebd84 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -110,7 +110,8 @@ struct mmu_notifier_ops {
 	int (*clear_young)(struct mmu_notifier *subscription,
 			   struct mm_struct *mm,
 			   unsigned long start,
-			   unsigned long end);
+			   unsigned long end,
+			   bool fast_only);
 
 	/*
 	 * test_young is called to check the young/accessed bitflag in
@@ -120,7 +121,8 @@ struct mmu_notifier_ops {
 	 */
 	int (*test_young)(struct mmu_notifier *subscription,
 			  struct mm_struct *mm,
-			  unsigned long address);
+			  unsigned long address,
+			  bool fast_only);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -380,9 +382,11 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long end);
 extern int __mmu_notifier_clear_young(struct mm_struct *mm,
 				      unsigned long start,
-				      unsigned long end);
+				      unsigned long end,
+				      bool fast_only);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
-				     unsigned long address);
+				     unsigned long address,
+				     bool fast_only);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -416,7 +420,16 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
 					   unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		return __mmu_notifier_clear_young(mm, start, end);
+		return __mmu_notifier_clear_young(mm, start, end, false);
+	return 0;
+}
+
+static inline int mmu_notifier_clear_young_fast_only(struct mm_struct *mm,
+						     unsigned long start,
+						     unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_clear_young(mm, start, end, true);
 	return 0;
 }
 
@@ -424,7 +437,15 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
 	if (mm_has_notifiers(mm))
-		return __mmu_notifier_test_young(mm, address);
+		return __mmu_notifier_test_young(mm, address, false);
+	return 0;
+}
+
+static inline int mmu_notifier_test_young_fast_only(struct mm_struct *mm,
+						    unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_test_young(mm, address, true);
 	return 0;
 }
 
@@ -613,12 +634,25 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_clear_young_fast_only(struct mm_struct *mm,
+						     unsigned long start,
+						     unsigned long end)
+{
+	return 0;
+}
+
 static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
 	return 0;
 }
 
+static inline int mmu_notifier_test_young_fast_only(struct mm_struct *mm,
+						    unsigned long address)
+{
+	return 0;
+}
+
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 74e40d5d4af4..6d9485cf3e51 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -457,36 +457,41 @@ TRACE_EVENT(kvm_unmap_hva_range,
 );
 
 TRACE_EVENT(kvm_age_hva,
-	TP_PROTO(unsigned long start, unsigned long end),
-	TP_ARGS(start, end),
+	TP_PROTO(unsigned long start, unsigned long end, bool fast_only),
+	TP_ARGS(start, end, fast_only),
 
 	TP_STRUCT__entry(
 		__field(	unsigned long,	start		)
 		__field(	unsigned long,	end		)
+		__field(	bool,		fast_only	)
 	),
 
 	TP_fast_assign(
 		__entry->start		= start;
 		__entry->end		= end;
+		__entry->fast_only	= fast_only;
 	),
 
-	TP_printk("mmu notifier age hva: %#016lx -- %#016lx",
-		  __entry->start, __entry->end)
+	TP_printk("mmu notifier age hva: %#016lx -- %#016lx fast_only: %d",
+		  __entry->start, __entry->end, __entry->fast_only)
 );
 
 TRACE_EVENT(kvm_test_age_hva,
-	TP_PROTO(unsigned long hva),
-	TP_ARGS(hva),
+	TP_PROTO(unsigned long hva, bool fast_only),
+	TP_ARGS(hva, fast_only),
 
 	TP_STRUCT__entry(
 		__field(	unsigned long,	hva		)
+		__field(	bool,		fast_only	)
 	),
 
 	TP_fast_assign(
 		__entry->hva		= hva;
+		__entry->fast_only	= fast_only;
 	),
 
-	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
+	TP_printk("mmu notifier test age hva: %#016lx fast_only: %d",
+		  __entry->hva, __entry->fast_only)
 );
 
 #endif /* _TRACE_KVM_MAIN_H */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 8982e6139d07..f9a0ca6ffe65 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 
 int __mmu_notifier_clear_young(struct mm_struct *mm,
 			       unsigned long start,
-			       unsigned long end)
+			       unsigned long end,
+			       bool fast_only)
 {
 	struct mmu_notifier *subscription;
 	int young = 0, id;
@@ -395,7 +396,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 				 srcu_read_lock_held(&srcu)) {
 		if (subscription->ops->clear_young)
 			young |= subscription->ops->clear_young(subscription,
-								mm, start, end);
+								mm, start, end,
+								fast_only);
 	}
 	srcu_read_unlock(&srcu, id);
 
@@ -403,7 +405,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 }
 
 int __mmu_notifier_test_young(struct mm_struct *mm,
-			      unsigned long address)
+			      unsigned long address,
+			      bool fast_only)
 {
 	struct mmu_notifier *subscription;
 	int young = 0, id;
@@ -414,7 +417,8 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 				 srcu_read_lock_held(&srcu)) {
 		if (subscription->ops->test_young) {
 			young = subscription->ops->test_young(subscription, mm,
-							      address);
+							      address,
+							      fast_only);
 			if (young)
 				break;
 		}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33f8997a5c29..959b6d5d8ce4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -874,7 +874,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 					      unsigned long start,
 					      unsigned long end)
 {
-	trace_kvm_age_hva(start, end);
+	trace_kvm_age_hva(start, end, false);
 
 	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
 }
@@ -882,9 +882,10 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long start,
-					unsigned long end)
+					unsigned long end,
+					bool fast_only)
 {
-	trace_kvm_age_hva(start, end);
+	trace_kvm_age_hva(start, end, fast_only);
 
 	/*
 	 * Even though we do not flush TLB, this will still adversely
@@ -904,9 +905,10 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long address)
+				       unsigned long address,
+				       bool fast_only)
 {
-	trace_kvm_test_age_hva(address);
+	trace_kvm_test_age_hva(address, fast_only);
 
 	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
 					     kvm_test_age_gfn);
-- 
2.46.0.rc1.232.g9752f9e123-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers
  2024-07-24  1:10 ` [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers James Houghton
@ 2024-08-01  9:36   ` David Hildenbrand
  2024-08-01 23:13     ` James Houghton
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-08-01  9:36 UTC (permalink / raw)
  To: James Houghton, Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Morse, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta, Ryan Roberts,
	Sean Christopherson, Shaoqin Huang, Suzuki K Poulose, Wei Xu,
	Will Deacon, Yu Zhao, Zenghui Yu, kvmarm, kvm, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm

On 24.07.24 03:10, James Houghton wrote:
> For implementers, the fast_only bool indicates that the age information
> needs to be harvested such that we do not slow down other MMU operations,
> and ideally that we are not ourselves slowed down by other MMU
> operations.  Usually this means that the implementation should be
> lockless.

But what are the semantics if "fast_only" cannot be achieved by the 
implementer?

Can we add some documentation to the new functions that explain what 
this mysterious "fast_only" is and what the expected semantics are? 
Please? :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers
  2024-08-01  9:36   ` David Hildenbrand
@ 2024-08-01 23:13     ` James Houghton
  2024-08-02 15:57       ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-08-01 23:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On Thu, Aug 1, 2024 at 2:36 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.07.24 03:10, James Houghton wrote:
> > For implementers, the fast_only bool indicates that the age information
> > needs to be harvested such that we do not slow down other MMU operations,
> > and ideally that we are not ourselves slowed down by other MMU
> > operations.  Usually this means that the implementation should be
> > lockless.
>
> But what are the semantics if "fast_only" cannot be achieved by the
> implementer?
>
> Can we add some documentation to the new functions that explain what
> this mysterious "fast_only" is and what the expected semantics are?
> Please? :)

Thanks for pointing out the missing documentation. How's this?

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 45c5995ebd84..c21992036dd3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -106,6 +106,18 @@ struct mmu_notifier_ops {
         * clear_young is a lightweight version of clear_flush_young. Like the
         * latter, it is supposed to test-and-clear the young/accessed bitflag
         * in the secondary pte, but it may omit flushing the secondary tlb.
+        *
+        * The fast_only parameter indicates that this call should not block,
+        * and this function should not cause other MMU notifier calls to
+        * block. Usually this means that the implementation should be
+        * lockless.
+        *
+        * When called with fast_only, this notifier will be a no-op unless
+        * has_fast_aging is set on the struct mmu_notifier.
+        *
+        * When fast_only is true, if the implementer cannot determine that a
+        * range is young without blocking, it should return 0 (i.e.,
+        * that the range is NOT young).
         */
        int (*clear_young)(struct mmu_notifier *subscription,
                           struct mm_struct *mm,
@@ -118,6 +130,8 @@ struct mmu_notifier_ops {
         * the secondary pte. This is used to know if the page is
         * frequently used without actually clearing the flag or tearing
         * down the secondary mapping on the page.
+        *
+        * The fast_only parameter has the same meaning as with clear_young.
         */
        int (*test_young)(struct mmu_notifier *subscription,
                          struct mm_struct *mm,

I've also moved the commit that follows this one (the one that adds
has_fast_aging) to be before this one so that the comment makes sense.
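
A minimal userspace sketch of the semantics described in the proposed
comment, for illustration only. The names toy_pte and toy_clear_young()
are hypothetical and do not exist in the kernel, and the lockless_ok
flag is an assumed stand-in for "this entry can be aged without
blocking":

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct toy_pte {
	atomic_bool accessed;	/* stands in for the secondary PTE's accessed bit */
	bool lockless_ok;	/* assumption: can this entry be aged without a lock? */
};

/* Returns 1 if the entry was young, 0 otherwise. */
static int toy_clear_young(struct toy_pte *pte, bool fast_only)
{
	assert(pte);

	if (!pte->lockless_ok) {
		if (fast_only)
			return 0;	/* cannot tell without blocking: report NOT young */
		/* A real implementation would take its lock here. */
	}

	/* Test-and-clear the accessed bit in one atomic step. */
	return atomic_exchange(&pte->accessed, false) ? 1 : 0;
}
```

With fast_only set, an entry that would need a lock is reported as not
young instead of being waited on, and its accessed bit is left alone.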



* Re: [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers
  2024-08-01 23:13     ` James Houghton
@ 2024-08-02 15:57       ` David Hildenbrand
  2024-08-05 16:54         ` James Houghton
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:57 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On 02.08.24 01:13, James Houghton wrote:
> On Thu, Aug 1, 2024 at 2:36 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 24.07.24 03:10, James Houghton wrote:
>>> For implementers, the fast_only bool indicates that the age information
>>> needs to be harvested such that we do not slow down other MMU operations,
>>> and ideally that we are not ourselves slowed down by other MMU
>>> operations.  Usually this means that the implementation should be
>>> lockless.
>>
>> But what are the semantics if "fast_only" cannot be achieved by the
>> implementer?
>>
>> Can we add some documentation to the new functions that explain what
>> this mysterious "fast_only" is and what the expected semantics are?
>> Please? :)
> 
> Thanks for pointing out the missing documentation. How's this?
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 45c5995ebd84..c21992036dd3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -106,6 +106,18 @@ struct mmu_notifier_ops {
>           * clear_young is a lightweight version of clear_flush_young. Like the
>           * latter, it is supposed to test-and-clear the young/accessed bitflag
>           * in the secondary pte, but it may omit flushing the secondary tlb.
> +        *

Probably makes sense to highlight the parameters like @fast_only

> +        * The fast_only parameter indicates that this call should not block,
> +        * and this function should not cause other MMU notifier calls to
> +        * block. Usually this means that the implementation should be
> +        * lockless.
> +        *
> +        * When called with fast_only, this notifier will be a no-op unless
> +        * has_fast_aging is set on the struct mmu_notifier.

"... and will return 0 (NOT young)." ?

> +        *
> +        * When fast_only is true, if the implementer cannot determine that a
> +        * range is young without blocking, it should return 0 (i.e.,
> +        * that the range is NOT young).
>           */
>          int (*clear_young)(struct mmu_notifier *subscription,
>                             struct mm_struct *mm,
> @@ -118,6 +130,8 @@ struct mmu_notifier_ops {
>           * the secondary pte. This is used to know if the page is
>           * frequently used without actually clearing the flag or tearing
>           * down the secondary mapping on the page.
> +        *
> +        * The fast_only parameter has the same meaning as with clear_young.
>           */
>          int (*test_young)(struct mmu_notifier *subscription,
>                            struct mm_struct *mm,
> 
> I've also moved the commit that follows this one (the one that adds
> has_fast_aging) to be before this one so that the comment makes sense.


Makes sense, thanks!

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers
  2024-08-02 15:57       ` David Hildenbrand
@ 2024-08-05 16:54         ` James Houghton
  0 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-08-05 16:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On Fri, Aug 2, 2024 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.08.24 01:13, James Houghton wrote:
> > On Thu, Aug 1, 2024 at 2:36 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 24.07.24 03:10, James Houghton wrote:
> >>> For implementers, the fast_only bool indicates that the age information
> >>> needs to be harvested such that we do not slow down other MMU operations,
> >>> and ideally that we are not ourselves slowed down by other MMU
> >>> operations.  Usually this means that the implementation should be
> >>> lockless.
> >>
> >> But what are the semantics if "fast_only" cannot be achieved by the
> >> implementer?
> >>
> >> Can we add some documentation to the new functions that explain what
> >> this mysterious "fast_only" is and what the expected semantics are?
> >> Please? :)
> >
> > Thanks for pointing out the missing documentation. How's this?
> >
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 45c5995ebd84..c21992036dd3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -106,6 +106,18 @@ struct mmu_notifier_ops {
> >           * clear_young is a lightweight version of clear_flush_young. Like the
> >           * latter, it is supposed to test-and-clear the young/accessed bitflag
> >           * in the secondary pte, but it may omit flushing the secondary tlb.
> > +        *
>
> Probably makes sense to highlight the parameters like @fast_only

Will do.

>
> > +        * The fast_only parameter indicates that this call should not block,
> > +        * and this function should not cause other MMU notifier calls to
> > +        * block. Usually this means that the implementation should be
> > +        * lockless.
> > +        *
> > +        * When called with fast_only, this notifier will be a no-op unless
> > +        * has_fast_aging is set on the struct mmu_notifier.
>
> "... and will return 0 (NOT young)." ?

Thanks, I'll add this.

>
> > +        *
> > +        * When fast_only is true, if the implementer cannot determine that a
> > +        * range is young without blocking, it should return 0 (i.e.,
> > +        * that the range is NOT young).
> >           */
> >          int (*clear_young)(struct mmu_notifier *subscription,
> >                             struct mm_struct *mm,
> > @@ -118,6 +130,8 @@ struct mmu_notifier_ops {
> >           * the secondary pte. This is used to know if the page is
> >           * frequently used without actually clearing the flag or tearing
> >           * down the secondary mapping on the page.
> > +        *
> > +        * The fast_only parameter has the same meaning as with clear_young.
> >           */
> >          int (*test_young)(struct mmu_notifier *subscription,
> >                            struct mm_struct *mm,
> >
> > I've also moved the commit that follows this one (the one that adds
> > has_fast_aging) to be before this one so that the comment makes sense.
>
>
> Makes sense, thanks!

Thanks David!



* [PATCH v6 06/11] mm: Add has_fast_aging to struct mmu_notifier
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
                   ` (4 preceding siblings ...)
  2024-07-24  1:10 ` [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-07-24  1:10 ` [PATCH v6 07/11] KVM: Pass fast_only to kvm_{test_,}age_gfn James Houghton
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

has_fast_aging should be set by subscribers that non-trivially implement
fast_only versions of both test_young() and clear_young().

Fast aging must be opt-in. For a subscriber that has not been
enlightened with "fast aging", the test/clear_young() will behave
identically whether or not fast_only is given.

Given that KVM is the only test/clear_young() implementer, we could
instead add an equivalent check in KVM, but doing so would incur an
indirect function call every time, even if the notifier ends up being a
no-op.

Add mm_has_fast_young_notifiers() in case a caller wants to know if it
should skip many calls to the mmu notifiers that may not be necessary
(like MGLRU look-around).
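
The opt-in dispatch described above can be modelled in userspace
roughly as follows. This is only a sketch: struct toy_subscriber and
the toy_*() functions are hypothetical, and a plain singly linked list
stands in for the SRCU-protected hlist used by the real code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a notifier subscriber; not the kernel structure. */
struct toy_subscriber {
	bool has_fast_aging;	/* opt-in flag, as in the patch */
	int (*clear_young)(struct toy_subscriber *sub, bool fast_only);
	struct toy_subscriber *next;
};

/* Example callback: a subscriber that always reports "young". */
static int toy_always_young(struct toy_subscriber *sub, bool fast_only)
{
	assert(sub);
	(void)fast_only;
	return 1;
}

/* OR together the results, skipping subscribers that did not opt in. */
static int toy_notifier_clear_young(struct toy_subscriber *head, bool fast_only)
{
	int young = 0;

	for (struct toy_subscriber *s = head; s; s = s->next) {
		if (fast_only && !s->has_fast_aging)
			continue;	/* no indirect call for unenlightened subscribers */
		if (s->clear_young)
			young |= s->clear_young(s, fast_only);
	}
	return young;
}

/* Lets a caller skip a batch of notifier calls that would all be no-ops. */
static bool toy_has_fast_young_notifiers(struct toy_subscriber *head)
{
	for (struct toy_subscriber *s = head; s; s = s->next)
		if (s->has_fast_aging)
			return true;
	return false;
}
```

A caller doing many fast-only lookups (the look-around case mentioned
above) would check toy_has_fast_young_notifiers() once and skip the
per-address calls entirely when it returns false.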

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/mmu_notifier.h | 14 ++++++++++++++
 mm/mmu_notifier.c            | 26 ++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 45c5995ebd84..e23fc10f864b 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -233,6 +233,7 @@ struct mmu_notifier {
 	struct mm_struct *mm;
 	struct rcu_head rcu;
 	unsigned int users;
+	bool has_fast_aging;
 };
 
 /**
@@ -387,6 +388,7 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address,
 				     bool fast_only);
+extern bool __mm_has_fast_young_notifiers(struct mm_struct *mm);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -449,6 +451,13 @@ static inline int mmu_notifier_test_young_fast_only(struct mm_struct *mm,
 	return 0;
 }
 
+static inline bool mm_has_fast_young_notifiers(struct mm_struct *mm)
+{
+	if (mm_has_notifiers(mm))
+		return __mm_has_fast_young_notifiers(mm);
+	return 0;
+}
+
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -653,6 +662,11 @@ static inline int mmu_notifier_test_young_fast_only(struct mm_struct *mm,
 	return 0;
 }
 
+static inline bool mm_has_fast_young_notifiers(struct mm_struct *mm)
+{
+	return 0;
+}
+
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index f9a0ca6ffe65..f9ec810c8a1b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -382,6 +382,26 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return young;
 }
 
+bool __mm_has_fast_young_notifiers(struct mm_struct *mm)
+{
+	struct mmu_notifier *subscription;
+	bool has_fast_aging = false;
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	hlist_for_each_entry_rcu(subscription,
+				 &mm->notifier_subscriptions->list, hlist,
+				 srcu_read_lock_held(&srcu)) {
+		if (subscription->has_fast_aging) {
+			has_fast_aging = true;
+			break;
+		}
+	}
+	srcu_read_unlock(&srcu, id);
+
+	return has_fast_aging;
+}
+
 int __mmu_notifier_clear_young(struct mm_struct *mm,
 			       unsigned long start,
 			       unsigned long end,
@@ -394,6 +414,9 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 	hlist_for_each_entry_rcu(subscription,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
+		if (fast_only && !subscription->has_fast_aging)
+			continue;
+
 		if (subscription->ops->clear_young)
 			young |= subscription->ops->clear_young(subscription,
 								mm, start, end,
@@ -415,6 +438,9 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	hlist_for_each_entry_rcu(subscription,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
+		if (fast_only && !subscription->has_fast_aging)
+			continue;
+
 		if (subscription->ops->test_young) {
 			young = subscription->ops->test_young(subscription, mm,
 							      address,
-- 
2.46.0.rc1.232.g9752f9e123-goog




* [PATCH v6 07/11] KVM: Pass fast_only to kvm_{test_,}age_gfn
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
                   ` (5 preceding siblings ...)
  2024-07-24  1:10 ` [PATCH v6 06/11] mm: Add has_fast_aging to struct mmu_notifier James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-07-24  1:10 ` [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit James Houghton
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

Provide the basics for architectures to implement a fast-only version of
kvm_{test_,}age_gfn.

Add CONFIG_HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY that architectures will
set if they non-trivially implement test_young() and clear_young() when
called with fast_only.
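
As a rough userspace illustration of the plumbing this patch adds, the
generic layer can carry fast_only to the architecture handler inside
the range argument. All toy_*() names below are hypothetical stand-ins,
not the KVM definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kvm_mmu_notifier_arg and kvm_gfn_range. */
union toy_notifier_arg {
	unsigned long attributes;
	bool fast_only;
};

struct toy_gfn_range {
	unsigned long start;
	unsigned long end;
	union toy_notifier_arg arg;
};

typedef bool (*toy_gfn_handler_t)(struct toy_gfn_range *range);

/* Generic layer: packs fast_only into the range and calls the handler. */
static bool toy_handle_range_no_flush(unsigned long start, unsigned long end,
				      toy_gfn_handler_t handler, bool fast_only)
{
	struct toy_gfn_range range = {
		.start		= start,
		.end		= end,
		.arg.fast_only	= fast_only,
	};

	assert(handler);
	return handler(&range);
}

/* Example arch handler that only has a slow (locking) path. */
static bool toy_slow_only_age_gfn(struct toy_gfn_range *range)
{
	if (range->arg.fast_only)
		return false;	/* nothing can be done without blocking */
	return true;		/* pretend the locked walk found a young page */
}
```

An architecture whose handler looks like toy_slow_only_age_gfn() would
not set the new Kconfig symbol, since its fast-only path is trivial.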

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/Kconfig         |  4 ++++
 virt/kvm/kvm_main.c      | 37 +++++++++++++++++++++----------------
 3 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8cd80f969cff..944c5fba2344 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
 	unsigned long attributes;
+	bool fast_only;
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 632334861001..cb4d5384c2f2 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -103,6 +103,10 @@ config KVM_GENERIC_MMU_NOTIFIER
 config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
        bool
 
+config HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY
+       select KVM_GENERIC_MMU_NOTIFIER
+       bool
+
 config KVM_GENERIC_MEMORY_ATTRIBUTES
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 959b6d5d8ce4..86fb2b560d98 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -697,18 +697,20 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
 							 unsigned long start,
 							 unsigned long end,
-							 gfn_handler_t handler)
+							 gfn_handler_t handler,
+							 bool fast_only)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	const struct kvm_mmu_notifier_range range = {
-		.start		= start,
-		.end		= end,
-		.handler	= handler,
-		.on_lock	= (void *)kvm_null_fn,
-		.flush_on_ret	= false,
-		.may_block	= false,
-		.lockless	=
+		.start			= start,
+		.end			= end,
+		.handler		= handler,
+		.on_lock		= (void *)kvm_null_fn,
+		.flush_on_ret		= false,
+		.may_block		= false,
+		.lockless		=
 			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
+		.arg.fast_only		= fast_only,
 	};
 
 	return __kvm_handle_hva_range(kvm, &range).ret;
@@ -900,7 +902,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
-	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn,
+					     fast_only);
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -911,7 +914,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 	trace_kvm_test_age_hva(address, fast_only);
 
 	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
-					     kvm_test_age_gfn);
+					     kvm_test_age_gfn, fast_only);
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
@@ -926,17 +929,19 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
-	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
-	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
-	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
-	.clear_young		= kvm_mmu_notifier_clear_young,
-	.test_young		= kvm_mmu_notifier_test_young,
-	.release		= kvm_mmu_notifier_release,
+	.invalidate_range_start		= kvm_mmu_notifier_invalidate_range_start,
+	.invalidate_range_end		= kvm_mmu_notifier_invalidate_range_end,
+	.clear_flush_young		= kvm_mmu_notifier_clear_flush_young,
+	.clear_young			= kvm_mmu_notifier_clear_young,
+	.test_young			= kvm_mmu_notifier_test_young,
+	.release			= kvm_mmu_notifier_release,
 };
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
 {
 	kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	kvm->mmu_notifier.has_fast_aging =
+		IS_ENABLED(CONFIG_HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY);
 	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
 }
 
-- 
2.46.0.rc1.232.g9752f9e123-goog




* [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
                   ` (6 preceding siblings ...)
  2024-07-24  1:10 ` [PATCH v6 07/11] KVM: Pass fast_only to kvm_{test_,}age_gfn James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-07-25 18:17   ` David Matlack
  2024-07-24  1:10 ` [PATCH v6 09/11] KVM: x86: Implement fast_only versions of kvm_{test_,}age_gfn James Houghton
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

Optimize both kvm_age_gfn and kvm_test_age_gfn's interaction with the
shadow MMU by, rather than checking if our memslot has rmaps, check if
there are any indirect_shadow_pages at all.

Also, for kvm_test_age_gfn, reorder the TDP MMU check to be first. If we
find that the range is young, we do not need to check the shadow MMU.
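
A small userspace model of the reordering for the test-only path,
assuming (as a simplification) that the TDP MMU check needs no lock
and the shadow MMU check does. struct toy_mmu and the toy_*()
functions are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a VM with two MMUs; not kernel code. */
struct toy_mmu {
	bool tdp_enabled;		/* models tdp_mmu_enabled */
	bool tdp_young;			/* what a lockless TDP MMU walk would report */
	bool shadow_present;		/* models "there are shadow MMU SPTEs" */
	bool shadow_young;		/* what a locked shadow MMU walk would report */
	unsigned int lock_acquisitions;	/* how often the write lock was taken */
};

static bool toy_shadow_test_young_locked(struct toy_mmu *mmu)
{
	mmu->lock_acquisitions++;	/* stands in for write_lock()/write_unlock() */
	return mmu->shadow_young;
}

static bool toy_test_age(struct toy_mmu *mmu)
{
	bool young = false;

	assert(mmu);

	if (mmu->tdp_enabled)
		young |= mmu->tdp_young;

	/* Already known to be young: skip the walk that needs the lock. */
	if (!young && mmu->shadow_present)
		young |= toy_shadow_test_young_locked(mmu);

	return young;
}
```

The lock counter makes the saving visible: when the TDP MMU already
reports the range as young, the shadow MMU is never consulted.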

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7b93ce8f0680..919d59385f89 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1629,19 +1629,24 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
 }
 
+static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
+{
+	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm)) {
+	if (tdp_mmu_enabled)
+		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
+
+	if (kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-	if (tdp_mmu_enabled)
-		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
-
 	return young;
 }
 
@@ -1649,15 +1654,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm)) {
+	if (tdp_mmu_enabled)
+		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
+
+	if (!young && kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-	if (tdp_mmu_enabled)
-		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
-
 	return young;
 }
 
-- 
2.46.0.rc1.232.g9752f9e123-goog




* Re: [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit
  2024-07-24  1:10 ` [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit James Houghton
@ 2024-07-25 18:17   ` David Matlack
  2024-08-17  1:00     ` Sean Christopherson
  0 siblings, 1 reply; 40+ messages in thread
From: David Matlack @ 2024-07-25 18:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Rientjes, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On 2024-07-24 01:10 AM, James Houghton wrote:
> Optimize both kvm_age_gfn and kvm_test_age_gfn's interaction with the

nit: Use () when referring to functions.

> shadow MMU by, rather than checking if our memslot has rmaps, check if
> there are any indirect_shadow_pages at all.

What is optimized by checking indirect_shadow_pages instead of
have_rmaps and what's the benefit? Smells like a premature optimization.

> 
> Also, for kvm_test_age_gfn, reorder the TDP MMU check to be first. If we
> find that the range is young, we do not need to check the shadow MMU.

This should be a separate commit since it's a logically distinct change
and no dependency on the other change in this commit (other than both
touch the same function).

Splitting the commits up will also make it easier to write more specific
short logs (instead of "optimize a little bit" :)

Also, the commit re-orders kvm_age_gfn() as well but the commit message
only mentions kvm_test_age_gfn(). No objection to keeping the two
functions consistent but it should be called out in the commit message.

> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7b93ce8f0680..919d59385f89 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1629,19 +1629,24 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
>  	__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
>  }
>  
> +static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> +{
> +	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> +}
> +
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	bool young = false;
>  
> -	if (kvm_memslots_have_rmaps(kvm)) {
> +	if (tdp_mmu_enabled)
> +		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> +
> +	if (kvm_has_shadow_mmu_sptes(kvm)) {
>  		write_lock(&kvm->mmu_lock);
>  		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>  		write_unlock(&kvm->mmu_lock);
>  	}
>  
> -	if (tdp_mmu_enabled)
> -		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> -
>  	return young;
>  }
>  
> @@ -1649,15 +1654,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	bool young = false;
>  
> -	if (kvm_memslots_have_rmaps(kvm)) {
> +	if (tdp_mmu_enabled)
> +		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> +
> +	if (!young && kvm_has_shadow_mmu_sptes(kvm)) {

nit: A short comment here might be helpful to explain why young is
checked.

>  		write_lock(&kvm->mmu_lock);
>  		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
>  		write_unlock(&kvm->mmu_lock);
>  	}
>  
> -	if (tdp_mmu_enabled)
> -		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> -
>  	return young;
>  }
>  
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 



* Re: [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit
  2024-07-25 18:17   ` David Matlack
@ 2024-08-17  1:00     ` Sean Christopherson
  2024-08-30  0:34       ` James Houghton
  0 siblings, 1 reply; 40+ messages in thread
From: Sean Christopherson @ 2024-08-17  1:00 UTC (permalink / raw)
  To: David Matlack
  Cc: James Houghton, Andrew Morton, Paolo Bonzini, Ankit Agrawal,
	Axel Rasmussen, Catalin Marinas, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Thu, Jul 25, 2024, David Matlack wrote:
> On 2024-07-24 01:10 AM, James Houghton wrote:
> > Optimize both kvm_age_gfn and kvm_test_age_gfn's interaction with the
> 
> nit: Use () when referring to functions.
> 
> > shadow MMU by, rather than checking if our memslot has rmaps, check if
> > there are any indirect_shadow_pages at all.
> 
> What is optimized by checking indirect_shadow_pages instead of
> have_rmaps and what's the benefit? Smells like a premature optimization.

Checking indirect_shadow_pages avoids taking mmu_lock for write when KVM doesn't
currently have shadow MMU pages, but did at some point in the past, whereas
kvm_memslots_have_rmaps() is sticky and will return true forever.
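
To make the sticky-versus-live distinction concrete, here is a toy
userspace model. The struct and function names are hypothetical, and
tying the sticky flag to shadow page creation is a simplifying
assumption rather than a description of how KVM sets it:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; field names only loosely mirror KVM's. */
struct toy_kvm {
	bool rmaps_allocated;			/* sticky: set once, never cleared */
	unsigned long indirect_shadow_pages;	/* live count of shadow pages */
};

static void toy_shadow_page_create(struct toy_kvm *kvm)
{
	kvm->rmaps_allocated = true;
	kvm->indirect_shadow_pages++;
}

static void toy_shadow_page_destroy(struct toy_kvm *kvm)
{
	assert(kvm->indirect_shadow_pages > 0);
	kvm->indirect_shadow_pages--;
}

/* Old-style check: true forever once any shadow page has existed. */
static bool toy_sticky_check(const struct toy_kvm *kvm)
{
	return kvm->rmaps_allocated;
}

/* New-style check: true only while shadow pages currently exist. */
static bool toy_counter_check(const struct toy_kvm *kvm)
{
	return kvm->indirect_shadow_pages != 0;
}
```

After a create followed by a destroy, the sticky check still says the
lock is needed while the counter check does not.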

> > Also, for kvm_test_age_gfn, reorder the TDP MMU check to be first. If we
> > find that the range is young, we do not need to check the shadow MMU.
> 
> This should be a separate commit since it's a logically distinct change
> and no dependency on the other change in this commit (other than both
> touch the same function).
> 
> Splitting the commits up will also make it easier to write more specific
> short logs (instead of "optimize a little bit" :)

+1.  Especially code movement and refactoring, e.g. factoring out
tdp_mmu_clear_spte_bits_atomic() would ideally be in a standalone patch that's
dead simple to review.



* Re: [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit
  2024-08-17  1:00     ` Sean Christopherson
@ 2024-08-30  0:34       ` James Houghton
  0 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-08-30  0:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, Andrew Morton, Paolo Bonzini, Ankit Agrawal,
	Axel Rasmussen, Catalin Marinas, David Rientjes, James Morse,
	Jason Gunthorpe, Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang,
	Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu,
	kvmarm, kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm

On Fri, Aug 16, 2024 at 6:00 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jul 25, 2024, David Matlack wrote:
> > On 2024-07-24 01:10 AM, James Houghton wrote:
> > > Optimize both kvm_age_gfn and kvm_test_age_gfn's interaction with the
> >
> > nit: Use () when referring to functions.
> >
> > > shadow MMU by, rather than checking if our memslot has rmaps, check if
> > > there are any indirect_shadow_pages at all.
> >
> > What is optimized by checking indirect_shadow_pages instead of
> > have_rmaps and what's the benefit? Smells like a premature optimization.
>
> Checking indirect_shadow_pages avoids taking mmu_lock for write when KVM doesn't
> currently have shadow MMU pages, but did at some point in the past, whereas
> kvm_memslots_have_rmaps() is sticky and will return true forever.

Thanks for the clear explanation.

> > > Also, for kvm_test_age_gfn, reorder the TDP MMU check to be first. If we
> > > find that the range is young, we do not need to check the shadow MMU.
> >
> > This should be a separate commit since it's a logically distinct change
> > and no dependency on the other change in this commit (other than both
> > touch the same function).

Done.

> > Splitting the commits up will also make it easier to write more specific
> > short logs (instead of "optimize a little bit" :)
>
> +1.  Especially code movement and refactoring, e.g. factoring out
> tdp_mmu_clear_spte_bits_atomic() would ideally be in a standalone patch that's
> dead simple to review.

I have now split out the creation of tdp_mmu_clear_spte_bits_atomic()
into its own patch. Though I'm not entirely convinced splitting out
every refactor like that is always a good thing.



* [PATCH v6 09/11] KVM: x86: Implement fast_only versions of kvm_{test_,}age_gfn
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
                   ` (7 preceding siblings ...)
  2024-07-24  1:10 ` [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-07-25 18:24   ` David Matlack
  2024-07-24  1:10 ` [PATCH v6 10/11] mm: multi-gen LRU: Have secondary MMUs participate in aging James Houghton
  2024-07-24  1:10 ` [PATCH v6 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
  10 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

These fast-only versions simply ignore the shadow MMU. We can locklessly
handle the shadow MMU later.

Set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY for X86_64 only, as that is
the only case where the TDP MMU might be used. Without the TDP MMU, the
fast-only notifiers will always be no-ops. It would be ideal not to
report has_fast_only if !tdp_mmu_enabled, but tdp_mmu_enabled can be
changed at any time.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/Kconfig   | 1 +
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 6ac43074c5e9..ed9049cf1255 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -24,6 +24,7 @@ config KVM
 	select KVM_COMMON
 	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
+	select HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY if X86_64
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 919d59385f89..3c6c9442434a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1641,7 +1641,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
 
-	if (kvm_has_shadow_mmu_sptes(kvm)) {
+	if (!range->arg.fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
 		write_unlock(&kvm->mmu_lock);
@@ -1657,7 +1657,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
 
-	if (!young && kvm_has_shadow_mmu_sptes(kvm)) {
+	if (!young && !range->arg.fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
 		write_unlock(&kvm->mmu_lock);
-- 
2.46.0.rc1.232.g9752f9e123-goog




* Re: [PATCH v6 09/11] KVM: x86: Implement fast_only versions of kvm_{test_,}age_gfn
  2024-07-24  1:10 ` [PATCH v6 09/11] KVM: x86: Implement fast_only versions of kvm_{test_,}age_gfn James Houghton
@ 2024-07-25 18:24   ` David Matlack
  0 siblings, 0 replies; 40+ messages in thread
From: David Matlack @ 2024-07-25 18:24 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Rientjes, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

On 2024-07-24 01:10 AM, James Houghton wrote:
> These fast-only versions simply ignore the shadow MMU. We can locklessly
> handle the shadow MMU later.
> 
> Set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY for X86_64 only, as that is
> the only case where the TDP MMU might be used. Without the TDP MMU, the
> fast-only notifiers will always be no-ops. It would be ideal not to
> report has_fast_only if !tdp_mmu_enabled, but tdp_mmu_enabled can be
> changed at any time.

tdp_mmu_enabled is a read-only KVM parameter. And even when it was
writable, it was still fixed for a given VM at VM creation time.

Would it make more sense to have kvm_arch_post_init_vm() set
has_fast_aging if the architecture supports it? For x86, that would
mean iff tdp_mmu_enabled.
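
A rough, untested sketch of that idea, modeled in plain userspace C
rather than real KVM code. Only tdp_mmu_enabled and the name
kvm_arch_post_init_vm() come from the tree; struct kvm_model,
arch_post_init_vm_model(), vm_has_fast_aging() and the has_fast_aging
field are made-up stand-ins for illustration:

```c
#include <stdbool.h>

/* Stand-in for the read-only KVM module parameter of the same name. */
static bool tdp_mmu_enabled = true;

/* Hypothetical, heavily reduced model of struct kvm. */
struct kvm_model {
	bool has_fast_aging;	/* hypothetical per-VM flag */
};

/*
 * Model of an arch post-init hook: latch the capability once, at VM
 * creation time, instead of re-evaluating it on every notifier call.
 */
static int arch_post_init_vm_model(struct kvm_model *kvm)
{
	kvm->has_fast_aging = tdp_mmu_enabled;
	return 0;
}

/* What a generic "does this VM support fast aging?" query reduces to. */
static bool vm_has_fast_aging(const struct kvm_model *kvm)
{
	return kvm->has_fast_aging;
}
```

The point of latching the flag per VM is that the answer stays stable
for the lifetime of the VM, which matches tdp_mmu_enabled being fixed
at VM creation time.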

> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/Kconfig   | 1 +
>  arch/x86/kvm/mmu/mmu.c | 4 ++--
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 6ac43074c5e9..ed9049cf1255 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -24,6 +24,7 @@ config KVM
>  	select KVM_COMMON
>  	select KVM_GENERIC_MMU_NOTIFIER
>  	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> +	select HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY if X86_64
>  	select HAVE_KVM_IRQCHIP
>  	select HAVE_KVM_PFNCACHE
>  	select HAVE_KVM_DIRTY_RING_TSO
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 919d59385f89..3c6c9442434a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1641,7 +1641,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  	if (tdp_mmu_enabled)
>  		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
>  
> -	if (kvm_has_shadow_mmu_sptes(kvm)) {
> +	if (!range->arg.fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
>  		write_lock(&kvm->mmu_lock);
>  		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>  		write_unlock(&kvm->mmu_lock);
> @@ -1657,7 +1657,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  	if (tdp_mmu_enabled)
>  		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
>  
> -	if (!young && kvm_has_shadow_mmu_sptes(kvm)) {
> +	if (!young && !range->arg.fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
>  		write_lock(&kvm->mmu_lock);
>  		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
>  		write_unlock(&kvm->mmu_lock);
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 



* [PATCH v6 10/11] mm: multi-gen LRU: Have secondary MMUs participate in aging
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
                   ` (8 preceding siblings ...)
  2024-07-24  1:10 ` [PATCH v6 09/11] KVM: x86: Implement fast_only versions of kvm_{test_,}age_gfn James Houghton
@ 2024-07-24  1:10 ` James Houghton
  2024-07-24  1:10 ` [PATCH v6 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
  10 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

Secondary MMUs are currently consulted for access/age information at
eviction time, but before then, we don't get accurate age information.
That is, pages that are mostly accessed through a secondary MMU (like
guest memory, used by KVM) will always just proceed down to the oldest
generation, and then at eviction time, if KVM reports the page to be
young, the page will be activated/promoted back to the youngest
generation.

The added feature bit (0x8), if disabled, will make MGLRU behave as if
there are no secondary MMUs subscribed to MMU notifiers except at
eviction time.

Implement aging with the new mmu_notifier_clear_young_fast_only()
notifier. For architectures that do not support this notifier, this
becomes a no-op. For architectures that do implement it, it should be
fast enough to make aging worth it (usually the case if the notifier is
implemented locklessly).
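
As a simplified illustration of the per-PTE decision made during aging,
here is a userspace model of the logic in walk_pte_range() below. It is
a sketch only, not kernel code: struct pte_model and age_one_pte() are
hypothetical names, and the two booleans stand in for the primary PTE
accessed bit and for whatever the fast-only secondary MMU notifier would
report:

```c
#include <stdbool.h>

/* Hypothetical model of one page's accessed state. */
struct pte_model {
	bool primary_young;	/* accessed bit in the primary PTE */
	bool secondary_young;	/* accessed state in a secondary MMU */
};

/*
 * Returns true if the page should be counted as young by the aging
 * walk. Accessed state that was consulted is cleared, mirroring the
 * test-and-clear semantics of the real helpers.
 */
static bool age_one_pte(struct pte_model *p, bool mm_has_notifiers,
			bool walk_secondary_mmu)
{
	bool secondary = false;

	/* Cheap exit: nobody could possibly report this page as young. */
	if (!p->primary_young && !mm_has_notifiers)
		return false;

	/* Models the fast-only clear_young, gated by feature bit 0x8. */
	if (walk_secondary_mmu && mm_has_notifiers) {
		secondary = p->secondary_young;
		p->secondary_young = false;
	}

	if (!secondary && !p->primary_young)
		return false;

	/* Clearing an already-clear primary bit is harmless here. */
	p->primary_young = false;
	return true;
}
```

With walk_secondary_mmu false, a page that is only accessed through the
secondary MMU is reported old by aging and keeps its secondary accessed
state until eviction time, which is the pre-patch behavior described
above.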

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 include/linux/mmzone.h                        |   6 +-
 mm/rmap.c                                     |   9 +-
 mm/vmscan.c                                   | 148 ++++++++++++++----
 4 files changed, 127 insertions(+), 42 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..e1862407652c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
        verified on x86 varieties other than Intel and AMD. If it is
        disabled, the multi-gen LRU will suffer a negligible
        performance degradation.
+0x0008 Clear the accessed bit in secondary MMU page tables when aging
+       instead of waiting until eviction time. This results in accurate
+       page age information for pages that are mainly used by a
+       secondary MMU.
 [yYnN] Apply to all the components above.
 ====== ===============================================================
 
@@ -56,7 +60,7 @@ E.g.,
 
     echo y >/sys/kernel/mm/lru_gen/enabled
     cat /sys/kernel/mm/lru_gen/enabled
-    0x0007
+    0x000f
     echo 5 >/sys/kernel/mm/lru_gen/enabled
     cat /sys/kernel/mm/lru_gen/enabled
     0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 586a8f0104d7..ee82e635e75b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -400,6 +400,7 @@ enum {
 	LRU_GEN_CORE,
 	LRU_GEN_MM_WALK,
 	LRU_GEN_NONLEAF_YOUNG,
+	LRU_GEN_SECONDARY_MMU_WALK,
 	NR_LRU_GEN_CAPS
 };
 
@@ -557,7 +558,7 @@ struct lru_gen_memcg {
 
 void lru_gen_init_pgdat(struct pglist_data *pgdat);
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 void lru_gen_init_memcg(struct mem_cgroup *memcg);
 void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -576,8 +577,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
 {
 }
 
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
+	return false;
 }
 
 static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..24a3ff639919 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
 			continue;
 		}
 
-		if (pvmw.pte) {
-			if (lru_gen_enabled() &&
-			    pte_young(ptep_get(pvmw.pte))) {
-				lru_gen_look_around(&pvmw);
+		if (lru_gen_enabled() && pvmw.pte) {
+			if (lru_gen_look_around(&pvmw))
 				referenced++;
-			}
-
+		} else if (pvmw.pte) {
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte))
 				referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..e4fa52c8f714 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2579,6 +2580,11 @@ static bool should_clear_pmd_young(void)
 	return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
 }
 
+static bool should_walk_secondary_mmu(void)
+{
+	return get_cap(LRU_GEN_SECONDARY_MMU_WALK);
+}
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
@@ -3276,7 +3282,8 @@ static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk
 	return false;
 }
 
-static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr,
+				 struct pglist_data *pgdat)
 {
 	unsigned long pfn = pte_pfn(pte);
 
@@ -3291,10 +3298,15 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
 	if (WARN_ON_ONCE(!pfn_valid(pfn)))
 		return -1;
 
+	/* try to avoid unnecessary memory loads */
+	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+		return -1;
+
 	return pfn;
 }
 
-static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr,
+				 struct pglist_data *pgdat)
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
@@ -3309,6 +3321,10 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
 	if (WARN_ON_ONCE(!pfn_valid(pfn)))
 		return -1;
 
+	/* try to avoid unnecessary memory loads */
+	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+		return -1;
+
 	return pfn;
 }
 
@@ -3317,10 +3333,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 {
 	struct folio *folio;
 
-	/* try to avoid unnecessary memory loads */
-	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
-		return NULL;
-
 	folio = pfn_folio(pfn);
 	if (folio_nid(folio) != pgdat->node_id)
 		return NULL;
@@ -3343,6 +3355,26 @@ static bool suitable_to_scan(int total, int young)
 	return young * n >= total;
 }
 
+static bool lru_gen_notifier_clear_young(struct mm_struct *mm,
+					 unsigned long start,
+					 unsigned long end)
+{
+	return should_walk_secondary_mmu() &&
+		mmu_notifier_clear_young_fast_only(mm, start, end);
+}
+
+static bool lru_gen_pmdp_test_and_clear_young(struct vm_area_struct *vma,
+					      unsigned long addr,
+					      pmd_t *pmd)
+{
+	bool young = pmdp_test_and_clear_young(vma, addr, pmd);
+
+	if (lru_gen_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE))
+		young = true;
+
+	return young;
+}
+
 static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 			   struct mm_walk *args)
 {
@@ -3357,8 +3389,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	struct mm_struct *mm = args->mm;
 
-	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+	pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl);
 	if (!pte)
 		return false;
 	if (!spin_trylock(ptl)) {
@@ -3376,11 +3409,11 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		total++;
 		walk->mm_stats[MM_LEAF_TOTAL]++;
 
-		pfn = get_pte_pfn(ptent, args->vma, addr);
+		pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
 		if (pfn == -1)
 			continue;
 
-		if (!pte_young(ptent)) {
+		if (!pte_young(ptent) && !mm_has_notifiers(mm)) {
 			walk->mm_stats[MM_LEAF_OLD]++;
 			continue;
 		}
@@ -3389,8 +3422,14 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		if (!folio)
 			continue;
 
-		if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
-			VM_WARN_ON_ONCE(true);
+		if (!lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE) &&
+		    !pte_young(ptent)) {
+			walk->mm_stats[MM_LEAF_OLD]++;
+			continue;
+		}
+
+		if (pte_young(ptent))
+			ptep_test_and_clear_young(args->vma, addr, pte + i);
 
 		young++;
 		walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -3456,22 +3495,25 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 		/* don't round down the first address */
 		addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;
 
-		pfn = get_pmd_pfn(pmd[i], vma, addr);
-		if (pfn == -1)
-			goto next;
-
-		if (!pmd_trans_huge(pmd[i])) {
-			if (should_clear_pmd_young())
+		if (pmd_present(pmd[i]) && !pmd_trans_huge(pmd[i])) {
+			if (should_clear_pmd_young() &&
+			    !should_walk_secondary_mmu())
 				pmdp_test_and_clear_young(vma, addr, pmd + i);
 			goto next;
 		}
 
+		pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat);
+		if (pfn == -1)
+			goto next;
+
 		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
 		if (!folio)
 			goto next;
 
-		if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+		if (!lru_gen_pmdp_test_and_clear_young(vma, addr, pmd + i)) {
+			walk->mm_stats[MM_LEAF_OLD]++;
 			goto next;
+		}
 
 		walk->mm_stats[MM_LEAF_YOUNG]++;
 
@@ -3528,19 +3570,18 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 		}
 
 		if (pmd_trans_huge(val)) {
-			unsigned long pfn = pmd_pfn(val);
 			struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+			unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat);
 
 			walk->mm_stats[MM_LEAF_TOTAL]++;
 
-			if (!pmd_young(val)) {
-				walk->mm_stats[MM_LEAF_OLD]++;
+			if (pfn == -1)
 				continue;
-			}
 
-			/* try to avoid unnecessary memory loads */
-			if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			if (!pmd_young(val) && !mm_has_notifiers(args->mm)) {
+				walk->mm_stats[MM_LEAF_OLD]++;
 				continue;
+			}
 
 			walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
 			continue;
@@ -3548,7 +3589,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 
 		walk->mm_stats[MM_NONLEAF_TOTAL]++;
 
-		if (should_clear_pmd_young()) {
+		if (should_clear_pmd_young() && !should_walk_secondary_mmu()) {
 			if (!pmd_young(val))
 				continue;
 
@@ -3994,6 +4035,31 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  *                          rmap/PT walk feedback
  ******************************************************************************/
 
+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+			       pte_t *pte, int *young)
+{
+	int secondary_young = mmu_notifier_clear_young(
+				vma->vm_mm, addr, addr + PAGE_SIZE);
+
+	/*
+	 * Look around if (1) the PTE is young, or (2) a secondary MMU
+	 * reported the page as young and the mm has secondary MMUs that
+	 * implement the "fast" young notifiers.
+	 */
+	if (pte_young(ptep_get(pte))) {
+		ptep_test_and_clear_young(vma, addr, pte);
+		*young = true;
+		return true;
+	}
+
+	if (secondary_young) {
+		*young = true;
+		return mm_has_fast_young_notifiers(vma->vm_mm);
+	}
+
+	return false;
+}
+
 /*
  * This function exploits spatial locality when shrink_folio_list() walks the
  * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4001,7 +4067,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  * the PTE table to the Bloom filter. This forms a feedback loop between the
  * eviction and the aging.
  */
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
 	int i;
 	unsigned long start;
@@ -4019,16 +4085,20 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	struct mm_struct *mm = pvmw->vma->vm_mm;
 
 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
 
+	if (!should_look_around(vma, addr, pte, &young))
+		return young;
+
 	if (spin_is_contended(pvmw->ptl))
-		return;
+		return young;
 
 	/* exclude special VMAs containing anon pages from COW */
 	if (vma->vm_flags & VM_SPECIAL)
-		return;
+		return young;
 
 	/* avoid taking the LRU lock under the PTL when possible */
 	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4036,6 +4106,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	start = max(addr & PMD_MASK, vma->vm_start);
 	end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1;
 
+	if (end - start == PAGE_SIZE)
+		return young;
+
 	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
 		if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
 			end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4049,7 +4122,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 
 	/* folio_update_gen() requires stable folio_memcg() */
 	if (!mem_cgroup_trylock_pages(memcg))
-		return;
+		return young;
 
 	arch_enter_lazy_mmu_mode();
 
@@ -4059,19 +4132,23 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		unsigned long pfn;
 		pte_t ptent = ptep_get(pte + i);
 
-		pfn = get_pte_pfn(ptent, vma, addr);
+		pfn = get_pte_pfn(ptent, vma, addr, pgdat);
 		if (pfn == -1)
 			continue;
 
-		if (!pte_young(ptent))
+		if (!pte_young(ptent) && !mm_has_notifiers(mm))
 			continue;
 
 		folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
 		if (!folio)
 			continue;
 
-		if (!ptep_test_and_clear_young(vma, addr, pte + i))
-			VM_WARN_ON_ONCE(true);
+		if (!lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE) &&
+		    !pte_young(ptent))
+			continue;
+
+		if (pte_young(ptent))
+			ptep_test_and_clear_young(vma, addr, pte + i);
 
 		young++;
 
@@ -4101,6 +4178,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	/* feedback from rmap walkers to page table walkers */
 	if (mm_state && suitable_to_scan(i, young))
 		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+
+	return young;
 }
 
 /******************************************************************************
@@ -5137,6 +5216,9 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
 	if (should_clear_pmd_young())
 		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
 
+	if (should_walk_secondary_mmu())
+		caps |= BIT(LRU_GEN_SECONDARY_MMU_WALK);
+
 	return sysfs_emit(buf, "0x%04x\n", caps);
 }
 
-- 
2.46.0.rc1.232.g9752f9e123-goog




* [PATCH v6 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test
  2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
                   ` (9 preceding siblings ...)
  2024-07-24  1:10 ` [PATCH v6 10/11] mm: multi-gen LRU: Have secondary MMUs participate in aging James Houghton
@ 2024-07-24  1:10 ` James Houghton
  10 siblings, 0 replies; 40+ messages in thread
From: James Houghton @ 2024-07-24  1:10 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Yu Zhao,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm

This test now has two modes of operation:
1. (default) To check how much vCPU performance was affected by access
             tracking (previously existed, now supports MGLRU aging).
2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
        faulting in memory.

Mode (1) also serves as a way to verify that aging is working properly
for pages only accessed by KVM.  It will fail if the 0x8 lru_gen feature
bit is not set.
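
For anyone wanting to check that precondition by hand, a minimal sketch
of testing the bit in the text read from /sys/kernel/mm/lru_gen/enabled
(for example "0x000f" after "echo y", per the documentation hunk in the
previous patch). The helper and macro names here are hypothetical and
are not part of the selftest:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Bit value taken from the documentation; the macro name is made up. */
#define LRU_GEN_CAP_SECONDARY_MMU_WALK	0x8UL

/*
 * Parse the hex capability mask as printed by the "enabled" file and
 * report whether every bit in @mask is set. Unparsable input is
 * treated as "not set".
 */
static bool lru_gen_caps_has(const char *text, unsigned long mask)
{
	unsigned long caps;
	char *end;

	errno = 0;
	caps = strtoul(text, &end, 16);	/* base 16 accepts a "0x" prefix */
	if (errno || end == text)
		return false;

	return (caps & mask) == mask;
}
```

A caller would read the sysfs file into a buffer and pass it in; a
value of 0x0007 means the kernel knows about the first three components
only, or that the new bit has been turned off.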

To support MGLRU, the test creates a memory cgroup, moves itself into
it, then uses the lru_gen debugfs output to track memory in that cgroup.
The logic to parse the lru_gen debugfs output has been put into
selftests/kvm/lib/lru_gen_util.c.

Co-developed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 369 +++++++++++++++--
 .../selftests/kvm/include/lru_gen_util.h      |  55 +++
 .../testing/selftests/kvm/lib/lru_gen_util.c  | 391 ++++++++++++++++++
 4 files changed, 786 insertions(+), 30 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index b084ba2262a0..0ab8d3f4628c 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -22,6 +22,7 @@ LIBKVM += lib/elf.c
 LIBKVM += lib/guest_modes.c
 LIBKVM += lib/io.c
 LIBKVM += lib/kvm_util.c
+LIBKVM += lib/lru_gen_util.c
 LIBKVM += lib/memstress.c
 LIBKVM += lib/guest_sprintf.c
 LIBKVM += lib/rbtree.c
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..6ff64ac349a9 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -38,6 +38,7 @@
 #include <inttypes.h>
 #include <limits.h>
 #include <pthread.h>
+#include <stdio.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -47,6 +48,20 @@
 #include "memstress.h"
 #include "guest_modes.h"
 #include "processor.h"
+#include "lru_gen_util.h"
+
+static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
+static const int LRU_GEN_ENABLED = 0x1;
+static const int LRU_GEN_MM_WALK = 0x2;
+static const int LRU_GEN_SECONDARY_MMU_WALK = 0x8;
+static const char *CGROUP_PROCS = "cgroup.procs";
+/*
+ * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy
+ * is mounted at cgroup_root.
+ *
+ * Can be changed with -r.
+ */
+static const char *cgroup_root = "/sys/fs/cgroup";
 
 /* Global variable used to synchronize all of the vCPU threads. */
 static int iteration;
@@ -62,6 +77,9 @@ static enum {
 /* The iteration that was last completed by each vCPU. */
 static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
 
+/* The time at which the last iteration was completed */
+static struct timespec vcpu_last_completed_time[KVM_MAX_VCPUS];
+
 /* Whether to overlap the regions of memory vCPUs access. */
 static bool overlap_memory_access;
 
@@ -74,6 +92,12 @@ struct test_params {
 
 	/* The number of vCPUs to create in the VM. */
 	int nr_vcpus;
+
+	/* Whether to use lru_gen aging instead of idle page tracking. */
+	bool lru_gen;
+
+	/* Whether to test the performance of aging itself. */
+	bool benchmark_lru_gen;
 };
 
 static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
@@ -89,6 +113,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
 
 }
 
+static void write_file_long(const char *path, long v)
+{
+	FILE *f;
+
+	f = fopen(path, "w");
+	TEST_ASSERT(f, "fopen(%s) failed", path);
+	TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
+		    "fprintf to %s failed", path);
+	TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
+}
+
+static char *path_join(const char *parent, const char *child)
+{
+	char *out = NULL;
+
+	return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
+}
+
+static char *memcg_path(const char *memcg)
+{
+	return path_join(cgroup_root, memcg);
+}
+
+static char *memcg_file_path(const char *memcg, const char *file)
+{
+	char *mp = memcg_path(memcg);
+	char *fp;
+
+	if (!mp)
+		return NULL;
+	fp = path_join(mp, file);
+	free(mp);
+	return fp;
+}
+
+static void move_to_memcg(const char *memcg, pid_t pid)
+{
+	char *procs = memcg_file_path(memcg, CGROUP_PROCS);
+
+	TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
+	write_file_long(procs, pid);
+	free(procs);
+}
+
 #define PAGEMAP_PRESENT (1ULL << 63)
 #define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)
 
@@ -242,6 +310,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
 		};
 
 		vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
+		clock_gettime(CLOCK_MONOTONIC,
+			      &vcpu_last_completed_time[vcpu_idx]);
 	}
 }
 
@@ -253,38 +323,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int target_iteration)
 	}
 }
 
+static bool all_vcpus_done(int target_iteration, int nr_vcpus)
+{
+	for (int i = 0; i < nr_vcpus; ++i)
+		if (READ_ONCE(vcpu_last_completed_iteration[i]) !=
+		    target_iteration)
+			return false;
+
+	return true;
+}
+
 /* The type of memory accesses to perform in the VM. */
 enum access_type {
 	ACCESS_READ,
 	ACCESS_WRITE,
 };
 
-static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description)
+static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description,
+			  bool wait)
 {
-	struct timespec ts_start;
-	struct timespec ts_elapsed;
 	int next_iteration, i;
 
 	/* Kick off the vCPUs by incrementing iteration. */
 	next_iteration = ++iteration;
 
-	clock_gettime(CLOCK_MONOTONIC, &ts_start);
-
 	/* Wait for all vCPUs to finish the iteration. */
-	for (i = 0; i < nr_vcpus; i++)
-		spin_wait_for_vcpu(i, next_iteration);
+	if (wait) {
+		struct timespec ts_start;
+		struct timespec ts_elapsed;
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_start);
 
-	ts_elapsed = timespec_elapsed(ts_start);
-	pr_info("%-30s: %ld.%09lds\n",
-		description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+		for (i = 0; i < nr_vcpus; i++)
+			spin_wait_for_vcpu(i, next_iteration);
+
+		ts_elapsed = timespec_elapsed(ts_start);
+
+		pr_info("%-30s: %ld.%09lds\n",
+			description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+	} else
+		pr_info("%-30s\n", description);
 }
 
-static void access_memory(struct kvm_vm *vm, int nr_vcpus,
-			  enum access_type access, const char *description)
+static void _access_memory(struct kvm_vm *vm, int nr_vcpus,
+			   enum access_type access, const char *description,
+			   bool wait)
 {
 	memstress_set_write_percent(vm, (access == ACCESS_READ) ? 0 : 100);
 	iteration_work = ITERATION_ACCESS_MEMORY;
-	run_iteration(vm, nr_vcpus, description);
+	run_iteration(vm, nr_vcpus, description, wait);
+}
+
+static void access_memory(struct kvm_vm *vm, int nr_vcpus,
+			  enum access_type access, const char *description)
+{
+	_access_memory(vm, nr_vcpus, access, description, true);
+}
+
+static void access_memory_async(struct kvm_vm *vm, int nr_vcpus,
+				enum access_type access,
+				const char *description)
+{
+	_access_memory(vm, nr_vcpus, access, description, false);
 }
 
 static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
@@ -297,19 +397,115 @@ static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
 	 */
 	pr_debug("Marking VM memory idle (slow)...\n");
 	iteration_work = ITERATION_MARK_IDLE;
-	run_iteration(vm, nr_vcpus, "Mark memory idle");
+	run_iteration(vm, nr_vcpus, "Mark memory idle", true);
 }
 
-static void run_test(enum vm_guest_mode mode, void *arg)
+static void create_memcg(const char *memcg)
+{
+	const char *full_memcg_path = memcg_path(memcg);
+	int ret;
+
+	TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path");
+retry:
+	ret = mkdir(full_memcg_path, 0755);
+	if (ret && errno == EEXIST) {
+		TEST_ASSERT(!rmdir(full_memcg_path),
+			    "Found existing memcg at %s, but rmdir failed",
+			    full_memcg_path);
+		goto retry;
+	}
+	TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed",
+		    full_memcg_path);
+
+	pr_info("Created memcg at %s\n", full_memcg_path);
+}
+
+/*
+ * Test lru_gen aging speed while vCPUs are faulting memory in.
+ *
+ * This test will run lru_gen aging until the vCPUs have finished all of
+ * the faulting work, reporting:
+ *  - vcpu wall time (wall time for slowest vCPU)
+ *  - average aging pass duration
+ *  - total number of aging passes
+ *  - total time spent aging
+ *
+ * This test produces the most useful results when the vcpu wall time and the
+ * total time spent aging are similar (i.e., we want to avoid timing aging
+ * while the vCPUs aren't doing any work).
+ */
+static void run_benchmark(enum vm_guest_mode mode, struct kvm_vm *vm,
+			  struct test_params *params)
 {
-	struct test_params *params = arg;
-	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;
+	struct memcg_stats stats;
+	struct timespec ts_start, ts_max, ts_vcpus_elapsed,
+			ts_aging_elapsed, ts_aging_elapsed_avg;
+	int num_passes = 0;
 
-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
-				 params->backing_src, !overlap_memory_access);
+	printf("Running lru_gen benchmark...\n");
 
-	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+	access_memory_async(vm, nr_vcpus, ACCESS_WRITE,
+			    "Populating memory (async)");
+	while (!all_vcpus_done(iteration, nr_vcpus)) {
+		lru_gen_do_aging_quiet(&stats, TEST_MEMCG_NAME);
+		++num_passes;
+	}
+
+	ts_aging_elapsed = timespec_elapsed(ts_start);
+	ts_aging_elapsed_avg = timespec_div(ts_aging_elapsed, num_passes);
+
+	/* Find out when the slowest vCPU finished. */
+	ts_max = ts_start;
+	for (int i = 0; i < nr_vcpus; ++i) {
+		struct timespec *vcpu_ts = &vcpu_last_completed_time[i];
+
+		if (ts_max.tv_sec < vcpu_ts->tv_sec ||
+		    (ts_max.tv_sec == vcpu_ts->tv_sec  &&
+		     ts_max.tv_nsec < vcpu_ts->tv_nsec))
+			ts_max = *vcpu_ts;
+	}
+
+	ts_vcpus_elapsed = timespec_sub(ts_max, ts_start);
+
+	pr_info("%-30s: %ld.%09lds\n", "vcpu wall time",
+		ts_vcpus_elapsed.tv_sec, ts_vcpus_elapsed.tv_nsec);
+
+	pr_info("%-30s: %ld.%09lds, (passes:%d, total:%ld.%09lds)\n",
+		"lru_gen avg pass duration",
+		ts_aging_elapsed_avg.tv_sec,
+		ts_aging_elapsed_avg.tv_nsec,
+		num_passes,
+		ts_aging_elapsed.tv_sec,
+		ts_aging_elapsed.tv_nsec);
+}
+
+/*
+ * Test how much access tracking affects vCPU performance.
+ *
+ * Supports two modes of access tracking:
+ * - idle page tracking
+ * - lru_gen aging
+ *
+ * When using lru_gen, this test additionally verifies that the pages are in
+ * fact getting younger and older, otherwise the performance data would be
+ * invalid.
+ *
+ * The forced lru_gen aging can race with aging that occurs naturally.
+ */
+static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm,
+		     struct test_params *params)
+{
+	int nr_vcpus = params->nr_vcpus;
+	bool lru_gen = params->lru_gen;
+	struct memcg_stats stats;
+	// If guest_page_size is larger than the host's page size, the
+	// guest (memstress) will only fault in a subset of the host's pages.
+	long total_pages = nr_vcpus * params->vcpu_memory_bytes /
+			   max(memstress_args.guest_page_size,
+			       (uint64_t)getpagesize());
+	int found_gens[5];
 
 	pr_info("\n");
 	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
@@ -319,11 +515,78 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
 
 	/* Repeat on memory that has been marked as idle. */
-	mark_memory_idle(vm, nr_vcpus);
+	if (lru_gen) {
+		/* Do an initial page table scan */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+		TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
+		  "Not all pages tracked in lru_gen stats.\n"
+		  "Is lru_gen enabled? Did the memcg get created properly?");
+
+		/* Find the generation we're currently in (probably youngest) */
+		found_gens[0] = lru_gen_find_generation(&stats, total_pages);
+
+		/* Do an aging pass now */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+		/* Same generation, but a newer generation has been made */
+		found_gens[1] = lru_gen_find_generation(&stats, total_pages);
+		TEST_ASSERT(found_gens[1] == found_gens[0],
+			    "unexpected gen change: %d vs. %d",
+			    found_gens[1], found_gens[0]);
+	} else
+		mark_memory_idle(vm, nr_vcpus);
+
 	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
-	mark_memory_idle(vm, nr_vcpus);
+
+	if (lru_gen) {
+		/* Scan the page tables again */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+		/* The pages should now be young again, so in a newer generation */
+		found_gens[2] = lru_gen_find_generation(&stats, total_pages);
+		TEST_ASSERT(found_gens[2] > found_gens[1],
+			    "pages did not get younger");
+
+		/* Do another aging pass */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+		/* Same generation; new generation has been made */
+		found_gens[3] = lru_gen_find_generation(&stats, total_pages);
+		TEST_ASSERT(found_gens[3] == found_gens[2],
+			    "unexpected gen change: %d vs. %d",
+			    found_gens[3], found_gens[2]);
+	} else
+		mark_memory_idle(vm, nr_vcpus);
+
 	access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
 
+	if (lru_gen) {
+		/* Scan the page tables again */
+		lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+		/* The pages should now be young again, so in a newer generation */
+		found_gens[4] = lru_gen_find_generation(&stats, total_pages);
+		TEST_ASSERT(found_gens[4] > found_gens[3],
+			    "pages did not get younger");
+	}
+}
+
+static void setup_vm_and_run(enum vm_guest_mode mode, void *arg)
+{
+	struct test_params *params = arg;
+	int nr_vcpus = params->nr_vcpus;
+	struct kvm_vm *vm;
+
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+				 params->backing_src, !overlap_memory_access);
+
+	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+
+	if (params->benchmark_lru_gen)
+		run_benchmark(mode, vm, params);
+	else
+		run_test(mode, vm, params);
+
 	memstress_join_vcpu_threads(nr_vcpus);
 	memstress_destroy_vm(vm);
 }
@@ -331,8 +594,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
-	       name);
+	printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]"
+	       " [-s mem_type] [-l] [-p] [-r memcg_root]\n", name);
 	puts("");
 	printf(" -h: Display this help message.");
 	guest_modes_help();
@@ -342,6 +605,9 @@ static void help(char *name)
 	printf(" -v: specify the number of vCPUs to run.\n");
 	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
 	       "     them into a separate region of memory for each vCPU.\n");
+	printf(" -l: Use MGLRU aging instead of idle page tracking\n");
+	printf(" -p: Benchmark MGLRU aging while faulting memory in (requires -l)\n");
+	printf(" -r: The memory cgroup hierarchy root to use (when -l is given)\n");
 	backing_src_help("-s");
 	puts("");
 	exit(0);
@@ -353,13 +619,15 @@ int main(int argc, char *argv[])
 		.backing_src = DEFAULT_VM_MEM_SRC,
 		.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
 		.nr_vcpus = 1,
+		.lru_gen = false,
+		.benchmark_lru_gen = false,
 	};
 	int page_idle_fd;
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+	while ((opt = getopt(argc, argv, "hm:b:v:os:lr:p")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -376,6 +644,15 @@ int main(int argc, char *argv[])
 		case 's':
 			params.backing_src = parse_backing_src_type(optarg);
 			break;
+		case 'l':
+			params.lru_gen = true;
+			break;
+		case 'p':
+			params.benchmark_lru_gen = true;
+			break;
+		case 'r':
+			cgroup_root = strdup(optarg);
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -383,12 +660,44 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
-	__TEST_REQUIRE(page_idle_fd >= 0,
-		       "CONFIG_IDLE_PAGE_TRACKING is not enabled");
-	close(page_idle_fd);
+	if (!params.lru_gen) {
+		page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+		__TEST_REQUIRE(page_idle_fd >= 0,
+			       "CONFIG_IDLE_PAGE_TRACKING is not enabled");
+		close(page_idle_fd);
+	} else {
+		int lru_gen_fd, lru_gen_debug_fd;
+		long mglru_features;
+		char mglru_feature_str[8] = {};
+
+		lru_gen_fd = open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY);
+		__TEST_REQUIRE(lru_gen_fd >= 0, "CONFIG_LRU_GEN is not enabled");
+		TEST_ASSERT(read(lru_gen_fd, mglru_feature_str, 7) > 0,
+			    "couldn't read lru_gen features");
+		close(lru_gen_fd);
+		mglru_features = strtol(mglru_feature_str, NULL, 16);
+		__TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED,
+			       "lru_gen is not enabled");
+		__TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK,
+			       "lru_gen does not support MM_WALK");
+		__TEST_REQUIRE(mglru_features & LRU_GEN_SECONDARY_MMU_WALK,
+			       "lru_gen does not support SECONDARY_MMU_WALK");
+
+		lru_gen_debug_fd = open(DEBUGFS_LRU_GEN, O_RDWR);
+		__TEST_REQUIRE(lru_gen_debug_fd >= 0,
+				"Cannot access %s", DEBUGFS_LRU_GEN);
+		close(lru_gen_debug_fd);
+	}
+
+	TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen,
+		    "-p specified without -l");
+
+	if (params.lru_gen) {
+		create_memcg(TEST_MEMCG_NAME);
+		move_to_memcg(TEST_MEMCG_NAME, getpid());
+	}
 
-	for_each_guest_mode(run_test, &params);
+	for_each_guest_mode(setup_vm_and_run, &params);
 
 	return 0;
 }
diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/testing/selftests/kvm/include/lru_gen_util.h
new file mode 100644
index 000000000000..4eef8085a3cb
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/lru_gen_util.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Tools for integrating with lru_gen, like parsing the lru_gen debugfs output.
+ *
+ * Copyright (C) 2024, Google LLC.
+ */
+#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H
+#define SELFTEST_KVM_LRU_GEN_UTIL_H
+
+#include <inttypes.h>
+#include <limits.h>
+#include <stdlib.h>
+
+#include "test_util.h"
+
+#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */
+#define MAX_NR_NODES 4 /* Maximum number of nodes we support */
+
+#define DEBUGFS_LRU_GEN "/sys/kernel/debug/lru_gen"
+
+struct generation_stats {
+	int gen;
+	long age_ms;
+	long nr_anon;
+	long nr_file;
+};
+
+struct node_stats {
+	int node;
+	int nr_gens; /* Number of populated gens entries. */
+	struct generation_stats gens[MAX_NR_GENS];
+};
+
+struct memcg_stats {
+	unsigned long memcg_id;
+	int nr_nodes; /* Number of populated nodes entries. */
+	struct node_stats nodes[MAX_NR_NODES];
+};
+
+void print_memcg_stats(const struct memcg_stats *stats, const char *name);
+
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+long sum_memcg_stats(const struct memcg_stats *stats);
+
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg);
+
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg);
+
+int lru_gen_find_generation(const struct memcg_stats *stats,
+			    unsigned long total_pages);
+
+#endif /* SELFTEST_KVM_LRU_GEN_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/lru_gen_util.c b/tools/testing/selftests/kvm/lib/lru_gen_util.c
new file mode 100644
index 000000000000..3c02a635a9f7
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/lru_gen_util.c
@@ -0,0 +1,391 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024, Google LLC.
+ */
+
+#include <time.h>
+
+#include "lru_gen_util.h"
+
+/*
+ * Tracks state while we parse memcg lru_gen stats. The file we're parsing is
+ * structured like this (some extra whitespace elided):
+ *
+ * memcg (id) (path)
+ * node (id)
+ * (gen_nr) (age_in_ms) (nr_anon_pages) (nr_file_pages)
+ */
+struct memcg_stats_parse_context {
+	bool consumed; /* Whether or not this line was consumed */
+	/* Next parse handler to invoke */
+	void (*next_handler)(struct memcg_stats *,
+			     struct memcg_stats_parse_context *, char *);
+	int current_node_idx; /* Current index in nodes array */
+	const char *name; /* The name of the memcg we're looking for */
+};
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+					 struct memcg_stats_parse_context *ctx,
+					 char *line);
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+					struct memcg_stats_parse_context *ctx,
+					char *line);
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+				       struct memcg_stats_parse_context *ctx,
+				       char *line);
+
+struct split_iterator {
+	char *str;
+	char *save;
+};
+
+static char *split_next(struct split_iterator *it)
+{
+	char *ret = strtok_r(it->str, " \t\n\r", &it->save);
+
+	it->str = NULL;
+	return ret;
+}
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+					 struct memcg_stats_parse_context *ctx,
+					 char *line)
+{
+	struct split_iterator it = { .str = line };
+	char *prefix = split_next(&it);
+	char *memcg_id = split_next(&it);
+	char *memcg_name = split_next(&it);
+	char *end;
+
+	ctx->consumed = true;
+
+	if (!prefix || strcmp("memcg", prefix))
+		return; /* Not a memcg line (maybe empty), skip */
+
+	TEST_ASSERT(memcg_id && memcg_name,
+		    "malformed memcg line; no memcg id or memcg_name");
+
+	if (strcmp(memcg_name + 1, ctx->name))
+		return; /* Wrong memcg, skip */
+
+	/* Found it! */
+
+	stats->memcg_id = strtoul(memcg_id, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed memcg id '%s'", memcg_id);
+	if (!stats->memcg_id)
+		return; /* Removed memcg? */
+
+	ctx->next_handler = memcg_stats_handle_in_memcg;
+}
+
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+					struct memcg_stats_parse_context *ctx,
+					char *line)
+{
+	struct split_iterator it = { .str = line };
+	char *prefix = split_next(&it);
+	char *id = split_next(&it);
+	long found_node_id;
+	char *end;
+
+	ctx->consumed = true;
+	ctx->current_node_idx = -1;
+
+	if (!prefix)
+		return; /* Skip empty lines */
+
+	if (!strcmp("memcg", prefix)) {
+		/* Memcg done, found next one; stop. */
+		ctx->next_handler = NULL;
+		return;
+	} else if (strcmp("node", prefix))
+		TEST_ASSERT(false, "found malformed line after 'memcg ...', "
+				   "token: '%s'", prefix);
+
+	/* At this point we know we have a node line. Parse the ID. */
+
+	TEST_ASSERT(id, "malformed node line; no node id");
+
+	found_node_id = strtol(id, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed node id '%s'", id);
+
+	ctx->current_node_idx = stats->nr_nodes++;
+	TEST_ASSERT(ctx->current_node_idx < MAX_NR_NODES,
+		    "memcg has stats for too many nodes, max is %d",
+		    MAX_NR_NODES);
+	stats->nodes[ctx->current_node_idx].node = found_node_id;
+
+	ctx->next_handler = memcg_stats_handle_in_node;
+}
+
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+				       struct memcg_stats_parse_context *ctx,
+				       char *line)
+{
+	/* Have to copy since we might not consume */
+	char *my_line = strdup(line);
+	struct split_iterator it = { .str = my_line };
+	char *gen, *age, *nr_anon, *nr_file;
+	struct node_stats *node_stats;
+	struct generation_stats *gen_stats;
+	char *end;
+
+	TEST_ASSERT(it.str, "failed to copy input line");
+
+	gen = split_next(&it);
+
+	/* Skip empty lines */
+	if (!gen)
+		goto out_consume;
+
+	if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
+		/*
+		 * Reached next memcg or node section. Don't consume, let the
+		 * other handler deal with this.
+		 */
+		ctx->next_handler = memcg_stats_handle_in_memcg;
+		goto out;
+	}
+
+	node_stats = &stats->nodes[ctx->current_node_idx];
+	TEST_ASSERT(node_stats->nr_gens < MAX_NR_GENS,
+		    "found too many generation lines; max is %d",
+		    MAX_NR_GENS);
+	gen_stats = &node_stats->gens[node_stats->nr_gens++];
+
+	age = split_next(&it);
+	nr_anon = split_next(&it);
+	nr_file = split_next(&it);
+
+	TEST_ASSERT(age && nr_anon && nr_file,
+		    "malformed generation line; not enough tokens");
+
+	gen_stats->gen = (int)strtol(gen, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed generation number '%s'", gen);
+
+	gen_stats->age_ms = strtol(age, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed generation age '%s'", age);
+
+	gen_stats->nr_anon = strtol(nr_anon, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed anonymous page count '%s'",
+		    nr_anon);
+
+	gen_stats->nr_file = strtol(nr_file, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed file page count '%s'", nr_file);
+
+out_consume:
+	ctx->consumed = true;
+out:
+	free(my_line);
+}
+
+/* Pretty-print lru_gen @stats. */
+void print_memcg_stats(const struct memcg_stats *stats, const char *name)
+{
+	int node, gen;
+
+	fprintf(stderr, "stats for memcg %s (id %lu):\n",
+			name, stats->memcg_id);
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
+		for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+			const struct generation_stats *gstats =
+				&stats->nodes[node].gens[gen];
+
+			fprintf(stderr,
+				"\t\tgen %d\tage_ms %ld"
+				"\tnr_anon %ld\tnr_file %ld\n",
+				gstats->gen, gstats->age_ms, gstats->nr_anon,
+				gstats->nr_file);
+		}
+	}
+}
+
+/* Re-read lru_gen debugfs information for @memcg into @stats. */
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+	FILE *f;
+	ssize_t read = 0;
+	char *line = NULL;
+	size_t bufsz = 0;
+	struct memcg_stats_parse_context ctx = {
+		.next_handler = memcg_stats_handle_searching,
+		.name = memcg,
+	};
+
+	memset(stats, 0, sizeof(struct memcg_stats));
+
+	f = fopen(DEBUGFS_LRU_GEN, "r");
+	TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+
+	while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
+		ctx.consumed = false;
+
+		do {
+			ctx.next_handler(stats, &ctx, line);
+			if (!ctx.next_handler)
+				break;
+		} while (!ctx.consumed);
+	}
+
+	if (read < 0 && !feof(f))
+		TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
+
+	TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
+		    "Did the memcg get created in the proper mount?",
+		    memcg);
+	if (line)
+		free(line);
+	TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+/*
+ * Sum the pages tracked by lru_gen for this memcg in generation @target_gen.
+ *
+ * If @target_gen is negative, sum over all generations.
+ */
+static long sum_memcg_stats_for_gen(int target_gen,
+				    const struct memcg_stats *stats)
+{
+	int node, gen;
+	long total_nr = 0;
+
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		const struct node_stats *node_stats = &stats->nodes[node];
+
+		for (gen = 0; gen < node_stats->nr_gens; ++gen) {
+			const struct generation_stats *gen_stats =
+				&node_stats->gens[gen];
+
+			if (target_gen >= 0 && gen_stats->gen != target_gen)
+				continue;
+
+			total_nr += gen_stats->nr_anon + gen_stats->nr_file;
+		}
+	}
+
+	return total_nr;
+}
+
+/* Sum all pages tracked by lru_gen for this memcg. */
+long sum_memcg_stats(const struct memcg_stats *stats)
+{
+	return sum_memcg_stats_for_gen(-1, stats);
+}
+
+/* Read the memcg stats and optionally print if this is a debug build. */
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+	read_memcg_stats(stats, memcg);
+#ifdef DEBUG
+	print_memcg_stats(stats, memcg);
+#endif
+}
+
+/*
+ * Whether lru_gen aging should force page table scanning.
+ *
+ * If you want to set this to false, you will need to do eviction
+ * before doing extra aging passes.
+ */
+static const bool force_scan = true;
+
+static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
+{
+	FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
+	char *command;
+	int sz;
+
+	TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+	sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
+		      memcg_id, node_id, max_gen, force_scan);
+	TEST_ASSERT(sz > 0, "creating aging command failed");
+
+	pr_debug("Running aging command: %s", command);
+	TEST_ASSERT(fwrite(command, sizeof(char), sz, f) == (size_t)sz,
+		    "writing aging command %s to %s failed",
+		    command, DEBUGFS_LRU_GEN);
+	free(command);
+
+	TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,
+			      bool verbose)
+{
+	int node, gen;
+	struct timespec ts_start;
+	struct timespec ts_elapsed;
+
+	pr_debug("lru_gen: invoking aging...\n");
+
+	/* Must read memcg stats to construct the proper aging command. */
+	read_print_memcg_stats(stats, memcg);
+
+	if (verbose)
+		clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		int max_gen = 0;
+
+		for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+			int this_gen = stats->nodes[node].gens[gen].gen;
+
+			max_gen = max_gen > this_gen ? max_gen : this_gen;
+		}
+
+		run_aging_impl(stats->memcg_id, stats->nodes[node].node,
+			       max_gen);
+	}
+
+	if (verbose) {
+		ts_elapsed = timespec_elapsed(ts_start);
+		pr_info("%-30s: %ld.%09lds\n", "lru_gen: Aging",
+			ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+	}
+
+	/* Re-read so callers get updated information */
+	read_print_memcg_stats(stats, memcg);
+}
+
+/* Do aging, and print how long it took. */
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
+{
+	_lru_gen_do_aging(stats, memcg, true);
+}
+
+/* Do aging, don't print anything. */
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
+{
+	_lru_gen_do_aging(stats, memcg, false);
+}
+
+/*
+ * Find which generation contains more than half of @total_pages, assuming that
+ * such a generation exists.
+ */
+int lru_gen_find_generation(const struct memcg_stats *stats,
+			    unsigned long total_pages)
+{
+	int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
+
+	for (node = 0; node < stats->nr_nodes; ++node)
+		for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
+		     ++gen_idx) {
+			gen = stats->nodes[node].gens[gen_idx].gen;
+			max_gen = gen > max_gen ? gen : max_gen;
+			min_gen = gen < min_gen ? gen : min_gen;
+		}
+
+	for (gen = min_gen; gen <= max_gen; ++gen)
+		/* See if more than half of the pages are in this generation. */
+		if (sum_memcg_stats_for_gen(gen, stats) >
+				total_pages / 2)
+			return gen;
+
+	TEST_ASSERT(false, "No generation includes majority of %lu pages.",
+		    total_pages);
+
+	/* unreachable, but make the compiler happy */
+	return -1;
+}
-- 
2.46.0.rc1.232.g9752f9e123-goog
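
For readers following the parser in lru_gen_util.c above: the sketch below is a minimal standalone illustration of the same idea, assuming the debugfs layout described in the patch comment ("memcg (id) (path)", "node (id)", then one "(gen_nr) (age_in_ms) (nr_anon_pages) (nr_file_pages)" line per generation). It is not the patch's state-machine parser: it reads from an in-memory string, ignores memcg and node boundaries, and the sketch_* names are hypothetical and appear nowhere in the series.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_GENS 16	/* hypothetical cap for this sketch */

struct sketch_gen {
	int gen;
	long nr_pages;	/* nr_anon + nr_file */
};

/*
 * Collect generation lines from @text into @gens. Lines that do not consist
 * of four numbers (such as "memcg ..." and "node ..." headers or empty lines)
 * are skipped. Returns the number of entries stored.
 */
static int sketch_parse_gens(const char *text, struct sketch_gen *gens)
{
	int n = 0;

	while (*text && n < SKETCH_MAX_GENS) {
		size_t len = strcspn(text, "\n");
		char line[256];
		int gen;
		long age_ms, nr_anon, nr_file;

		/* Copy one line (truncated if too long) so sscanf() stops at its end. */
		snprintf(line, sizeof(line), "%.*s", (int)len, text);
		text += len;
		if (*text == '\n')
			text++;

		if (sscanf(line, "%d %ld %ld %ld",
			   &gen, &age_ms, &nr_anon, &nr_file) != 4)
			continue;

		gens[n].gen = gen;
		gens[n].nr_pages = nr_anon + nr_file;
		n++;
	}

	return n;
}

/* Return the generation holding more than half of @total_pages, or -1. */
static int sketch_find_majority_gen(const struct sketch_gen *gens, int n,
				    long total_pages)
{
	int i, j;

	for (i = 0; i < n; i++) {
		long sum = 0;

		/* The same generation number may appear once per node. */
		for (j = 0; j < n; j++)
			if (gens[j].gen == gens[i].gen)
				sum += gens[j].nr_pages;

		if (sum > total_pages / 2)
			return gens[i].gen;
	}

	return -1;
}
```

Because every parsed entry is considered, the youngest generation is included in the majority search. Returning -1 rather than asserting leaves the failure policy to the caller.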


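On the aging command that run_aging_impl() writes to the lru_gen debugfs file: a minimal sketch of building the same "+ memcg_id node_id max_gen 1 force_scan" line with the formatting result checked as a signed int, so that an error return cannot be mistaken for a large length. This sketch swaps asprintf() for snprintf() into a caller-provided buffer purely to stay within standard C. The function name and buffer handling are assumptions for illustration, not code from the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Format the aging command into @buf. The fourth field is the constant 1,
 * as in the patch's format string.
 *
 * Returns the command length on success, or -1 on a formatting error or if
 * the command would not fit in @size bytes (including the terminating NUL).
 */
static int sketch_format_aging_command(char *buf, size_t size,
				       unsigned long memcg_id, int node_id,
				       int max_gen, bool force_scan)
{
	int len = snprintf(buf, size, "+ %lu %d %d 1 %d\n",
			   memcg_id, node_id, max_gen, force_scan);

	/* Keep the result signed until the error case has been ruled out. */
	if (len < 0 || (size_t)len >= size)
		return -1;

	return len;
}
```

A caller would then write exactly the returned number of bytes to the debugfs file and treat a short write as a failure.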


end of thread, other threads:[~2024-08-30 20:22 UTC | newest]

Thread overview: 40+ messages
2024-07-24  1:10 [PATCH v6 00/11] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
2024-07-24  1:10 ` [PATCH v6 01/11] KVM: Add lockless memslot walk to KVM James Houghton
2024-07-25 16:39   ` David Matlack
2024-07-26  0:28     ` James Houghton
2024-07-24  1:10 ` [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
2024-07-25 18:07   ` David Matlack
2024-07-26  0:34     ` James Houghton
2024-08-17  1:05   ` Sean Christopherson
2024-08-30  0:35     ` James Houghton
2024-08-30  3:47       ` Sean Christopherson
2024-08-30 12:47         ` Jason Gunthorpe
2024-08-30 17:09           ` Sean Christopherson
2024-08-30 20:22             ` Jason Gunthorpe
2024-07-24  1:10 ` [PATCH v6 03/11] KVM: arm64: " James Houghton
2024-07-25 21:55   ` James Houghton
2024-08-17  0:46     ` Sean Christopherson
2024-08-17  1:03       ` Yu Zhao
2024-08-19 20:41         ` Oliver Upton
2024-08-19 22:47           ` Sean Christopherson
2024-08-30  0:33           ` James Houghton
2024-08-30  0:48             ` Oliver Upton
2024-08-30 15:33               ` David Matlack
2024-08-30 17:38                 ` Oliver Upton
2024-07-24  1:10 ` [PATCH v6 04/11] mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER James Houghton
2024-08-01  9:34   ` David Hildenbrand
2024-07-24  1:10 ` [PATCH v6 05/11] mm: Add fast_only bool to test_young and clear_young MMU notifiers James Houghton
2024-08-01  9:36   ` David Hildenbrand
2024-08-01 23:13     ` James Houghton
2024-08-02 15:57       ` David Hildenbrand
2024-08-05 16:54         ` James Houghton
2024-07-24  1:10 ` [PATCH v6 06/11] mm: Add has_fast_aging to struct mmu_notifier James Houghton
2024-07-24  1:10 ` [PATCH v6 07/11] KVM: Pass fast_only to kvm_{test_,}age_gfn James Houghton
2024-07-24  1:10 ` [PATCH v6 08/11] KVM: x86: Optimize kvm_{test_,}age_gfn a little bit James Houghton
2024-07-25 18:17   ` David Matlack
2024-08-17  1:00     ` Sean Christopherson
2024-08-30  0:34       ` James Houghton
2024-07-24  1:10 ` [PATCH v6 09/11] KVM: x86: Implement fast_only versions of kvm_{test_,}age_gfn James Houghton
2024-07-25 18:24   ` David Matlack
2024-07-24  1:10 ` [PATCH v6 10/11] mm: multi-gen LRU: Have secondary MMUs participate in aging James Houghton
2024-07-24  1:10 ` [PATCH v6 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton
