* [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock
@ 2025-07-07 22:47 James Houghton
  2025-07-07 22:47 ` [PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately James Houghton
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: James Houghton @ 2025-07-07 22:47 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vipin Sharma, David Matlack, James Houghton, kvm, linux-kernel

Hi Sean/Paolo,

I'm finishing off Vipin's NX huge page recovery optimization for the TDP
MMU from last year. This is a respin of the series I sent a couple of
weeks ago (v4). Below is a mostly unchanged cover letter from v4.

NX huge page recovery can cause guest performance jitter, originally
noticed with network tests in Windows guests. Please see Vipin's earlier
performance results[1]. Below is some new data I have collected with the
nx_huge_pages_perf_test that I've included with this series.

The NX huge page recovery for the shadow MMU is still done under the MMU
write lock, but with the TDP MMU, we can instead do it under the MMU
read lock by:

1. Tracking the possible NX huge pages for the two MMUs separately
   (patch 1).
2. Updating the NX huge page recovery routine for the TDP MMU to
    - zap SPTEs atomically, and
    - grab tdp_mmu_pages_lock to iterate over the NX huge page list
   (patch 3).
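
For reference, the resulting TDP MMU recovery loop ends up looking
roughly like the sketch below. This is heavily simplified from patches 3
and 4 (SRCU, the dirty-logging check, rescheduling, and the shadow MMU
path are omitted, and the final flush is collapsed into a plain
kvm_flush_remote_tlbs()); the names match the patches:

  static void recover_tdp_mmu_nx_huge_pages_sketch(struct kvm *kvm)
  {
  	unsigned long to_zap = nx_huge_pages_to_zap(kvm, KVM_TDP_MMU);
  	struct list_head *nx_huge_pages =
  		&kvm->arch.possible_nx_huge_pages[KVM_TDP_MMU].pages;
  	struct kvm_mmu_page *sp;
  	bool flush = false;

  	read_lock(&kvm->mmu_lock);
  	rcu_read_lock();	/* TDP MMU pages are freed via RCU. */

  	for ( ; to_zap; --to_zap) {
  		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
  		if (list_empty(nx_huge_pages)) {
  			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
  			break;
  		}
  		sp = list_first_entry(nx_huge_pages, struct kvm_mmu_page,
  				      possible_nx_huge_page_link);
  		unaccount_nx_huge_page(kvm, sp);
  		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

  		/* Atomic zap; fails harmlessly if the SPTE has changed. */
  		flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
  	}

  	if (flush)
  		kvm_flush_remote_tlbs(kvm);

  	rcu_read_unlock();
  	read_unlock(&kvm->mmu_lock);
  }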

I threw in patch 4 because it seems harmless and closer to the "right"
thing to do. Feel free to drop it if you don't agree with me. :)

I'm also grabbing David's execute_perf_test[3] while I'm at it. It was
dropped before simply because it didn't apply at the time. David's test
works well as a stress test for NX huge page recovery when NX huge page
recovery is tuned to be very aggressive.

Changes since v4[4]:
- 32-bit build fixups for patch 1 and 3.
- Small variable rename in patch 3.

Changes since v3[2]:
- Dropped the move of the `sp->nx_huge_page_disallowed` check to outside
  of the tdp_mmu_pages_lock.
- Implemented Sean's array suggestion for `possible_nx_huge_pages`.
- Implemented some other cleanup suggestions from Sean.
- Made shadow MMU not take the RCU lock in NX huge page recovery.
- Added a selftest for measuring jitter.
- Added David's execute_perf_test[3].

-- Results
$ cat /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms
100
$ cat /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio
4

$ ./nx_huge_pages_perf_test -b 16G -s anonymous_hugetlb_1gb
[Unpatched] Max fault latency: 8496724 cycles
[Unpatched] Max fault latency: 8404426 cycles
[ Patched ] Max fault latency: 49418 cycles
[ Patched ] Max fault latency: 51948 cycles

$ ./nx_huge_pages_perf_test -b 16G -s anonymous_hugetlb_2mb
[Unpatched] Max fault latency: 5320740 cycles
[Unpatched] Max fault latency: 5384554 cycles
[ Patched ] Max fault latency: 50052 cycles
[ Patched ] Max fault latency: 103774 cycles

$ ./nx_huge_pages_perf_test -b 16G -s anonymous_thp
[Unpatched] Max fault latency: 7625022 cycles
[Unpatched] Max fault latency: 6339934 cycles
[ Patched ] Max fault latency: 107976 cycles
[ Patched ] Max fault latency: 108386 cycles

$ ./nx_huge_pages_perf_test -b 16G -s anonymous
[Unpatched] Max fault latency: 143036 cycles
[Unpatched] Max fault latency: 287444 cycles
[ Patched ] Max fault latency: 274626 cycles
[ Patched ] Max fault latency: 303984 cycles

We can see about a 100x decrease in maximum fault latency for both
2M pages and 1G pages. This test is only timing writes to unmapped
pages that are not themselves currently undergoing NX huge page
recovery. The test only produces interesting results when NX huge page
recovery is actually occurring, so the parameters are tuned to make it
very likely for NX huge page recovery to occur in the middle of the
test.

Based on latest kvm/next.

[1]: https://lore.kernel.org/kvm/20240906204515.3276696-3-vipinsh@google.com/
[2]: https://lore.kernel.org/kvm/20240906204515.3276696-1-vipinsh@google.com/
[3]: https://lore.kernel.org/kvm/20221109185905.486172-2-dmatlack@google.com/
[4]: https://lore.kernel.org/kvm/20250616181144.2874709-1-jthoughton@google.com/

David Matlack (1):
  KVM: selftests: Introduce a selftest to measure execution performance

James Houghton (3):
  KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU
  KVM: selftests: Provide extra mmap flags in vm_mem_add()
  KVM: selftests: Add an NX huge pages jitter test

Vipin Sharma (3):
  KVM: x86/mmu: Track TDP MMU NX huge pages separately
  KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its
    purpose
  KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

 arch/x86/include/asm/kvm_host.h               |  43 +++-
 arch/x86/kvm/mmu/mmu.c                        | 180 +++++++++-----
 arch/x86/kvm/mmu/mmu_internal.h               |   7 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  49 +++-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   3 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../testing/selftests/kvm/execute_perf_test.c | 199 ++++++++++++++++
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +-
 .../testing/selftests/kvm/include/memstress.h |   4 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  15 +-
 tools/testing/selftests/kvm/lib/memstress.c   |  25 +-
 .../kvm/x86/nx_huge_pages_perf_test.c         | 223 ++++++++++++++++++
 .../kvm/x86/private_mem_conversions_test.c    |   2 +-
 13 files changed, 656 insertions(+), 99 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c
 create mode 100644 tools/testing/selftests/kvm/x86/nx_huge_pages_perf_test.c


base-commit: 8046d29dde17002523f94d3e6e0ebe486ce52166
-- 
2.50.0.727.gbf7dc18ff4-goog


* [PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
@ 2025-07-07 22:47 ` James Houghton
  2025-08-19 17:57   ` Sean Christopherson
  2025-07-07 22:47 ` [PATCH v5 2/7] KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its purpose James Houghton
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-07-07 22:47 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vipin Sharma, David Matlack, James Houghton, kvm, linux-kernel

From: Vipin Sharma <vipinsh@google.com>

Introduce struct kvm_possible_nx_huge_pages to track the list of
possible NX huge pages and the number of pages on the list.

When calculating how many pages to zap, use the new per-MMU counts
instead of kvm->stat.nx_lpage_splits (which is now the sum of the two
new counts).
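
As a usage sketch (all names as in the diff below): accounting now
indexes the per-MMU array, and the recovery worker walks each MMU type
with its own zap budget:

  /* TDP MMU accounting, under kvm->arch.tdp_mmu_pages_lock: */
  track_possible_nx_huge_page(kvm, sp, KVM_TDP_MMU);

  /* Shadow MMU accounting, under the MMU write lock: */
  track_possible_nx_huge_page(kvm, sp, KVM_SHADOW_MMU);

  /* Recovery runs once per MMU type, each with its own budget: */
  for (i = 0; i < KVM_NR_MMU_TYPES; ++i)
  	kvm_recover_nx_huge_pages(kvm, i);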

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/include/asm/kvm_host.h | 43 ++++++++++++++++--------
 arch/x86/kvm/mmu/mmu.c          | 58 +++++++++++++++++++++------------
 arch/x86/kvm/mmu/mmu_internal.h |  7 ++--
 arch/x86/kvm/mmu/tdp_mmu.c      |  4 +--
 4 files changed, 75 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b4a391929cdba..d544a269c1920 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1334,6 +1334,34 @@ enum kvm_apicv_inhibit {
 	__APICV_INHIBIT_REASON(SEV),			\
 	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
 
+struct kvm_possible_nx_huge_pages {
+	/*
+	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
+	 * replaced by an NX huge page.  A shadow page is on this list if its
+	 * existence disallows an NX huge page (nx_huge_page_disallowed is set)
+	 * and there are no other conditions that prevent a huge page, e.g.
+	 * the backing host page is huge, dirty logging is not enabled for its
+	 * memslot, etc...  Note, zapping shadow pages on this list doesn't
+	 * guarantee an NX huge page will be created in its stead, e.g. if the
+	 * guest attempts to execute from the region then KVM obviously can't
+	 * create an NX huge page (without hanging the guest).
+	 */
+	struct list_head pages;
+	u64 nr_pages;
+};
+
+enum kvm_mmu_type {
+	KVM_SHADOW_MMU,
+#ifdef CONFIG_X86_64
+	KVM_TDP_MMU,
+#endif
+	KVM_NR_MMU_TYPES,
+};
+
+#ifndef CONFIG_X86_64
+#define KVM_TDP_MMU -1
+#endif
+
 struct kvm_arch {
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
@@ -1346,18 +1374,7 @@ struct kvm_arch {
 	bool pre_fault_allowed;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
-	/*
-	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
-	 * replaced by an NX huge page.  A shadow page is on this list if its
-	 * existence disallows an NX huge page (nx_huge_page_disallowed is set)
-	 * and there are no other conditions that prevent a huge page, e.g.
-	 * the backing host page is huge, dirtly logging is not enabled for its
-	 * memslot, etc...  Note, zapping shadow pages on this list doesn't
-	 * guarantee an NX huge page will be created in its stead, e.g. if the
-	 * guest attempts to execute from the region then KVM obviously can't
-	 * create an NX huge page (without hanging the guest).
-	 */
-	struct list_head possible_nx_huge_pages;
+	struct kvm_possible_nx_huge_pages possible_nx_huge_pages[KVM_NR_MMU_TYPES];
 #ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
 	struct kvm_page_track_notifier_head track_notifier_head;
 #endif
@@ -1516,7 +1533,7 @@ struct kvm_arch {
 	 * is held in read mode:
 	 *  - tdp_mmu_roots (above)
 	 *  - the link field of kvm_mmu_page structs used by the TDP MMU
-	 *  - possible_nx_huge_pages;
+	 *  - possible_nx_huge_pages[KVM_TDP_MMU];
 	 *  - the possible_nx_huge_page_link field of kvm_mmu_page structs used
 	 *    by the TDP MMU
 	 * Because the lock is only taken within the MMU lock, strictly
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4e06e2e89a8fa..f44d7f3acc179 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -65,9 +65,9 @@ int __read_mostly nx_huge_pages = -1;
 static uint __read_mostly nx_huge_pages_recovery_period_ms;
 #ifdef CONFIG_PREEMPT_RT
 /* Recovery can cause latency spikes, disable it for PREEMPT_RT.  */
-static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
+unsigned int __read_mostly nx_huge_pages_recovery_ratio;
 #else
-static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
+unsigned int __read_mostly nx_huge_pages_recovery_ratio = 60;
 #endif
 
 static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);
@@ -776,7 +776,8 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 		kvm_flush_remote_tlbs_gfn(kvm, gfn, PG_LEVEL_4K);
 }
 
-void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				 enum kvm_mmu_type mmu_type)
 {
 	/*
 	 * If it's possible to replace the shadow page with an NX huge page,
@@ -790,8 +791,9 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 		return;
 
 	++kvm->stat.nx_lpage_splits;
+	++kvm->arch.possible_nx_huge_pages[mmu_type].nr_pages;
 	list_add_tail(&sp->possible_nx_huge_page_link,
-		      &kvm->arch.possible_nx_huge_pages);
+		      &kvm->arch.possible_nx_huge_pages[mmu_type].pages);
 }
 
 static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
@@ -800,7 +802,7 @@ static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	sp->nx_huge_page_disallowed = true;
 
 	if (nx_huge_page_possible)
-		track_possible_nx_huge_page(kvm, sp);
+		track_possible_nx_huge_page(kvm, sp, KVM_SHADOW_MMU);
 }
 
 static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -819,12 +821,14 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	kvm_mmu_gfn_allow_lpage(slot, gfn);
 }
 
-void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				   enum kvm_mmu_type mmu_type)
 {
 	if (list_empty(&sp->possible_nx_huge_page_link))
 		return;
 
 	--kvm->stat.nx_lpage_splits;
+	--kvm->arch.possible_nx_huge_pages[mmu_type].nr_pages;
 	list_del_init(&sp->possible_nx_huge_page_link);
 }
 
@@ -832,7 +836,7 @@ static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	sp->nx_huge_page_disallowed = false;
 
-	untrack_possible_nx_huge_page(kvm, sp);
+	untrack_possible_nx_huge_page(kvm, sp, KVM_SHADOW_MMU);
 }
 
 static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
@@ -6684,9 +6688,12 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 
 void kvm_mmu_init_vm(struct kvm *kvm)
 {
+	int i;
+
 	kvm->arch.shadow_mmio_value = shadow_mmio_value;
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
-	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
+	for (i = 0; i < KVM_NR_MMU_TYPES; ++i)
+		INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages[i].pages);
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
 	if (tdp_mmu_enabled)
@@ -7519,16 +7526,27 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 	return err;
 }
 
-static void kvm_recover_nx_huge_pages(struct kvm *kvm)
+static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
+					  enum kvm_mmu_type mmu_type)
+{
+	unsigned long pages = READ_ONCE(kvm->arch.possible_nx_huge_pages[mmu_type].nr_pages);
+	unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
+
+	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
+}
+
+static void kvm_recover_nx_huge_pages(struct kvm *kvm,
+				      enum kvm_mmu_type mmu_type)
 {
-	unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
+	unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
+	struct list_head *nx_huge_pages;
 	struct kvm_memory_slot *slot;
-	int rcu_idx;
 	struct kvm_mmu_page *sp;
-	unsigned int ratio;
 	LIST_HEAD(invalid_list);
 	bool flush = false;
-	ulong to_zap;
+	int rcu_idx;
+
+	nx_huge_pages = &kvm->arch.possible_nx_huge_pages[mmu_type].pages;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
 	write_lock(&kvm->mmu_lock);
@@ -7540,10 +7558,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 	 */
 	rcu_read_lock();
 
-	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
-	to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
 	for ( ; to_zap; --to_zap) {
-		if (list_empty(&kvm->arch.possible_nx_huge_pages))
+		if (list_empty(nx_huge_pages))
 			break;
 
 		/*
@@ -7553,7 +7569,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 		 * the total number of shadow pages.  And because the TDP MMU
 		 * doesn't use active_mmu_pages.
 		 */
-		sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
+		sp = list_first_entry(nx_huge_pages,
 				      struct kvm_mmu_page,
 				      possible_nx_huge_page_link);
 		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
@@ -7590,7 +7606,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 
 		if (slot && kvm_slot_dirty_track_enabled(slot))
 			unaccount_nx_huge_page(kvm, sp);
-		else if (is_tdp_mmu_page(sp))
+		else if (mmu_type == KVM_TDP_MMU)
 			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
 		else
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
@@ -7621,9 +7637,10 @@ static void kvm_nx_huge_page_recovery_worker_kill(void *data)
 static bool kvm_nx_huge_page_recovery_worker(void *data)
 {
 	struct kvm *kvm = data;
+	long remaining_time;
 	bool enabled;
 	uint period;
-	long remaining_time;
+	int i;
 
 	enabled = calc_nx_huge_pages_recovery_period(&period);
 	if (!enabled)
@@ -7638,7 +7655,8 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
 	}
 
 	__set_current_state(TASK_RUNNING);
-	kvm_recover_nx_huge_pages(kvm);
+	for (i = 0; i < KVM_NR_MMU_TYPES; ++i)
+		kvm_recover_nx_huge_pages(kvm, i);
 	kvm->arch.nx_huge_page_last = get_jiffies_64();
 	return true;
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de624..a8fd2de13f707 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -413,7 +413,10 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
 
-void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
-void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				 enum kvm_mmu_type mmu_type);
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				   enum kvm_mmu_type mmu_type);
 
+extern unsigned int nx_huge_pages_recovery_ratio;
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1f..48b070f9f4e13 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -355,7 +355,7 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	sp->nx_huge_page_disallowed = false;
-	untrack_possible_nx_huge_page(kvm, sp);
+	untrack_possible_nx_huge_page(kvm, sp, KVM_TDP_MMU);
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 }
 
@@ -1303,7 +1303,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		    fault->req_level >= iter.level) {
 			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 			if (sp->nx_huge_page_disallowed)
-				track_possible_nx_huge_page(kvm, sp);
+				track_possible_nx_huge_page(kvm, sp, KVM_TDP_MMU);
 			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 		}
 	}
-- 
2.50.0.727.gbf7dc18ff4-goog


* [PATCH v5 2/7] KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its purpose
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
  2025-07-07 22:47 ` [PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately James Houghton
@ 2025-07-07 22:47 ` James Houghton
  2025-07-07 22:47 ` [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock James Houghton
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: James Houghton @ 2025-07-07 22:47 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vipin Sharma, David Matlack, James Houghton, kvm, linux-kernel

From: Vipin Sharma <vipinsh@google.com>

kvm_tdp_mmu_zap_sp() is only used for NX huge page recovery, so rename
it to kvm_tdp_mmu_zap_possible_nx_huge_page(). In a future commit, this
function will be changed to include logic specific to NX huge page
recovery.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 2 +-
 arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
 arch/x86/kvm/mmu/tdp_mmu.h | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f44d7f3acc179..b074f7bb5cc58 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7607,7 +7607,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 		if (slot && kvm_slot_dirty_track_enabled(slot))
 			unaccount_nx_huge_page(kvm, sp);
 		else if (mmu_type == KVM_TDP_MMU)
-			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
+			flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
 		else
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 48b070f9f4e13..19907eb04a9c4 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -925,7 +925,8 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_unlock();
 }
 
-bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+					   struct kvm_mmu_page *sp)
 {
 	u64 old_spte;
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 52acf99d40a00..bd62977c9199e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -64,7 +64,8 @@ static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
 }
 
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
-bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+					   struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
 				  enum kvm_tdp_mmu_root_types root_types);
-- 
2.50.0.727.gbf7dc18ff4-goog


* [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
  2025-07-07 22:47 ` [PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately James Houghton
  2025-07-07 22:47 ` [PATCH v5 2/7] KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its purpose James Houghton
@ 2025-07-07 22:47 ` James Houghton
  2025-07-23 20:34   ` Sean Christopherson
  2025-07-07 22:47 ` [PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU James Houghton
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-07-07 22:47 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vipin Sharma, David Matlack, James Houghton, kvm, linux-kernel

From: Vipin Sharma <vipinsh@google.com>

Use MMU read lock to recover TDP MMU NX huge pages. Iterate
over the huge pages list under tdp_mmu_pages_lock protection and
unaccount the page before dropping the lock.

We must not zap an SPTE if:
- The SP is a root page.
- The SPTE does not point at the SP's page table.

If the SPTE does not point at the SP's page table, then something else
has changed the SPTE, so we cannot safely zap it.

Warn if zapping the SPTE fails and the current SPTE is still pointing
to the same page table. This should never happen.

There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable
small page. Unaccounting sooner during the list traversal increases
the window of that race. Functionally, it is okay, because accounting
doesn't protect against the iTLB multi-hit bug; it is there purely to
prevent KVM from bouncing a gfn between two page sizes. The only
downside is that a vCPU will end up doing more work in tearing down all
the child SPTEs. This should be a very rare race.

Zapping under the MMU read lock unblocks vCPUs that are waiting for the
MMU read lock. This optimization solves a guest jitter issue on Windows
VMs that were observing an increase in network latency.
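
Condensed, the new TDP MMU zap helper only zaps when the parent SPTE
still points at the SP's page table, and it zaps with an atomic exchange
so that losing a race is harmless (sketch of the tdp_mmu.c hunk below;
"iter" is a tdp_iter seeded from sp->ptep, sp->gfn and sp->role.level):

  /* Bail if something else already repurposed the parent SPTE. */
  if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
  	return false;

  /* Atomically zap; failure should only mean the SPTE changed. */
  if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
  	WARN_ON_ONCE((tdp_ptep_t)sp->spt ==
  		     spte_to_child_pt(iter.old_spte, iter.level));
  	return false;
  }

  return true;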

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
 arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
 2 files changed, 105 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b074f7bb5cc58..7df1b4ead705b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
 	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
 }
 
+static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
+					     struct kvm_mmu_page *sp)
+{
+	struct kvm_memory_slot *slot = NULL;
+
+	/*
+	 * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
+	 * the test cannot possibly return true.  On the other hand, if any
+	 * memslot has logging enabled, chances are good that all of them do, in
+	 * which case unaccount_nx_huge_page() is much cheaper than zapping the
+	 * page.
+	 *
+	 * If a memslot update is in progress, reading an incorrect value of
+	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
+	 * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
+	 * nonzero, the page will be zapped unnecessarily.  Either way, this only
+	 * affects efficiency in racy situations, and not correctness.
+	 */
+	if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
+		struct kvm_memslots *slots;
+
+		slots = kvm_memslots_for_spte_role(kvm, sp->role);
+		slot = __gfn_to_memslot(slots, sp->gfn);
+		WARN_ON_ONCE(!slot);
+	}
+	return slot && kvm_slot_dirty_track_enabled(slot);
+}
+
 static void kvm_recover_nx_huge_pages(struct kvm *kvm,
-				      enum kvm_mmu_type mmu_type)
+				      const enum kvm_mmu_type mmu_type)
 {
 	unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
+	bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
 	struct list_head *nx_huge_pages;
-	struct kvm_memory_slot *slot;
 	struct kvm_mmu_page *sp;
 	LIST_HEAD(invalid_list);
 	bool flush = false;
@@ -7549,7 +7577,10 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 	nx_huge_pages = &kvm->arch.possible_nx_huge_pages[mmu_type].pages;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
-	write_lock(&kvm->mmu_lock);
+	if (is_tdp_mmu)
+		read_lock(&kvm->mmu_lock);
+	else
+		write_lock(&kvm->mmu_lock);
 
 	/*
 	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
@@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 	rcu_read_lock();
 
 	for ( ; to_zap; --to_zap) {
-		if (list_empty(nx_huge_pages))
+#ifdef CONFIG_X86_64
+		if (is_tdp_mmu)
+			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+#endif
+		if (list_empty(nx_huge_pages)) {
+#ifdef CONFIG_X86_64
+			if (is_tdp_mmu)
+				spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+#endif
 			break;
+		}
 
 		/*
 		 * We use a separate list instead of just using active_mmu_pages
@@ -7575,50 +7615,40 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
 		WARN_ON_ONCE(!sp->role.direct);
 
+		unaccount_nx_huge_page(kvm, sp);
+
+#ifdef CONFIG_X86_64
+		if (is_tdp_mmu)
+			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+#endif
+
 		/*
-		 * Unaccount and do not attempt to recover any NX Huge Pages
-		 * that are being dirty tracked, as they would just be faulted
-		 * back in as 4KiB pages. The NX Huge Pages in this slot will be
-		 * recovered, along with all the other huge pages in the slot,
-		 * when dirty logging is disabled.
-		 *
-		 * Since gfn_to_memslot() is relatively expensive, it helps to
-		 * skip it if it the test cannot possibly return true.  On the
-		 * other hand, if any memslot has logging enabled, chances are
-		 * good that all of them do, in which case unaccount_nx_huge_page()
-		 * is much cheaper than zapping the page.
-		 *
-		 * If a memslot update is in progress, reading an incorrect value
-		 * of kvm->nr_memslots_dirty_logging is not a problem: if it is
-		 * becoming zero, gfn_to_memslot() will be done unnecessarily; if
-		 * it is becoming nonzero, the page will be zapped unnecessarily.
-		 * Either way, this only affects efficiency in racy situations,
-		 * and not correctness.
+		 * Do not attempt to recover any NX Huge Pages that are being
+		 * dirty tracked, as they would just be faulted back in as 4KiB
+		 * pages. The NX Huge Pages in this slot will be recovered,
+		 * along with all the other huge pages in the slot, when dirty
+		 * logging is disabled.
 		 */
-		slot = NULL;
-		if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
-			struct kvm_memslots *slots;
+		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) {
+			if (is_tdp_mmu)
+				flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
+			else
+				kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 
-			slots = kvm_memslots_for_spte_role(kvm, sp->role);
-			slot = __gfn_to_memslot(slots, sp->gfn);
-			WARN_ON_ONCE(!slot);
 		}
 
-		if (slot && kvm_slot_dirty_track_enabled(slot))
-			unaccount_nx_huge_page(kvm, sp);
-		else if (mmu_type == KVM_TDP_MMU)
-			flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
-		else
-			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
 			rcu_read_unlock();
 
-			cond_resched_rwlock_write(&kvm->mmu_lock);
-			flush = false;
+			if (is_tdp_mmu)
+				cond_resched_rwlock_read(&kvm->mmu_lock);
+			else
+				cond_resched_rwlock_write(&kvm->mmu_lock);
 
+			flush = false;
 			rcu_read_lock();
 		}
 	}
@@ -7626,7 +7656,10 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 
 	rcu_read_unlock();
 
-	write_unlock(&kvm->mmu_lock);
+	if (is_tdp_mmu)
+		read_unlock(&kvm->mmu_lock);
+	else
+		write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 19907eb04a9c4..31d921705dee7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -928,21 +928,49 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
 					   struct kvm_mmu_page *sp)
 {
-	u64 old_spte;
+	struct tdp_iter iter = {
+		.old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
+		.sptep = sp->ptep,
+		.level = sp->role.level + 1,
+		.gfn = sp->gfn,
+		.as_id = kvm_mmu_page_as_id(sp),
+	};
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	if (WARN_ON_ONCE(!is_tdp_mmu_page(sp)))
+		return false;
 
 	/*
-	 * This helper intentionally doesn't allow zapping a root shadow page,
-	 * which doesn't have a parent page table and thus no associated entry.
+	 * Root shadow pages don't have a parent page table and thus no
+	 * associated entry, but they can never be possible NX huge pages.
 	 */
 	if (WARN_ON_ONCE(!sp->ptep))
 		return false;
 
-	old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
-	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
+	/*
+	 * Since mmu_lock is held in read mode, it's possible another task has
+	 * already modified the SPTE. Zap the SPTE if and only if the SPTE
+	 * points at the SP's page table, as checking shadow-present isn't
+	 * sufficient, e.g. the SPTE could be replaced by a leaf SPTE, or even
+	 * another SP. Note, spte_to_child_pt() also checks that the SPTE is
+	 * shadow-present, i.e. guards against zapping a frozen SPTE.
+	 */
+	if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
 		return false;
 
-	tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
-			 SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
+	/*
+	 * If a different task modified the SPTE, then it should be impossible
+	 * for the SPTE to still be used for the to-be-zapped SP. Non-leaf
+	 * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
+	 * creating non-leaf SPTEs, and all other bits are immutable for non-
+	 * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
+	 * zapping and replacement.
+	 */
+	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
+		WARN_ON_ONCE((tdp_ptep_t)sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
+		return false;
+	}
 
 	return true;
 }
-- 
2.50.0.727.gbf7dc18ff4-goog


* [PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
                   ` (2 preceding siblings ...)
  2025-07-07 22:47 ` [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock James Houghton
@ 2025-07-07 22:47 ` James Houghton
  2025-07-23 20:38   ` Sean Christopherson
  2025-07-07 22:47 ` [PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance James Houghton
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-07-07 22:47 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vipin Sharma, David Matlack, James Houghton, kvm, linux-kernel

Now that we have separate paths for the TDP MMU, it is trivial to only
grab rcu_read_lock() for the TDP MMU case. We do not need to grab it
for the shadow MMU, as pages are not RCU-freed in that case.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7df1b4ead705b..c8f7dd747d524 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7577,17 +7577,18 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 	nx_huge_pages = &kvm->arch.possible_nx_huge_pages[mmu_type].pages;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
-	if (is_tdp_mmu)
+	if (is_tdp_mmu) {
 		read_lock(&kvm->mmu_lock);
-	else
+		/*
+		 * Zapping TDP MMU shadow pages, including the remote TLB flush,
+		 * must be done under RCU protection, because the pages are
+		 * freed via RCU callback.
+		 */
+		rcu_read_lock();
+	} else {
 		write_lock(&kvm->mmu_lock);
+	}
 
-	/*
-	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
-	 * be done under RCU protection, because the pages are freed via RCU
-	 * callback.
-	 */
-	rcu_read_lock();
 
 	for ( ; to_zap; --to_zap) {
 #ifdef CONFIG_X86_64
@@ -7641,25 +7642,27 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
-			rcu_read_unlock();
 
-			if (is_tdp_mmu)
+			if (is_tdp_mmu) {
+				rcu_read_unlock();
 				cond_resched_rwlock_read(&kvm->mmu_lock);
-			else
+				rcu_read_lock();
+			} else {
 				cond_resched_rwlock_write(&kvm->mmu_lock);
+			}
 
 			flush = false;
-			rcu_read_lock();
 		}
 	}
 	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
 
-	rcu_read_unlock();
 
-	if (is_tdp_mmu)
+	if (is_tdp_mmu) {
+		rcu_read_unlock();
 		read_unlock(&kvm->mmu_lock);
-	else
+	} else {
 		write_unlock(&kvm->mmu_lock);
+	}
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
-- 
2.50.0.727.gbf7dc18ff4-goog


* [PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
                   ` (3 preceding siblings ...)
  2025-07-07 22:47 ` [PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU James Houghton
@ 2025-07-07 22:47 ` James Houghton
  2025-07-23 20:50   ` Sean Christopherson
  2025-07-07 22:47 ` [PATCH v5 6/7] KVM: selftests: Provide extra mmap flags in vm_mem_add() James Houghton
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-07-07 22:47 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vipin Sharma, David Matlack, James Houghton, kvm, linux-kernel

From: David Matlack <dmatlack@google.com>

Introduce a new selftest, execute_perf_test, that uses the
perf_test_util framework to measure the performance of executing code
within a VM. This test is similar to the other perf_test_util-based
tests in that it spins up a variable number of vCPUs and runs them
concurrently, accessing memory.

In order to support execution, extend perf_test_util to populate guest
memory with return instructions rather than random garbage. This way
memory can be executed simply by calling it.

Currently only x86_64 supports execution, but other architectures can be
easily added by providing their own return instruction opcode.
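
The guest-side pattern is tiny; condensed from the memstress change
below:

  #define RETURN_OPCODE 0xC3	/* x86 'ret' */

  /* Write phase: put a 'ret' at the start of each page... */
  *(uint64_t *)addr = RETURN_OPCODE;

  /* ...execute phase: simply call into the page. */
  ((void (*)(void))addr)();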

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/execute_perf_test.c | 199 ++++++++++++++++++
 .../testing/selftests/kvm/include/memstress.h |   4 +
 tools/testing/selftests/kvm/lib/memstress.c   |  25 ++-
 4 files changed, 227 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 38b95998e1e6b..0dc435e944632 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -137,6 +137,7 @@ TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test
 TEST_GEN_PROGS_x86 += access_tracking_perf_test
 TEST_GEN_PROGS_x86 += coalesced_io_test
 TEST_GEN_PROGS_x86 += dirty_log_perf_test
+TEST_GEN_PROGS_x86 += execute_perf_test
 TEST_GEN_PROGS_x86 += guest_memfd_test
 TEST_GEN_PROGS_x86 += hardware_disable_test
 TEST_GEN_PROGS_x86 += memslot_modification_stress_test
diff --git a/tools/testing/selftests/kvm/execute_perf_test.c b/tools/testing/selftests/kvm/execute_perf_test.c
new file mode 100644
index 0000000000000..f7cbfd8184497
--- /dev/null
+++ b/tools/testing/selftests/kvm/execute_perf_test.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <inttypes.h>
+#include <limits.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "memstress.h"
+#include "guest_modes.h"
+#include "ucall_common.h"
+
+/* Global variable used to synchronize all of the vCPU threads. */
+static int iteration;
+
+/* Set to true when vCPU threads should exit. */
+static bool done;
+
+/* The iteration that was last completed by each vCPU. */
+static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
+
+/* Whether to overlap the regions of memory vCPUs access. */
+static bool overlap_memory_access;
+
+struct test_params {
+	/* The backing source for the region of memory. */
+	enum vm_mem_backing_src_type backing_src;
+
+	/* The amount of memory to allocate for each vCPU. */
+	uint64_t vcpu_memory_bytes;
+
+	/* The number of vCPUs to create in the VM. */
+	int nr_vcpus;
+
+	/* The number of execute iterations the test will run. */
+	int iterations;
+};
+
+static void assert_ucall(struct kvm_vcpu *vcpu, uint64_t expected_ucall)
+{
+	struct ucall uc = {};
+
+	TEST_ASSERT(expected_ucall == get_ucall(vcpu, &uc),
+		    "Guest exited unexpectedly (expected ucall %" PRIu64
+		    ", got %" PRIu64 ")",
+		    expected_ucall, uc.cmd);
+}
+
+static bool spin_wait_for_next_iteration(int *current_iteration)
+{
+	int last_iteration = *current_iteration;
+
+	do {
+		if (READ_ONCE(done))
+			return false;
+
+		*current_iteration = READ_ONCE(iteration);
+	} while (last_iteration == *current_iteration);
+
+	return true;
+}
+
+static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
+{
+	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
+	int current_iteration = 0;
+
+	while (spin_wait_for_next_iteration(&current_iteration)) {
+		vcpu_run(vcpu);
+		assert_ucall(vcpu, UCALL_SYNC);
+		vcpu_last_completed_iteration[vcpu->id] = current_iteration;
+	}
+}
+
+static void spin_wait_for_vcpu(struct kvm_vcpu *vcpu, int target_iteration)
+{
+	while (READ_ONCE(vcpu_last_completed_iteration[vcpu->id]) !=
+	       target_iteration) {
+		continue;
+	}
+}
+
+static void run_iteration(struct kvm_vm *vm, const char *description)
+{
+	struct timespec ts_elapsed;
+	struct timespec ts_start;
+	struct kvm_vcpu *vcpu;
+	int next_iteration;
+
+	/* Kick off the vCPUs by incrementing iteration. */
+	next_iteration = ++iteration;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+	/* Wait for all vCPUs to finish the iteration. */
+	list_for_each_entry(vcpu, &vm->vcpus, list)
+		spin_wait_for_vcpu(vcpu, next_iteration);
+
+	ts_elapsed = timespec_elapsed(ts_start);
+	pr_info("%-30s: %ld.%09lds\n",
+		description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+}
+
+static void run_test(enum vm_guest_mode mode, void *arg)
+{
+	struct test_params *params = arg;
+	struct kvm_vm *vm;
+	int i;
+
+	vm = memstress_create_vm(mode, params->nr_vcpus,
+				 params->vcpu_memory_bytes, 1,
+				 params->backing_src, !overlap_memory_access);
+
+	memstress_start_vcpu_threads(params->nr_vcpus, vcpu_thread_main);
+
+	pr_info("\n");
+
+	memstress_set_write_percent(vm, 100);
+	run_iteration(vm, "Populating memory");
+
+	run_iteration(vm, "Writing to memory");
+
+	memstress_set_execute(vm, true);
+	for (i = 0; i < params->iterations; ++i)
+		run_iteration(vm, "Executing from memory");
+
+	/* Set done to signal the vCPU threads to exit */
+	done = true;
+
+	memstress_join_vcpu_threads(params->nr_vcpus);
+	memstress_destroy_vm(vm);
+}
+
+static void help(char *name)
+{
+	puts("");
+	printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v nr_vcpus] [-o] "
+	       "[-s mem_type] [-i iterations]\n",
+	       name);
+	puts("");
+	printf(" -h: Display this help message.");
+	guest_modes_help();
+	printf(" -b: specify the size of the memory region which should be\n"
+	       "     dirtied by each vCPU. e.g. 10M or 3G.\n"
+	       "     (default: 1G)\n");
+	printf(" -i: specify the number iterations to execute from memory.\n");
+	printf(" -v: specify the number of vCPUs to run.\n");
+	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
+	       "     them into a separate region of memory for each vCPU.\n");
+	backing_src_help("-s");
+	puts("");
+	exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+	struct test_params params = {
+		.backing_src = DEFAULT_VM_MEM_SRC,
+		.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
+		.nr_vcpus = 1,
+		.iterations = 1,
+	};
+	int opt;
+
+	guest_modes_append_default();
+
+	while ((opt = getopt(argc, argv, "hm:b:i:v:os:")) != -1) {
+		switch (opt) {
+		case 'm':
+			guest_modes_cmdline(optarg);
+			break;
+		case 'b':
+			params.vcpu_memory_bytes = parse_size(optarg);
+			break;
+		case 'i':
+			params.iterations = atoi(optarg);
+			break;
+		case 'v':
+			params.nr_vcpus = atoi(optarg);
+			break;
+		case 'o':
+			overlap_memory_access = true;
+			break;
+		case 's':
+			params.backing_src = parse_backing_src_type(optarg);
+			break;
+		case 'h':
+		default:
+			help(argv[0]);
+			break;
+		}
+	}
+
+	for_each_guest_mode(run_test, &params);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index 9071eb6dea60a..ab2a0c05e3fd2 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -50,6 +50,9 @@ struct memstress_args {
  	/* Test is done, stop running vCPUs. */
  	bool stop_vcpus;
 
+	/* If vCPUs should execute from memory. */
+	bool execute;
+
 	struct memstress_vcpu_args vcpu_args[KVM_MAX_VCPUS];
 };
 
@@ -63,6 +66,7 @@ void memstress_destroy_vm(struct kvm_vm *vm);
 
 void memstress_set_write_percent(struct kvm_vm *vm, uint32_t write_percent);
 void memstress_set_random_access(struct kvm_vm *vm, bool random_access);
+void memstress_set_execute(struct kvm_vm *vm, bool execute);
 
 void memstress_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct memstress_vcpu_args *));
 void memstress_join_vcpu_threads(int vcpus);
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index 313277486a1de..49677742ec92d 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -40,6 +40,16 @@ static bool all_vcpu_threads_running;
 
 static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 
+/*
+ * When writing to guest memory, write the opcode for the `ret` instruction so
+ * that subsequent iterations can exercise instruction fetch by calling the
+ * memory.
+ *
+ * NOTE: Non-x86 architectures would need to use a different value here
+ * to support execution.
+ */
+#define RETURN_OPCODE 0xC3
+
 /*
  * Continuously write to the first 8 bytes of each page in the
  * specified region.
@@ -75,8 +85,10 @@ void memstress_guest_code(uint32_t vcpu_idx)
 
 			addr = gva + (page * args->guest_page_size);
 
-			if (__guest_random_bool(&rand_state, args->write_percent))
-				*(uint64_t *)addr = 0x0123456789ABCDEF;
+			if (args->execute)
+				((void (*)(void)) addr)();
+			else if (__guest_random_bool(&rand_state, args->write_percent))
+				*(uint64_t *)addr = RETURN_OPCODE;
 			else
 				READ_ONCE(*(uint64_t *)addr);
 		}
@@ -259,6 +271,15 @@ void __weak memstress_setup_nested(struct kvm_vm *vm, int nr_vcpus, struct kvm_v
 	exit(KSFT_SKIP);
 }
 
+void memstress_set_execute(struct kvm_vm *vm, bool execute)
+{
+#ifndef __x86_64__
+	TEST_FAIL("Execute not supported on this architecture; see RETURN_OPCODE.");
+#endif
+	memstress_args.execute = execute;
+	sync_global_to_guest(vm, memstress_args);
+}
+
 static void *vcpu_thread_main(void *data)
 {
 	struct vcpu_thread *vcpu = data;
-- 
2.50.0.727.gbf7dc18ff4-goog


* [PATCH v5 6/7] KVM: selftests: Provide extra mmap flags in vm_mem_add()
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
                   ` (4 preceding siblings ...)
  2025-07-07 22:47 ` [PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance James Houghton
@ 2025-07-07 22:47 ` James Houghton
  2025-07-07 22:47 ` [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test James Houghton
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: James Houghton @ 2025-07-07 22:47 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vipin Sharma, David Matlack, James Houghton, kvm, linux-kernel

The immediate application here is to allow selftests to pass
MAP_POPULATE (to measure fault latency without also timing the initial
allocation of guest memory).
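
For example, the new test in patch 7 passes MAP_POPULATE through the new
argument (condensed from that patch):

  /*
   * MAP_POPULATE keeps GUP time out of the measured fault latency in
   * nx_huge_pages_perf_test.
   */
  vm_mem_add(vm, backing_src, gpa, 1, guest_num_pages, 0, -1, 0,
  	   MAP_POPULATE);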

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util.h    |  3 ++-
 tools/testing/selftests/kvm/lib/kvm_util.c        | 15 +++++++++------
 .../kvm/x86/private_mem_conversions_test.c        |  2 +-
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index bee65ca087217..4aafd5bf786e2 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -629,7 +629,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	uint32_t flags);
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-		uint32_t flags, int guest_memfd_fd, uint64_t guest_memfd_offset);
+		uint32_t flags, int guest_memfd_fd, uint64_t guest_memfd_offset,
+		int extra_mmap_flags);
 
 #ifndef vm_arch_has_protected_memory
 static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index a055343a7bf75..8157a0fd7f8b3 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -977,13 +977,15 @@ void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags
 /* FIXME: This thing needs to be ripped apart and rewritten. */
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-		uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset)
+		uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset,
+		int extra_mmap_flags)
 {
 	int ret;
 	struct userspace_mem_region *region;
 	size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
 	size_t mem_size = npages * vm->page_size;
 	size_t alignment;
+	int mmap_flags;
 
 	TEST_REQUIRE_SET_USER_MEMORY_REGION2();
 
@@ -1066,9 +1068,11 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		region->fd = kvm_memfd_alloc(region->mmap_size,
 					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
 
+	mmap_flags = vm_mem_backing_src_alias(src_type)->flag |
+		     extra_mmap_flags;
+
 	region->mmap_start = mmap(NULL, region->mmap_size,
-				  PROT_READ | PROT_WRITE,
-				  vm_mem_backing_src_alias(src_type)->flag,
+				  PROT_READ | PROT_WRITE, mmap_flags,
 				  region->fd, 0);
 	TEST_ASSERT(region->mmap_start != MAP_FAILED,
 		    __KVM_SYSCALL_ERROR("mmap()", (int)(unsigned long)MAP_FAILED));
@@ -1143,8 +1147,7 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 	/* If shared memory, create an alias. */
 	if (region->fd >= 0) {
 		region->mmap_alias = mmap(NULL, region->mmap_size,
-					  PROT_READ | PROT_WRITE,
-					  vm_mem_backing_src_alias(src_type)->flag,
+					  PROT_READ | PROT_WRITE, mmap_flags,
 					  region->fd, 0);
 		TEST_ASSERT(region->mmap_alias != MAP_FAILED,
 			    __KVM_SYSCALL_ERROR("mmap()",  (int)(unsigned long)MAP_FAILED));
@@ -1159,7 +1162,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 				 uint64_t guest_paddr, uint32_t slot,
 				 uint64_t npages, uint32_t flags)
 {
-	vm_mem_add(vm, src_type, guest_paddr, slot, npages, flags, -1, 0);
+	vm_mem_add(vm, src_type, guest_paddr, slot, npages, flags, -1, 0, 0);
 }
 
 /*
diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index 82a8d88b5338e..637e9e57fce46 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -399,7 +399,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 	for (i = 0; i < nr_memslots; i++)
 		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
 			   BASE_DATA_SLOT + i, slot_size / vm->page_size,
-			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i);
+			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, 0);
 
 	for (i = 0; i < nr_vcpus; i++) {
 		uint64_t gpa =  BASE_DATA_GPA + i * per_cpu_size;
-- 
2.50.0.727.gbf7dc18ff4-goog


* [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
                   ` (5 preceding siblings ...)
  2025-07-07 22:47 ` [PATCH v5 6/7] KVM: selftests: Provide extra mmap flags in vm_mem_add() James Houghton
@ 2025-07-07 22:47 ` James Houghton
  2025-07-23 21:04   ` Sean Christopherson
  2025-07-23 20:44 ` [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock Sean Christopherson
  2025-08-19 23:12 ` Sean Christopherson
  8 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-07-07 22:47 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vipin Sharma, David Matlack, James Houghton, kvm, linux-kernel

Add a test that checks how much NX huge page recovery affects vCPUs that
are faulting on pages not undergoing NX huge page recovery. To do this,
the test uses a single vCPU to touch all of guest memory, switching
between writing and executing after every 1GiB of guest memory. Only the
writes are timed.

With this setup, while the guest is in the middle of writing to a 1GiB
region, NX huge page recovery (provided it is configured aggressively
enough) will start to recover huge pages in the previous 1GiB region.
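
The alternation is keyed off bit 18 of the 4KiB page index, which flips
once per 1GiB of guest memory (condensed from the guest code below):

  if (page & (1UL << 18)) {		/* odd 1GiB regions: timed writes */
  	tsc1 = rdtsc();
  	*gva = 0;
  	tsc2 = rdtsc();
  	if (tsc2 - tsc1 > max_cycles)
  		max_cycles = tsc2 - tsc1;
  } else {				/* even 1GiB regions: untimed execute */
  	*gva = RETURN_OPCODE;
  	((void (*)(void))gva)();
  }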

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../kvm/x86/nx_huge_pages_perf_test.c         | 223 ++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86/nx_huge_pages_perf_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 0dc435e944632..4b5be9f0bac5b 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -88,6 +88,7 @@ TEST_GEN_PROGS_x86 += x86/kvm_buslock_test
 TEST_GEN_PROGS_x86 += x86/monitor_mwait_test
 TEST_GEN_PROGS_x86 += x86/nested_emulation_test
 TEST_GEN_PROGS_x86 += x86/nested_exceptions_test
+TEST_GEN_PROGS_x86 += x86/nx_huge_pages_perf_test
 TEST_GEN_PROGS_x86 += x86/platform_info_test
 TEST_GEN_PROGS_x86 += x86/pmu_counters_test
 TEST_GEN_PROGS_x86 += x86/pmu_event_filter_test
diff --git a/tools/testing/selftests/kvm/x86/nx_huge_pages_perf_test.c b/tools/testing/selftests/kvm/x86/nx_huge_pages_perf_test.c
new file mode 100644
index 0000000000000..e33e913ec7dfa
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/nx_huge_pages_perf_test.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * nx_huge_pages_perf_test
+ *
+ * Copyright (C) 2025, Google LLC.
+ *
+ * Performance test for NX hugepage recovery.
+ *
+ * This test checks for long faults on allocated pages when NX huge page
+ * recovery is taking place on pages mapped by the VM.
+ */
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+
+#include "test_util.h"
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "ucall_common.h"
+
+/* Default guest test virtual memory offset */
+#define DEFAULT_GUEST_TEST_MEM		0xc0000000
+
+/* Default size (2GB) of the memory for testing */
+#define DEFAULT_TEST_MEM_SIZE		(2 << 30)
+
+/*
+ * Guest virtual memory offset of the testing memory slot.
+ * Must not conflict with identity mapped test code.
+ */
+static uint64_t guest_test_virt_mem = DEFAULT_GUEST_TEST_MEM;
+
+static struct kvm_vcpu *vcpu;
+
+struct test_params {
+	enum vm_mem_backing_src_type backing_src;
+	uint64_t memory_bytes;
+};
+
+struct guest_args {
+	uint64_t guest_page_size;
+	uint64_t pages;
+};
+
+static struct guest_args guest_args;
+
+#define RETURN_OPCODE 0xC3
+
+static void guest_code(int vcpu_idx)
+{
+	struct guest_args *args = &guest_args;
+	uint64_t page_size = args->guest_page_size;
+	uint64_t max_cycles = 0UL;
+	volatile char *gva;
+	uint64_t page;
+
+
+	for (page = 0; page < args->pages; ++page) {
+		gva = (volatile char *)guest_test_virt_mem + page * page_size;
+
+		/*
+		 * To time the jitter on all faults on pages that are not
+		 * undergoing nx huge page recovery, only execute on every
+		 * other 1G region, and only time the non-executing pass.
+		 */
+		if (page & (1UL << 18)) {
+			uint64_t tsc1, tsc2;
+
+			tsc1 = rdtsc();
+			*gva = 0;
+			tsc2 = rdtsc();
+
+			if (tsc2 - tsc1 > max_cycles)
+				max_cycles = tsc2 - tsc1;
+		} else {
+			*gva = RETURN_OPCODE;
+			((void (*)(void)) gva)();
+		}
+	}
+
+	GUEST_SYNC1(max_cycles);
+}
+
+struct kvm_vm *create_vm(uint64_t memory_bytes,
+			 enum vm_mem_backing_src_type backing_src)
+{
+	uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src);
+	struct guest_args *args = &guest_args;
+	uint64_t guest_num_pages;
+	uint64_t region_end_gfn;
+	uint64_t gpa, size;
+	struct kvm_vm *vm;
+
+	args->guest_page_size = getpagesize();
+
+	guest_num_pages = vm_adjust_num_guest_pages(VM_MODE_DEFAULT,
+				memory_bytes / args->guest_page_size);
+
+	TEST_ASSERT(memory_bytes % getpagesize() == 0,
+		    "Guest memory size is not host page size aligned.");
+
+	vm = __vm_create_with_one_vcpu(&vcpu, guest_num_pages, guest_code);
+
+	/* Put the test region at the top of guest physical memory. */
+	region_end_gfn = vm->max_gfn + 1;
+
+	/*
+	 * If there should be more memory in the guest test region than there
+	 * can be pages in the guest, it will definitely cause problems.
+	 */
+	TEST_ASSERT(guest_num_pages < region_end_gfn,
+		    "Requested more guest memory than address space allows.\n"
+		    "    guest pages: %" PRIx64 " max gfn: %" PRIx64
+		    " wss: %" PRIx64 "]",
+		    guest_num_pages, region_end_gfn - 1, memory_bytes);
+
+	gpa = (region_end_gfn - guest_num_pages - 1) * args->guest_page_size;
+	gpa = align_down(gpa, backing_src_pagesz);
+
+	size = guest_num_pages * args->guest_page_size;
+	pr_info("guest physical test memory: [0x%lx, 0x%lx)\n",
+		gpa, gpa + size);
+
+	/*
+	 * Pass in MAP_POPULATE, because we are trying to test how long
+	 * we have to wait for a pending NX huge page recovery to take.
+	 * We do not want to also wait for GUP itself.
+	 */
+	vm_mem_add(vm, backing_src, gpa, 1,
+		   guest_num_pages, 0, -1, 0, MAP_POPULATE);
+
+	virt_map(vm, guest_test_virt_mem, gpa, guest_num_pages);
+
+	args->pages = guest_num_pages;
+
+	/* Export the shared variables to the guest. */
+	sync_global_to_guest(vm, guest_args);
+
+	return vm;
+}
+
+static void run_vcpu(struct kvm_vcpu *vcpu)
+{
+	struct timespec ts_elapsed;
+	struct timespec ts_start;
+	struct ucall uc = {};
+	int ret;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+	ret = _vcpu_run(vcpu);
+
+	ts_elapsed = timespec_elapsed(ts_start);
+
+	TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
+
+	TEST_ASSERT(get_ucall(vcpu, &uc) == UCALL_SYNC,
+		    "Invalid guest sync status: %" PRIu64, uc.cmd);
+
+	pr_info("Duration: %ld.%09lds\n",
+		ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+	pr_info("Max fault latency: %" PRIu64 " cycles\n", uc.args[0]);
+}
+
+static void run_test(struct test_params *params)
+{
+	/*
+	 * The fault + execute pattern in the guest relies on having more than
+	 * 1GiB to use.
+	 */
+	TEST_ASSERT(params->memory_bytes > PAGE_SIZE << 18,
+		    "Must use more than 1GiB of memory.");
+
+	create_vm(params->memory_bytes, params->backing_src);
+
+	pr_info("\n");
+
+	run_vcpu(vcpu);
+}
+
+static void help(char *name)
+{
+	puts("");
+	printf("usage: %s [-h] [-b bytes] [-s mem_type]\n",
+	       name);
+	puts("");
+	printf(" -h: Display this help message.\n");
+	printf(" -b: specify the size of the memory region which should be\n"
+	       "     dirtied by the guest. e.g. 2048M or 3G.\n"
+	       "     (default: 2G, must be greater than 1G)\n");
+	backing_src_help("-s");
+	puts("");
+	exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+	struct test_params params = {
+		.backing_src = DEFAULT_VM_MEM_SRC,
+		.memory_bytes = DEFAULT_TEST_MEM_SIZE,
+	};
+	int opt;
+
+	while ((opt = getopt(argc, argv, "hb:s:")) != -1) {
+		switch (opt) {
+		case 'b':
+			params.memory_bytes = parse_size(optarg);
+			break;
+		case 's':
+			params.backing_src = parse_backing_src_type(optarg);
+			break;
+		case 'h':
+		default:
+			help(argv[0]);
+			break;
+		}
+	}
+
+	run_test(&params);
+}
-- 
2.50.0.727.gbf7dc18ff4-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-07-07 22:47 ` [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock James Houghton
@ 2025-07-23 20:34   ` Sean Christopherson
  2025-07-28 18:07     ` James Houghton
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-07-23 20:34 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Mon, Jul 07, 2025, James Houghton wrote:
> From: Vipin Sharma <vipinsh@google.com>
> 
> Use MMU read lock to recover TDP MMU NX huge pages. Iterate

Wrap at ~75 chars.

> over the huge pages list under tdp_mmu_pages_lock protection and
> unaccount the page before dropping the lock.
> 
> We must not zap an SPTE if:

No pronouns!

> - The SPTE is a root page.
> - The SPTE does not point at the SP's page table.
> 
> If the SPTE does not point at the SP's page table, then something else
> has change the SPTE, so we cannot safely zap it.
> 
> Warn if zapping SPTE fails and current SPTE is still pointing to same
> page table. This should never happen.
> 
> There is always a race between dirty logging, vCPU faults, and NX huge
> page recovery for backing a gfn by an NX huge page or an executable
> small page. Unaccounting sooner during the list traversal is increasing
> the window of that race. Functionally, it is okay, because accounting
> doesn't protect against iTLB multi-hit bug, it is there purely to
> prevent KVM from bouncing a gfn between two page sizes. The only
> downside is that a vCPU will end up doing more work in tearing down all
> the child SPTEs. This should be a very rare race.
> 
> Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> read lock. This optimizaion is done to solve a guest jitter issue on
> Windows VM which was observing an increase in network latency.

With slight tweaking:

Use MMU read lock to recover TDP MMU NX huge pages.  To prevent
concurrent modification of the list of potential huge pages, iterate over
the list under tdp_mmu_pages_lock protection and unaccount the page
before dropping the lock.

Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
read lock, which solves a guest jitter issue on Windows VMs which were
observing an increase in network latency.

Do not zap an SPTE if:
- The SPTE is a root page.
- The SPTE does not point at the SP's page table.

If the SPTE does not point at the SP's page table, then something else
has change the SPTE, so KVM cannot safely zap it.

Warn if zapping SPTE fails and current SPTE is still pointing to same
page table, as it should be impossible for the CMPXCHG to fail due to all
other write scenarios being mutually exclusive.

There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable
small page.  Unaccounting sooner during the list traversal increases the
window of that race, but functionally, it is okay.  Accounting doesn't
protect against iTLB multi-hit bug, it is there purely to prevent KVM
from bouncing a gfn between two page sizes. The only  downside is that a
vCPU will end up doing more work in tearing down all  the child SPTEs.
This should be a very rare race.

> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> Co-developed-by: James Houghton <jthoughton@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
>  arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
>  2 files changed, 105 insertions(+), 44 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b074f7bb5cc58..7df1b4ead705b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
>  	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
>  }
>  
> +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> +					     struct kvm_mmu_page *sp)
> +{
> +	struct kvm_memory_slot *slot = NULL;
> +
> +	/*
> +	 * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
> +	 * it the test cannot possibly return true.  On the other hand, if any
> +	 * memslot has logging enabled, chances are good that all of them do, in
> +	 * which case unaccount_nx_huge_page() is much cheaper than zapping the
> +	 * page.

And largely irrelevant, because KVM should unaccount the NX no matter what.  I
kinda get what you're saying, but honestly it adds a lot of confusion, especially
since unaccount_nx_huge_page() is in the caller.

> +	 *
> +	 * If a memslot update is in progress, reading an incorrect value of
> +	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> +	 * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
> +	 * nonzero, the page will be zapped unnecessarily.  Either way, this only
> +	 * affects efficiency in racy situations, and not correctness.
> +	 */
> +	if (atomic_read(&kvm->nr_memslots_dirty_logging)) {

Short-circuit the function to decrease indentation, and so that "slot" doesn't
need to be NULL-initialized.

> +		struct kvm_memslots *slots;
> +
> +		slots = kvm_memslots_for_spte_role(kvm, sp->role);
> +		slot = __gfn_to_memslot(slots, sp->gfn);

Then this can be:

	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);

without creating a stupid-long line.

> +		WARN_ON_ONCE(!slot);

And then:

	if (WARN_ON_ONCE(!slot))
		return false;

	return kvm_slot_dirty_track_enabled(slot);

With a comment cleanup:

	struct kvm_memory_slot *slot;

	/*
	 * Skip the memslot lookup if dirty tracking can't possibly be enabled,
	 * as memslot lookups are relatively expensive.
	 *
	 * If a memslot update is in progress, reading an incorrect value of
	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
	 * zero, KVM will  do an unnecessary memslot lookup;  if it is becoming
	 * nonzero, the page will be zapped unnecessarily.  Either way, this
	 * only affects efficiency in racy situations, and not correctness.
	 */
	if (!atomic_read(&kvm->nr_memslots_dirty_logging))
		return false;

	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
	if (WARN_ON_ONCE(!slot))
		return false;

	return kvm_slot_dirty_track_enabled(slot);
> @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>  	rcu_read_lock();
>  
>  	for ( ; to_zap; --to_zap) {
> -		if (list_empty(nx_huge_pages))
> +#ifdef CONFIG_X86_64

These #ifdefs still make me sad, but I also still think they're the least awful
solution.  And hopefully we will jettison 32-bit sooner than later :-)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU
  2025-07-07 22:47 ` [PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU James Houghton
@ 2025-07-23 20:38   ` Sean Christopherson
  2025-07-28 17:51     ` James Houghton
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-07-23 20:38 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Mon, Jul 07, 2025, James Houghton wrote:
> Now that we have separate paths for the TDP MMU, it is trivial to only
> grab rcu_read_lock() for the TDP MMU case.

Yeah, but it's also a largely pointless change.  For the overwhelming majority of
deployments, rcu_read_{un}lock() does literally nothing.  And when it does do
something, the cost is a single atomic.

I'm leaning quite strongly toward skipping this patch, as I find the code to be
much more readable if KVM grabs RCU unconditionally.

> We do not need to grab it for the shadow MMU, as pages are not RCU-freed in
> that case.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
                   ` (6 preceding siblings ...)
  2025-07-07 22:47 ` [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test James Houghton
@ 2025-07-23 20:44 ` Sean Christopherson
  2025-07-29  0:19   ` James Houghton
  2025-08-19 23:12 ` Sean Christopherson
  8 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-07-23 20:44 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Mon, Jul 07, 2025, James Houghton wrote:
> David Matlack (1):
>   KVM: selftests: Introduce a selftest to measure execution performance
> 
> James Houghton (3):
>   KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU
>   KVM: selftests: Provide extra mmap flags in vm_mem_add()
>   KVM: selftests: Add an NX huge pages jitter test
> 
> Vipin Sharma (3):
>   KVM: x86/mmu: Track TDP MMU NX huge pages separately
>   KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its
>     purpose
>   KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

The KVM changes look good, no need for a v5 on that front (I'll do minor fixup
when applying, which will be a few weeks from now, after 6.17-rc1).  Still
working through the selftests.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance
  2025-07-07 22:47 ` [PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance James Houghton
@ 2025-07-23 20:50   ` Sean Christopherson
  2025-07-29  0:18     ` James Houghton
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-07-23 20:50 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Mon, Jul 07, 2025, James Houghton wrote:
> From: David Matlack <dmatlack@google.com>
> 
> Introduce a new selftest, execute_perf_test, that uses the
> perf_test_util framework to measure the performance of executing code
> within a VM. This test is similar to the other perf_test_util-based
> tests in that it spins up a variable number of vCPUs and runs them
> concurrently, accessing memory.
> 
> In order to support execution, extend perf_test_util to populate guest
> memory with return instructions rather than random garbage. This way
> memory can be execute simply by calling it.
> 
> Currently only x86_64 supports execution, but other architectures can be
> easily added by providing their return code instruction.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  tools/testing/selftests/kvm/Makefile.kvm      |   1 +
>  .../testing/selftests/kvm/execute_perf_test.c | 199 ++++++++++++++++++

Honest question, is there really no way to dedup memstress tests?  This seems
like an insane amount of code just to call memstress_set_execute().

>  .../testing/selftests/kvm/include/memstress.h |   4 +
>  tools/testing/selftests/kvm/lib/memstress.c   |  25 ++-
>  4 files changed, 227 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c
> 
> diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> index 38b95998e1e6b..0dc435e944632 100644
> --- a/tools/testing/selftests/kvm/Makefile.kvm
> +++ b/tools/testing/selftests/kvm/Makefile.kvm
> @@ -137,6 +137,7 @@ TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test
>  TEST_GEN_PROGS_x86 += access_tracking_perf_test
>  TEST_GEN_PROGS_x86 += coalesced_io_test
>  TEST_GEN_PROGS_x86 += dirty_log_perf_test
> +TEST_GEN_PROGS_x86 += execute_perf_test

How about call_ret_perf_test instead of execute_perf_test?  I like that "execute"
aligns with "read" and "write", but as a test name it ends up being quite ambiguous.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test
  2025-07-07 22:47 ` [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test James Houghton
@ 2025-07-23 21:04   ` Sean Christopherson
  2025-07-28 18:40     ` James Houghton
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-07-23 21:04 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Mon, Jul 07, 2025, James Houghton wrote:
> +		/*
> +		 * To time the jitter on all faults on pages that are not
> +		 * undergoing nx huge page recovery, only execute on every
> +		 * other 1G region, and only time the non-executing pass.
> +		 */
> +		if (page & (1UL << 18)) {

This needs a #define or helper, I have no idea what 1 << 18 is doing.

> +			uint64_t tsc1, tsc2;
> +
> +			tsc1 = rdtsc();
> +			*gva = 0;
> +			tsc2 = rdtsc();
> +
> +			if (tsc2 - tsc1 > max_cycles)
> +				max_cycles = tsc2 - tsc1;
> +		} else {
> +			*gva = RETURN_OPCODE;
> +			((void (*)(void)) gva)();
> +		}
> +	}
> +
> +	GUEST_SYNC1(max_cycles);
> +}
> +
> +struct kvm_vm *create_vm(uint64_t memory_bytes,
> +			 enum vm_mem_backing_src_type backing_src)
> +{
> +	uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src);
> +	struct guest_args *args = &guest_args;
> +	uint64_t guest_num_pages;
> +	uint64_t region_end_gfn;
> +	uint64_t gpa, size;
> +	struct kvm_vm *vm;
> +
> +	args->guest_page_size = getpagesize();
> +
> +	guest_num_pages = vm_adjust_num_guest_pages(VM_MODE_DEFAULT,
> +				memory_bytes / args->guest_page_size);
> +
> +	TEST_ASSERT(memory_bytes % getpagesize() == 0,
> +		    "Guest memory size is not host page size aligned.");
> +
> +	vm = __vm_create_with_one_vcpu(&vcpu, guest_num_pages, guest_code);
> +
> +	/* Put the test region at the top guest physical memory. */
> +	region_end_gfn = vm->max_gfn + 1;
> +
> +	/*
> +	 * Requesting more memory for the test region than the guest's physical
> +	 * address space can hold is guaranteed to cause problems.
> +	 */
> +	TEST_ASSERT(guest_num_pages < region_end_gfn,
> +		    "Requested more guest memory than address space allows.\n"
> +		    "    guest pages: %" PRIx64 " max gfn: %" PRIx64
> +		    " wss: %" PRIx64 "]",

Don't wrap this last one.

> +		    guest_num_pages, region_end_gfn - 1, memory_bytes);
> +
> +	gpa = (region_end_gfn - guest_num_pages - 1) * args->guest_page_size;
> +	gpa = align_down(gpa, backing_src_pagesz);
> +
> +	size = guest_num_pages * args->guest_page_size;
> +	pr_info("guest physical test memory: [0x%lx, 0x%lx)\n",
> +		gpa, gpa + size);

And don't wrap here either (82 chars is totally fine).

> +
> +	/*
> +	 * Pass in MAP_POPULATE, because we are trying to test how long
> +	 * we have to wait for a pending NX huge page recovery.
> +	 * We do not want to also wait for GUP itself.
> +	 */

Right, but we also don't want to wait for the initial fault-in either, no?  I.e.
plumbing in MAP_POPULATE only fixes the worst of the delay, and maybe only with
the TDP MMU enabled.

> In other words, it seems like we need a helper (option?) to explicitly "prefault"
all memory from within the guest, not the ability to specify MAP_POPULATE.
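
Something like this, purely as a sketch (guest_prefault_memory() is not an
existing helper; the name and shape are just illustrative):

	static void guest_prefault_memory(uint64_t base_gva, uint64_t nr_pages,
					  uint64_t page_size)
	{
		volatile char *gva;
		uint64_t page;

		/* Touch every page once so the stage-2 mappings get populated. */
		for (page = 0; page < nr_pages; page++) {
			gva = (volatile char *)base_gva + page * page_size;
			(void)*gva;
		}
	}
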

> +	vm_mem_add(vm, backing_src, gpa, 1,
> +		   guest_num_pages, 0, -1, 0, MAP_POPULATE);
> +
> +	virt_map(vm, guest_test_virt_mem, gpa, guest_num_pages);
> +
> +	args->pages = guest_num_pages;
> +
> +	/* Export the shared variables to the guest. */
> +	sync_global_to_guest(vm, guest_args);
> +
> +	return vm;
> +}
> +
> +static void run_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	struct timespec ts_elapsed;
> +	struct timespec ts_start;
> +	struct ucall uc = {};
> +	int ret;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &ts_start);
> +
> +	ret = _vcpu_run(vcpu);
> +
> +	ts_elapsed = timespec_elapsed(ts_start);
> +
> +	TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
> +
> +	TEST_ASSERT(get_ucall(vcpu, &uc) == UCALL_SYNC,
> +		    "Invalid guest sync status: %" PRIu64, uc.cmd);
> +
> +	pr_info("Duration: %ld.%09lds\n",
> +		ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> +	pr_info("Max fault latency: %" PRIu64 " cycles\n", uc.args[0]);
> +}
> +
> +static void run_test(struct test_params *params)
> +{
> +	/*
> +	 * The fault + execute pattern in the guest relies on having more than
> +	 * 1GiB to use.
> +	 */
> +	TEST_ASSERT(params->memory_bytes > PAGE_SIZE << 18,

> Oooh, the 1 << 18 is 1GiB in PFNs.  Ugh.  Just use SZ_1G here.  And assert immediately
after setting params.memory_bytes, don't wait until the test runs.

> +		    "Must use more than 1GiB of memory.");
> +
> +	create_vm(params->memory_bytes, params->backing_src);
> +
> +	pr_info("\n");
> +
> +	run_vcpu(vcpu);
> +}
> +
> +static void help(char *name)
> +{
> +	puts("");
> +	printf("usage: %s [-h] [-b bytes] [-s mem_type]\n",
> +	       name);
> +	puts("");
> +	printf(" -h: Display this help message.");
> +	printf(" -b: specify the size of the memory region which should be\n"
> +	       "     dirtied by the guest. e.g. 2048M or 3G.\n"
> +	       "     (default: 2G, must be greater than 1G)\n");
> +	backing_src_help("-s");
> +	puts("");
> +	exit(0);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	struct test_params params = {
> +		.backing_src = DEFAULT_VM_MEM_SRC,
> +		.memory_bytes = DEFAULT_TEST_MEM_SIZE,
> +	};
> +	int opt;
> +
> +	while ((opt = getopt(argc, argv, "hb:s:")) != -1) {
> +		switch (opt) {
> +		case 'b':
> +			params.memory_bytes = parse_size(optarg);
> +			break;
> +		case 's':
> +			params.backing_src = parse_backing_src_type(optarg);
> +			break;
> +		case 'h':
> +		default:
> +			help(argv[0]);
> +			break;
> +		}
> +	}
> +
> +	run_test(&params);
> +}
> -- 
> 2.50.0.727.gbf7dc18ff4-goog
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU
  2025-07-23 20:38   ` Sean Christopherson
@ 2025-07-28 17:51     ` James Houghton
  0 siblings, 0 replies; 29+ messages in thread
From: James Houghton @ 2025-07-28 17:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Wed, Jul 23, 2025 at 1:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > Now that we have separate paths for the TDP MMU, it is trivial to only
> > grab rcu_read_lock() for the TDP MMU case.
>
> Yeah, but it's also a largely pointless change.  For the overwhelming majority of
> deployments, rcu_read_{un}lock() does literally nothing.  And when it does do
> something, the cost is a single atomic.
>
> I'm leaning quite strongly toward skipping this patch, as I find the code to be
> much more readable if KVM grabs RCU unconditionally.

That's fine with me, thanks Sean.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-07-23 20:34   ` Sean Christopherson
@ 2025-07-28 18:07     ` James Houghton
  2025-07-28 18:17       ` David Matlack
  0 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-07-28 18:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > From: Vipin Sharma <vipinsh@google.com>
> >
> > Use MMU read lock to recover TDP MMU NX huge pages. Iterate
>
> Wrap at ~75 chars.

Oh, I did indeed wrap the text pretty aggressively for this patch. Not
sure why that happened.

>
> > over the huge pages list under tdp_mmu_pages_lock protection and
> > unaccount the page before dropping the lock.
> >
> > We must not zap an SPTE if:
>
> No pronouns!

Right.

>
> > - The SPTE is a root page.
> > - The SPTE does not point at the SP's page table.
> >
> > If the SPTE does not point at the SP's page table, then something else
> > has change the SPTE, so we cannot safely zap it.
> >
> > Warn if zapping SPTE fails and current SPTE is still pointing to same
> > page table. This should never happen.
> >
> > There is always a race between dirty logging, vCPU faults, and NX huge
> > page recovery for backing a gfn by an NX huge page or an executable
> > small page. Unaccounting sooner during the list traversal is increasing
> > the window of that race. Functionally, it is okay, because accounting
> > doesn't protect against iTLB multi-hit bug, it is there purely to
> > prevent KVM from bouncing a gfn between two page sizes. The only
> > downside is that a vCPU will end up doing more work in tearing down all
> > the child SPTEs. This should be a very rare race.
> >
> > Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> > read lock. This optimizaion is done to solve a guest jitter issue on
> > Windows VM which was observing an increase in network latency.
>
> With slight tweaking:
>
> Use MMU read lock to recover TDP MMU NX huge pages.  To prevent
> concurrent modification of the list of potential huge pages, iterate over
> the list under tdp_mmu_pages_lock protection and unaccount the page
> before dropping the lock.
>
> Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> read lock, which solves a guest jitter issue on Windows VMs which were
> observing an increase in network latency.
>
> Do not zap an SPTE if:
> - The SPTE is a root page.
> - The SPTE does not point at the SP's page table.
>
> If the SPTE does not point at the SP's page table, then something else
> has change the SPTE, so KVM cannot safely zap it.

"has changed" (my mistake)

>
> Warn if zapping SPTE fails and current SPTE is still pointing to same
> page table, as it should be impossible for the CMPXCHG to fail due to all
> other write scenarios being mutually exclusive.
>
> There is always a race between dirty logging, vCPU faults, and NX huge
> page recovery for backing a gfn by an NX huge page or an executable
> small page.  Unaccounting sooner during the list traversal increases the
> window of that race, but functionally, it is okay.  Accounting doesn't
> protect against iTLB multi-hit bug, it is there purely to prevent KVM
> from bouncing a gfn between two page sizes. The only  downside is that a
> vCPU will end up doing more work in tearing down all  the child SPTEs.
> This should be a very rare race.

Thanks, this is much better.

>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > Co-developed-by: James Houghton <jthoughton@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
> >  arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
> >  2 files changed, 105 insertions(+), 44 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b074f7bb5cc58..7df1b4ead705b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
> >       return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
> >  }
> >
> > +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> > +                                          struct kvm_mmu_page *sp)
> > +{
> > +     struct kvm_memory_slot *slot = NULL;
> > +
> > +     /*
> > +      * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
> > +      * it the test cannot possibly return true.  On the other hand, if any
> > +      * memslot has logging enabled, chances are good that all of them do, in
> > +      * which case unaccount_nx_huge_page() is much cheaper than zapping the
> > +      * page.
>
> And largely irrelevant, because KVM should unaccount the NX no matter what.  I
> kinda get what you're saying, but honestly it adds a lot of confusion, especially
> since unaccount_nx_huge_page() is in the caller.
>
> > +      *
> > +      * If a memslot update is in progress, reading an incorrect value of
> > +      * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> > +      * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
> > +      * nonzero, the page will be zapped unnecessarily.  Either way, this only
> > +      * affects efficiency in racy situations, and not correctness.
> > +      */
> > +     if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
>
> Short-circuit the function to decrease indentation, and so that "slot" doesn't
> need to be NULL-initialized.
>
> > +             struct kvm_memslots *slots;
> > +
> > +             slots = kvm_memslots_for_spte_role(kvm, sp->role);
> > +             slot = __gfn_to_memslot(slots, sp->gfn);
>
> Then this can be:
>
>         slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
>
> without creating a stupid-long line.
>
> > +             WARN_ON_ONCE(!slot);
>
> And then:
>
>         if (WARN_ON_ONCE(!slot))
>                 return false;
>
>         return kvm_slot_dirty_track_enabled(slot);
>
> With a comment cleanup:
>
>         struct kvm_memory_slot *slot;
>
>         /*
>          * Skip the memslot lookup if dirty tracking can't possibly be enabled,
>          * as memslot lookups are relatively expensive.
>          *
>          * If a memslot update is in progress, reading an incorrect value of
>          * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
>          * zero, KVM will  do an unnecessary memslot lookup;  if it is becoming
>          * nonzero, the page will be zapped unnecessarily.  Either way, this
>          * only affects efficiency in racy situations, and not correctness.
>          */
>         if (!atomic_read(&kvm->nr_memslots_dirty_logging))
>                 return false;
>
>         slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
>         if (WARN_ON_ONCE(!slot))
>                 return false;
>
>         return kvm_slot_dirty_track_enabled(slot);

LGTM, thanks!

> > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >       rcu_read_lock();
> >
> >       for ( ; to_zap; --to_zap) {
> > -             if (list_empty(nx_huge_pages))
> > +#ifdef CONFIG_X86_64
>
> These #ifdefs still make me sad, but I also still think they're the least awful
> solution.  And hopefully we will jettison 32-bit sooner than later :-)

Yeah I couldn't come up with anything better. :(

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-07-28 18:07     ` James Houghton
@ 2025-07-28 18:17       ` David Matlack
  2025-07-28 21:38         ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: David Matlack @ 2025-07-28 18:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Sean Christopherson, Paolo Bonzini, Vipin Sharma, kvm,
	linux-kernel

On Mon, Jul 28, 2025 at 11:08 AM James Houghton <jthoughton@google.com> wrote:
> On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > >       rcu_read_lock();
> > >
> > >       for ( ; to_zap; --to_zap) {
> > > -             if (list_empty(nx_huge_pages))
> > > +#ifdef CONFIG_X86_64
> >
> > These #ifdefs still make me sad, but I also still think they're the least awful
> > solution.  And hopefully we will jettison 32-bit sooner than later :-)
>
> Yeah I couldn't come up with anything better. :(

Could we just move the definition of tdp_mmu_pages_lock outside of
CONFIG_X86_64? The only downside I can think of is slightly larger kvm
structs for 32-bit builds.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test
  2025-07-23 21:04   ` Sean Christopherson
@ 2025-07-28 18:40     ` James Houghton
  2025-08-01 14:11       ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-07-28 18:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Wed, Jul 23, 2025 at 2:04 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > +             /*
> > +              * To time the jitter on all faults on pages that are not
> > +              * undergoing nx huge page recovery, only execute on every
> > +              * other 1G region, and only time the non-executing pass.
> > +              */
> > +             if (page & (1UL << 18)) {
>
> This needs a #define or helper, I have no idea what 1 << 18 is doing.

I'll use (SZ_1G >> PAGE_SHIFT), thanks. It's just checking if `page`
lies within an odd-numbered 1G region.
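
I.e. something along these lines (a sketch; is_timed_page() is just for
illustration, the final form may differ):

	#define PAGES_PER_1G	(SZ_1G >> PAGE_SHIFT)

	/*
	 * Time accesses only in every other (odd-numbered) 1GiB region; with
	 * 4KiB guest pages this is the same check as the old "1UL << 18".
	 */
	static bool is_timed_page(uint64_t page)
	{
		return page & PAGES_PER_1G;
	}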

>
> > +                     uint64_t tsc1, tsc2;
> > +
> > +                     tsc1 = rdtsc();
> > +                     *gva = 0;
> > +                     tsc2 = rdtsc();
> > +
> > +                     if (tsc2 - tsc1 > max_cycles)
> > +                             max_cycles = tsc2 - tsc1;
> > +             } else {
> > +                     *gva = RETURN_OPCODE;
> > +                     ((void (*)(void)) gva)();
> > +             }
> > +     }
> > +
> > +     GUEST_SYNC1(max_cycles);
> > +}
> > +
> > +struct kvm_vm *create_vm(uint64_t memory_bytes,
> > +                      enum vm_mem_backing_src_type backing_src)
> > +{
> > +     uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src);
> > +     struct guest_args *args = &guest_args;
> > +     uint64_t guest_num_pages;
> > +     uint64_t region_end_gfn;
> > +     uint64_t gpa, size;
> > +     struct kvm_vm *vm;
> > +
> > +     args->guest_page_size = getpagesize();
> > +
> > +     guest_num_pages = vm_adjust_num_guest_pages(VM_MODE_DEFAULT,
> > +                             memory_bytes / args->guest_page_size);
> > +
> > +     TEST_ASSERT(memory_bytes % getpagesize() == 0,
> > +                 "Guest memory size is not host page size aligned.");
> > +
> > +     vm = __vm_create_with_one_vcpu(&vcpu, guest_num_pages, guest_code);
> > +
> > +     /* Put the test region at the top guest physical memory. */
> > +     region_end_gfn = vm->max_gfn + 1;
> > +
> > +     /*
> > +      * Requesting more memory for the test region than the guest's physical
> > +      * address space can hold is guaranteed to cause problems.
> > +      */
> > +     TEST_ASSERT(guest_num_pages < region_end_gfn,
> > +                 "Requested more guest memory than address space allows.\n"
> > +                 "    guest pages: %" PRIx64 " max gfn: %" PRIx64
> > +                 " wss: %" PRIx64 "]",
>
> Don't wrap this last one.
>
> > +                 guest_num_pages, region_end_gfn - 1, memory_bytes);
> > +
> > +     gpa = (region_end_gfn - guest_num_pages - 1) * args->guest_page_size;
> > +     gpa = align_down(gpa, backing_src_pagesz);
> > +
> > +     size = guest_num_pages * args->guest_page_size;
> > +     pr_info("guest physical test memory: [0x%lx, 0x%lx)\n",
> > +             gpa, gpa + size);
>
> And don't wrap here either (82 chars is totally fine).

Right.

>
> > +
> > +     /*
> > +      * Pass in MAP_POPULATE, because we are trying to test how long
> > +      * we have to wait for a pending NX huge page recovery.
> > +      * We do not want to also wait for GUP itself.
> > +      */
>
> Right, but we also don't want to wait for the initial fault-in either, no?  I.e.
> plumbing in MAP_POPULATE only fixes the worst of the delay, and maybe only with
> the TDP MMU enabled.
>
> > In other words, it seems like we need a helper (option?) to explicitly "prefault"
> all memory from within the guest, not the ability to specify MAP_POPULATE.

I don't want the EPT to be populated.

In the event of a hugepage being executed, consider another memory
access. The access can either (1) be in the executed-from hugepage or
(2) be somewhere else.

For (1), the optimization in this series doesn't help; we will often
be stuck behind the hugepage either being destroyed or reconstructed.

For (2), the optimization in this series is an improvement, and that's
what this test is trying to demonstrate. But this is only true if the
EPT does not have a valid mapping for the GPA we tried to use. If it
does, the access will just proceed like normal.

This test only times these "case 2" accesses. Now if we didn't have
MAP_POPULATE, then (non-fast) GUP time appears in these results, which
(IIRC) adds so much noise that the improvement is difficult to
ascertain. But with MAP_POPULATE, the difference is very clear.

This test is 100% contrived to consistently reproduce the memory
access timing anomalies that Vipin demonstrated with his initial
posting of this series[1].

[1]: https://lore.kernel.org/kvm/20240906204515.3276696-3-vipinsh@google.com/

>
> > +     vm_mem_add(vm, backing_src, gpa, 1,
> > +                guest_num_pages, 0, -1, 0, MAP_POPULATE);
> > +
> > +     virt_map(vm, guest_test_virt_mem, gpa, guest_num_pages);
> > +
> > +     args->pages = guest_num_pages;
> > +
> > +     /* Export the shared variables to the guest. */
> > +     sync_global_to_guest(vm, guest_args);
> > +
> > +     return vm;
> > +}
> > +
> > +static void run_vcpu(struct kvm_vcpu *vcpu)
> > +{
> > +     struct timespec ts_elapsed;
> > +     struct timespec ts_start;
> > +     struct ucall uc = {};
> > +     int ret;
> > +
> > +     clock_gettime(CLOCK_MONOTONIC, &ts_start);
> > +
> > +     ret = _vcpu_run(vcpu);
> > +
> > +     ts_elapsed = timespec_elapsed(ts_start);
> > +
> > +     TEST_ASSERT(ret == 0, "vcpu_run failed: %d", ret);
> > +
> > +     TEST_ASSERT(get_ucall(vcpu, &uc) == UCALL_SYNC,
> > +                 "Invalid guest sync status: %" PRIu64, uc.cmd);
> > +
> > +     pr_info("Duration: %ld.%09lds\n",
> > +             ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> > +     pr_info("Max fault latency: %" PRIu64 " cycles\n", uc.args[0]);
> > +}
> > +
> > +static void run_test(struct test_params *params)
> > +{
> > +     /*
> > +      * The fault + execute pattern in the guest relies on having more than
> > +      * 1GiB to use.
> > +      */
> > +     TEST_ASSERT(params->memory_bytes > PAGE_SIZE << 18,
>
> > Oooh, the 1 << 18 is 1GiB in PFNs.  Ugh.  Just use SZ_1G here.  And assert immediately
> after setting params.memory_bytes, don't wait until the test runs.

Will do, thanks.

>
> > +                 "Must use more than 1GiB of memory.");
> > +
> > +     create_vm(params->memory_bytes, params->backing_src);
> > +
> > +     pr_info("\n");
> > +
> > +     run_vcpu(vcpu);
> > +}
> > +
> > +static void help(char *name)
> > +{
> > +     puts("");
> > +     printf("usage: %s [-h] [-b bytes] [-s mem_type]\n",
> > +            name);
> > +     puts("");
> > +     printf(" -h: Display this help message.\n");
> > +     printf(" -b: specify the size of the memory region which should be\n"
> > +            "     dirtied by the guest. e.g. 2048M or 3G.\n"
> > +            "     (default: 2G, must be greater than 1G)\n");
> > +     backing_src_help("-s");
> > +     puts("");
> > +     exit(0);
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +     struct test_params params = {
> > +             .backing_src = DEFAULT_VM_MEM_SRC,
> > +             .memory_bytes = DEFAULT_TEST_MEM_SIZE,
> > +     };
> > +     int opt;
> > +
> > +     while ((opt = getopt(argc, argv, "hb:s:")) != -1) {
> > +             switch (opt) {
> > +             case 'b':
> > +                     params.memory_bytes = parse_size(optarg);
> > +                     break;
> > +             case 's':
> > +                     params.backing_src = parse_backing_src_type(optarg);
> > +                     break;
> > +             case 'h':
> > +             default:
> > +                     help(argv[0]);
> > +                     break;
> > +             }
> > +     }
> > +
> > +     run_test(&params);
> > +}
> > --
> > 2.50.0.727.gbf7dc18ff4-goog
> >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-07-28 18:17       ` David Matlack
@ 2025-07-28 21:38         ` Sean Christopherson
  2025-07-28 21:48           ` James Houghton
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-07-28 21:38 UTC (permalink / raw)
  To: David Matlack
  Cc: James Houghton, Paolo Bonzini, Vipin Sharma, kvm, linux-kernel

On Mon, Jul 28, 2025, David Matlack wrote:
> On Mon, Jul 28, 2025 at 11:08 AM James Houghton <jthoughton@google.com> wrote:
> > On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > >       rcu_read_lock();
> > > >
> > > >       for ( ; to_zap; --to_zap) {
> > > > -             if (list_empty(nx_huge_pages))
> > > > +#ifdef CONFIG_X86_64
> > >
> > > These #ifdefs still make me sad, but I also still think they're the least awful
> > > solution.  And hopefully we will jettison 32-bit sooner than later :-)
> >
> > Yeah I couldn't come up with anything better. :(
> 
> Could we just move the definition of tdp_mmu_pages_lock outside of
> CONFIG_X86_64? The only downside I can think of is slightly larger kvm
> structs for 32-bit builds.

Hmm, I was going to say "no, because we'd also need to do spin_lock_init()", but
obviously spin_(un)lock() will only ever be invoked for 64-bit kernels.  I still
don't love the idea of making tdp_mmu_pages_lock visible outside of CONFIG_X86_64,
it feels like we're just asking to introduce (likely benign) bugs.

Ugh, and I just noticed this as well:

  #ifndef CONFIG_X86_64
  #define KVM_TDP_MMU -1
  #endif

Rather than expose kvm->arch.tdp_mmu_pages_lock, what about using a single #ifdef
section to bury both is_tdp_mmu and a local kvm->arch.tdp_mmu_pages_lock pointer?

Alternatively, we could do:

	const bool is_tdp_mmu = IS_ENABLED(CONFIG_X86_64) && mmu_type != KVM_SHADOW_MMU;

to avoid referencing KVM_TDP_MMU, but that's quite ugly.  Overall, I think the
below strikes the best balance between polluting the code with #ifdefs, and
generating robust code.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 52bf6a886bfd..c038d7cd187d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1372,10 +1372,6 @@ enum kvm_mmu_type {
        KVM_NR_MMU_TYPES,
 };
 
-#ifndef CONFIG_X86_64
-#define KVM_TDP_MMU -1
-#endif
-
 struct kvm_arch {
        unsigned long n_used_mmu_pages;
        unsigned long n_requested_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a6a1fb42b2d1..e2bde6a5e346 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7624,8 +7624,14 @@ static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
 static void kvm_recover_nx_huge_pages(struct kvm *kvm,
                                      const enum kvm_mmu_type mmu_type)
 {
+#ifdef CONFIG_X86_64
+       const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
+       spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
+#else
+       const bool is_tdp_mmu = false;
+       spinlock_t *tdp_mmu_pages_lock = NULL;
+#endif
        unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
-       bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
        struct list_head *nx_huge_pages;
        struct kvm_mmu_page *sp;
        LIST_HEAD(invalid_list);
@@ -7648,15 +7654,12 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
        rcu_read_lock();
 
        for ( ; to_zap; --to_zap) {
-#ifdef CONFIG_X86_64
                if (is_tdp_mmu)
-                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-#endif
+                       spin_lock(tdp_mmu_pages_lock);
+
                if (list_empty(nx_huge_pages)) {
-#ifdef CONFIG_X86_64
                        if (is_tdp_mmu)
-                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-#endif
+                               spin_unlock(tdp_mmu_pages_lock);
                        break;
                }
 
@@ -7675,10 +7678,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 
                unaccount_nx_huge_page(kvm, sp);
 
-#ifdef CONFIG_X86_64
                if (is_tdp_mmu)
-                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-#endif
+                       spin_unlock(tdp_mmu_pages_lock);
 
                /*
                 * Do not attempt to recover any NX Huge Pages that are being
--

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-07-28 21:38         ` Sean Christopherson
@ 2025-07-28 21:48           ` James Houghton
  2025-08-01 18:17             ` David Matlack
  0 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-07-28 21:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, Paolo Bonzini, Vipin Sharma, kvm, linux-kernel

On Mon, Jul 28, 2025 at 2:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 28, 2025, David Matlack wrote:
> > On Mon, Jul 28, 2025 at 11:08 AM James Houghton <jthoughton@google.com> wrote:
> > > On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > > >       rcu_read_lock();
> > > > >
> > > > >       for ( ; to_zap; --to_zap) {
> > > > > -             if (list_empty(nx_huge_pages))
> > > > > +#ifdef CONFIG_X86_64
> > > >
> > > > These #ifdefs still make me sad, but I also still think they're the least awful
> > > > solution.  And hopefully we will jettison 32-bit sooner than later :-)
> > >
> > > Yeah I couldn't come up with anything better. :(
> >
> > Could we just move the definition of tdp_mmu_pages_lock outside of
> > CONFIG_X86_64? The only downside I can think of is slightly larger kvm
> > structs for 32-bit builds.
>
> Hmm, I was going to say "no, because we'd also need to do spin_lock_init()", but
> obviously spin_(un)lock() will only ever be invoked for 64-bit kernels.  I still
> don't love the idea of making tdp_mmu_pages_lock visible outside of CONFIG_X86_64,
> it feels like we're just asking to introduce (likely benign) bugs.
>
> Ugh, and I just noticed this as well:
>
>   #ifndef CONFIG_X86_64
>   #define KVM_TDP_MMU -1
>   #endif
>
> Rather than expose kvm->arch.tdp_mmu_pages_lock, what about using a single #ifdef
> section to bury both is_tdp_mmu and a local kvm->arch.tdp_mmu_pages_lock pointer?

SGTM.

>
> Alternatively, we could do:
>
>         const bool is_tdp_mmu = IS_ENABLED(CONFIG_X86_64) && mmu_type != KVM_SHADOW_MMU;

I tried something like this before and it didn't work; my compiler
still complained. Maybe I didn't do it quite right...

>
> to avoid referencing KVM_TDP_MMU, but that's quite ugly.  Overall, I think the
> below strikes the best balance between polluting the code with #ifdefs, and
> generating robust code.
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 52bf6a886bfd..c038d7cd187d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1372,10 +1372,6 @@ enum kvm_mmu_type {
>         KVM_NR_MMU_TYPES,
>  };
>
> -#ifndef CONFIG_X86_64
> -#define KVM_TDP_MMU -1
> -#endif
> -
>  struct kvm_arch {
>         unsigned long n_used_mmu_pages;
>         unsigned long n_requested_mmu_pages;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a6a1fb42b2d1..e2bde6a5e346 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7624,8 +7624,14 @@ static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
>  static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>                                       const enum kvm_mmu_type mmu_type)
>  {
> +#ifdef CONFIG_X86_64
> +       const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> +       spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
> +#else
> +       const bool is_tdp_mmu = false;
> +       spinlock_t *tdp_mmu_pages_lock = NULL;
> +#endif
>         unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
> -       bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
>         struct list_head *nx_huge_pages;
>         struct kvm_mmu_page *sp;
>         LIST_HEAD(invalid_list);
> @@ -7648,15 +7654,12 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>         rcu_read_lock();
>
>         for ( ; to_zap; --to_zap) {
> -#ifdef CONFIG_X86_64
>                 if (is_tdp_mmu)
> -                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> -#endif
> +                       spin_lock(tdp_mmu_pages_lock);
> +
>                 if (list_empty(nx_huge_pages)) {
> -#ifdef CONFIG_X86_64
>                         if (is_tdp_mmu)
> -                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> -#endif
> +                               spin_unlock(tdp_mmu_pages_lock);
>                         break;
>                 }
>
> @@ -7675,10 +7678,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>
>                 unaccount_nx_huge_page(kvm, sp);
>
> -#ifdef CONFIG_X86_64
>                 if (is_tdp_mmu)
> -                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> -#endif
> +                       spin_unlock(tdp_mmu_pages_lock);
>
>                 /*
>                  * Do not attempt to recover any NX Huge Pages that are being
> --

LGTM! Thanks Sean.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance
  2025-07-23 20:50   ` Sean Christopherson
@ 2025-07-29  0:18     ` James Houghton
  0 siblings, 0 replies; 29+ messages in thread
From: James Houghton @ 2025-07-29  0:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Wed, Jul 23, 2025 at 1:50 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > From: David Matlack <dmatlack@google.com>
> >
> > Introduce a new selftest, execute_perf_test, that uses the
> > perf_test_util framework to measure the performance of executing code
> > within a VM. This test is similar to the other perf_test_util-based
> > tests in that it spins up a variable number of vCPUs and runs them
> > concurrently, accessing memory.
> >
> > In order to support execution, extend perf_test_util to populate guest
> > memory with return instructions rather than random garbage. This way
> > memory can be execute simply by calling it.
> >
> > Currently only x86_64 supports execution, but other architectures can be
> > easily added by providing their return code instruction.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  tools/testing/selftests/kvm/Makefile.kvm      |   1 +
> >  .../testing/selftests/kvm/execute_perf_test.c | 199 ++++++++++++++++++
>
> Honest question, is there really no way to dedup memstress tests?  This seems
> like an insane amount of code just to call memstress_set_execute().

The main pieces of this test come out to:
1. Parse arguments
2. Create a VM
3. Run a couple iterations of memory access
4. Run a few execution iterations
5. Destroy the VM

(1) is slightly difficult to de-duplicate, because the other tests
have more arguments, and so you'd have to look in two places for a
full argument list (in the source code at least).

(2+5) Creating and destroying a VM is already not much code, IMO.

(3+4) Running iterations of guest memory accesses is a pretty good
candidate for de-duplication.

The other memstress tests are:
- access_tracking_perf_test.c
- demand_paging_test.c
- dirty_log_perf_test.c
- memslot_modification_stress_test.c
- x86/dirty_log_page_splitting_test.c

Three of these tests use similar iteration logic (all but
memslot_modification_stress_test.c and demand_paging_test.c).

I could make memstress_start_vcpu_threads() take a pointer to the main
vCPU thread logic, put some iteration logic around that, and then
provide a "run_iteration" interface in memstress. The two tests that
don't really iterate multiple times can just iterate once.
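
Roughly this kind of shape (a sketch only; memstress_run_iteration() and the
callback typedef are made up for illustration):

	/* memstress.h */
	typedef void (*memstress_iter_fn_t)(struct memstress_vcpu_args *args);

	/*
	 * Run one iteration of vcpu_fn on all vCPU threads and wait for every
	 * thread to finish that iteration before returning.
	 */
	void memstress_run_iteration(int nr_vcpus, memstress_iter_fn_t vcpu_fn);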

I'm not really sure how much *better* that is, but I can at least give
it a go and see how it looks.

If you're happy with the nx hugepage test[1] in patch #7, feel free to
apply that one without waiting for this one.

[1]: https://lore.kernel.org/kvm/20250707224720.4016504-8-jthoughton@google.com/

>
> >  .../testing/selftests/kvm/include/memstress.h |   4 +
> >  tools/testing/selftests/kvm/lib/memstress.c   |  25 ++-
> >  4 files changed, 227 insertions(+), 2 deletions(-)
> >  create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c
> >
> > diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> > index 38b95998e1e6b..0dc435e944632 100644
> > --- a/tools/testing/selftests/kvm/Makefile.kvm
> > +++ b/tools/testing/selftests/kvm/Makefile.kvm
> > @@ -137,6 +137,7 @@ TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test
> >  TEST_GEN_PROGS_x86 += access_tracking_perf_test
> >  TEST_GEN_PROGS_x86 += coalesced_io_test
> >  TEST_GEN_PROGS_x86 += dirty_log_perf_test
> > +TEST_GEN_PROGS_x86 += execute_perf_test
>
> How about call_ret_perf_test instead of execute_perf_test?  I like that "execute"
> aligns with "read" and "write", but as a test name it ends up being quite ambiguous.

call_ret_perf_test to me sounds quite x86-specific, and although the
test currently only supports x86, I think we might as well name it
something more generic in case it becomes useful to support other
architectures.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock
  2025-07-23 20:44 ` [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock Sean Christopherson
@ 2025-07-29  0:19   ` James Houghton
  0 siblings, 0 replies; 29+ messages in thread
From: James Houghton @ 2025-07-29  0:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Wed, Jul 23, 2025 at 1:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > David Matlack (1):
> >   KVM: selftests: Introduce a selftest to measure execution performance
> >
> > James Houghton (3):
> >   KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU
> >   KVM: selftests: Provide extra mmap flags in vm_mem_add()
> >   KVM: selftests: Add an NX huge pages jitter test
> >
> > Vipin Sharma (3):
> >   KVM: x86/mmu: Track TDP MMU NX huge pages separately
> >   KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its
> >     purpose
> >   KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
>
> The KVM changes look good, no need for a v5 on that front (I'll do minor fixup
> when applying, which will be a few weeks from now, after 6.17-rc1).  Still
> working through the selftests.

Thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test
  2025-07-28 18:40     ` James Houghton
@ 2025-08-01 14:11       ` Sean Christopherson
  2025-08-01 18:45         ` James Houghton
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-08-01 14:11 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Mon, Jul 28, 2025, James Houghton wrote:
> On Wed, Jul 23, 2025 at 2:04 PM Sean Christopherson <seanjc@google.com> wrote:
> > On Mon, Jul 07, 2025, James Houghton wrote:
> > Right, but we also don't want to wait for the initial fault-in either, no?  I.e.
> > plumbing in MAP_POPULATE only fixes the worst of the delay, and maybe only with
> > the TDP MMU enabled.
> >
> > In other words, it seems like we need a helper (option?) to explicitly "prefault"
> > all memory from within the guest, not the ability to specify MAP_POPULATE.
> 
> I don't want the EPT to be populated.
> 
> In the event of a hugepage being executed, consider another memory
> access. The access can either (1) be in the executed-from hugepage or
> (2) be somewhere else.
> 
> For (1), the optimization in this series doesn't help; we will often
> be stuck behind the hugepage either being destroyed or reconstructed.
> 
> For (2), the optimization in this series is an improvement, and that's
> what this test is trying to demonstrate. But this is only true if the
> EPT does not have a valid mapping for the GPA we tried to use. If it
> does, the access will just proceed like normal.
> 
> This test only times these "case 2" accesses. Now if we didn't have
> MAP_POPULATE, then (non-fast) GUP time appears in these results, which
> (IIRC) adds so much noise that the improvement is difficult to
> ascertain. But with MAP_POPULATE, the difference is very clear.

Oh, right, the whole point is to measure fault-in performance.

In that case, rather than MAP_POPULATE, can we do the slightly more standard (for
VMMs) thing of writing (or reading) memory from host userspace?  I don't think it's
worth plumbing in extra_mmap_flags just for MAP_POPULATE, in no small part because
MAP_POPULATE is effectively best effort, and doesn't work for VM_PFNMAP (or VM_IO).

Those quirks shouldn't matter for this case, and _probably_ won't ever matter for
any KVM selftest, but it's enough to make me think MAP_POPULATE is a pattern we
don't want to encourage.
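
E.g. something like this after vm_mem_add(), purely as a sketch
(prefault_backing_memory() is not an existing helper; addr_gpa2hva() is the
existing gpa->hva lookup):

	static void prefault_backing_memory(struct kvm_vm *vm, uint64_t gpa,
					    uint64_t nr_pages, uint64_t page_size)
	{
		uint64_t i;

		/* Write each host page so it is faulted in before the guest runs. */
		for (i = 0; i < nr_pages; i++)
			*(volatile char *)addr_gpa2hva(vm, gpa + i * page_size) = 0;
	}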

> This test is 100% contrived to consistently reproduce the memory
> access timing anomalies that Vipin demonstrated with his initial
> posting of this series[1].

Eh, contrived, targeted, same thing :-D

> [1]: https://lore.kernel.org/kvm/20240906204515.3276696-3-vipinsh@google.com/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-07-28 21:48           ` James Houghton
@ 2025-08-01 18:17             ` David Matlack
  2025-08-01 22:00               ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: David Matlack @ 2025-08-01 18:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Sean Christopherson, Paolo Bonzini, Vipin Sharma, kvm,
	linux-kernel

On Mon, Jul 28, 2025 at 2:49 PM James Houghton <jthoughton@google.com> wrote:
>
> On Mon, Jul 28, 2025 at 2:38 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Jul 28, 2025, David Matlack wrote:
> > > On Mon, Jul 28, 2025 at 11:08 AM James Houghton <jthoughton@google.com> wrote:
> > > > On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > > > >       rcu_read_lock();
> > > > > >
> > > > > >       for ( ; to_zap; --to_zap) {
> > > > > > -             if (list_empty(nx_huge_pages))
> > > > > > +#ifdef CONFIG_X86_64
> > > > >
> > > > > These #ifdefs still make me sad, but I also still think they're the least awful
> > > > > solution.  And hopefully we will jettison 32-bit sooner than later :-)
> > > >
> > > > Yeah I couldn't come up with anything better. :(
> > >
> > > Could we just move the definition of tdp_mmu_pages_lock outside of
> > > CONFIG_X86_64? The only downside I can think of is slightly larger kvm
> > > structs for 32-bit builds.
> >
> > Hmm, I was going to say "no, because we'd also need to do spin_lock_init()", but
> > obviously spin_(un)lock() will only ever be invoked for 64-bit kernels.  I still
> > don't love the idea of making tdp_mmu_pages_lock visible outside of CONFIG_X86_64,
> > it feels like we're just asking to introduce (likely benign) bugs.
> >
> > Ugh, and I just noticed this as well:
> >
> >   #ifndef CONFIG_X86_64
> >   #define KVM_TDP_MMU -1
> >   #endif
> >
> > Rather than expose kvm->arch.tdp_mmu_pages_lock, what about using a single #ifdef
> > section to bury both is_tdp_mmu and a local kvm->arch.tdp_mmu_pages_lock pointer?
>
> SGTM.
>
> >
> > Alternatively, we could do:
> >
> >         const bool is_tdp_mmu = IS_ENABLED(CONFIG_X86_64) && mmu_type != KVM_SHADOW_MMU;
>
> I tried something like this before and it didn't work; my compiler
> still complained. Maybe I didn't do it quite right...
>
> >
> > to avoid referencing KVM_TDP_MMU, but that's quite ugly.  Overall, I think the
> > below strikes the best balance between polluting the code with #ifdefs, and
> > generating robust code.
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 52bf6a886bfd..c038d7cd187d 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1372,10 +1372,6 @@ enum kvm_mmu_type {
> >         KVM_NR_MMU_TYPES,
> >  };
> >
> > -#ifndef CONFIG_X86_64
> > -#define KVM_TDP_MMU -1
> > -#endif
> > -
> >  struct kvm_arch {
> >         unsigned long n_used_mmu_pages;
> >         unsigned long n_requested_mmu_pages;
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a6a1fb42b2d1..e2bde6a5e346 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7624,8 +7624,14 @@ static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> >  static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >                                       const enum kvm_mmu_type mmu_type)
> >  {
> > +#ifdef CONFIG_X86_64
> > +       const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> > +       spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
> > +#else
> > +       const bool is_tdp_mmu = false;
> > +       spinlock_t *tdp_mmu_pages_lock = NULL;
> > +#endif
> >         unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
> > -       bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> >         struct list_head *nx_huge_pages;
> >         struct kvm_mmu_page *sp;
> >         LIST_HEAD(invalid_list);
> > @@ -7648,15 +7654,12 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >         rcu_read_lock();
> >
> >         for ( ; to_zap; --to_zap) {
> > -#ifdef CONFIG_X86_64
> >                 if (is_tdp_mmu)
> > -                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > -#endif
> > +                       spin_lock(tdp_mmu_pages_lock);
> > +
> >                 if (list_empty(nx_huge_pages)) {
> > -#ifdef CONFIG_X86_64
> >                         if (is_tdp_mmu)
> > -                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > -#endif
> > +                               spin_unlock(tdp_mmu_pages_lock);
> >                         break;
> >                 }
> >
> > @@ -7675,10 +7678,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >
> >                 unaccount_nx_huge_page(kvm, sp);
> >
> > -#ifdef CONFIG_X86_64
> >                 if (is_tdp_mmu)
> > -                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > -#endif
> > +                       spin_unlock(tdp_mmu_pages_lock);
> >
> >                 /*
> >                  * Do not attempt to recover any NX Huge Pages that are being
> > --
>
> LGTM! Thanks Sean.

Is the compiler not smart enough to optimize out
kvm->arch.tdp_mmu_pages_lock? (To avoid needing the extra local
variable?) I thought there was some other KVM code that relied on
similar optimizations but I would have to go dig them up to remember.

Either way, this LGTM!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test
  2025-08-01 14:11       ` Sean Christopherson
@ 2025-08-01 18:45         ` James Houghton
  2025-08-01 22:30           ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2025-08-01 18:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Fri, Aug 1, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 28, 2025, James Houghton wrote:
> > [...]
>
> Oh, right, the whole point is to measure fault-in performance.
>
> In that case, rather than MAP_POPULATE, can we do the slightly more standard (for
> VMMs) thing of writing (or reading) memory from host userspace?  I don't think it's
> worth plumbing in extra_mmap_flags just for MAP_POPULATE, in no small part because
> MAP_POPULATE is effectively best effort, and doesn't work for VM_PFNMAP (or VM_IO).
>
> Those quirks shouldn't matter for this case, and _probably_ won't ever matter for
> any KVM selftest, but it's enough to make me think MAP_POPULATE is a pattern we
> don't want to encourage.

What if vm_mem_add() just returned the VA of the added region, and
then the test could call mlock() on it? That's actually closer to what
I want; I want to avoid slow GUP as much as I can (though mlock()
isn't perfect, it's pretty much as good as we can do). So we can
write:

  char *mem = vm_mem_add(...);
  mlock(mem, size);

instead of

  char *mem = vm_mem_add(...);
  for (tmp = mem; tmp < mem + size; tmp += backing_src_pagesz)
       *(volatile char *)tmp = 0;
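
As a rough standalone sketch (not selftest code; the helper name is made up), this is
the mlock()-based prefault in isolation: mlock() both faults the pages in and pins
them, so later guest accesses shouldn't go through slow GUP, with RLIMIT_MEMLOCK being
the main caveat.

  #include <stdio.h>
  #include <sys/mman.h>

  /* Prefault and pin an already-mapped region; returns 0 on success. */
  static int prefault_with_mlock(void *mem, size_t size)
  {
          if (mlock(mem, size)) {
                  /* Typically EPERM/ENOMEM when RLIMIT_MEMLOCK is too small. */
                  perror("mlock");
                  return -1;
          }
          return 0;
  }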

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-08-01 18:17             ` David Matlack
@ 2025-08-01 22:00               ` Sean Christopherson
  2025-08-12 19:21                 ` David Matlack
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-08-01 22:00 UTC (permalink / raw)
  To: David Matlack
  Cc: James Houghton, Paolo Bonzini, Vipin Sharma, kvm, linux-kernel

On Fri, Aug 01, 2025, David Matlack wrote:
> On Mon, Jul 28, 2025 at 2:49 PM James Houghton <jthoughton@google.com> wrote:
> > > [...]
> >
> > LGTM! Thanks Sean.
> 
> Is the compiler not smart enough to optimize out kvm->arch.tdp_mmu_pages_lock?

Yes, the compiler will eliminate dead code at most optimization levels.  But that
optimization phase happens after initial compilation, i.e. the compiler needs to
generate the (probably intermediate?) code before it can trim away paths that are
unreachable.

> (To avoid needing the extra local variable?) I thought there was some other
> KVM code that relied on similar optimizations but I would have to go dig them
> up to remember.

KVM, and the kernel, absolutely relies on dead code elimination.  KVM most blatantly
uses the technique to avoid _defining_ stubs for code that is guarded by a Kconfig,
e.g. all of these functions are defined in sev.c (guarded by CONFIG_KVM_AMD_SEV),
but callers are guarded only with sev_guest() or sev_es_guest(), not with explicit
#ifdefs.

There are no build errors because the function calls aren't fully resolved until
link time (when svm.o is linked into kvm-amd.o).  But KVM still needs to _declare_
the functions, otherwise the compiler would fail during its initial code generation.

  int pre_sev_run(struct vcpu_svm *svm, int cpu);
  void sev_init_vmcb(struct vcpu_svm *svm);
  void sev_vcpu_after_set_cpuid(struct vcpu_svm *svm);
  int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in);
  void sev_es_vcpu_reset(struct vcpu_svm *svm);
  void sev_es_recalc_msr_intercepts(struct kvm_vcpu *vcpu);
  void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
  void sev_es_prepare_switch_to_guest(struct vcpu_svm *svm, struct sev_es_save_area *hostsa);
  void sev_es_unmap_ghcb(struct vcpu_svm *svm);
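
A minimal kernel-style sketch of the pattern (the function and config names here are
hypothetical, not the real SEV hooks): the call site compiles because the function is
declared, and when the Kconfig is off the constant-false guard lets dead code
elimination drop the call before link time, so the missing definition never matters.

  #include <linux/kconfig.h>

  /* Declared here; defined only in an object built when the config is set. */
  void sketch_conditional_op(void);

  static void sketch_caller(void)
  {
          /*
           * IS_ENABLED() folds to 0 when the option is off, so the branch is
           * provably dead and the undefined call is eliminated, not linked.
           */
          if (IS_ENABLED(CONFIG_SKETCH_FEATURE))
                  sketch_conditional_op();
  }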

Other notable "users" of dead code elimination are the BUILD_BUG_ON() family of
compile-time asserts.  So long as the condition can be resolved to a constant
false value during compile time, the "call" to __compiletime_error() will be
elided and everyone is happy.

  #ifdef __OPTIMIZE__
  /*
   * #ifdef __OPTIMIZE__ is only a good approximation; for instance "make
   * CFLAGS_foo.o=-Og" defines __OPTIMIZE__, does not elide the conditional code
   * and can break compilation with wrong error message(s). Combine with
   * -U__OPTIMIZE__ when needed.
   */
  # define __compiletime_assert(condition, msg, prefix, suffix)		\
	do {								\
		/*							\
		 * __noreturn is needed to give the compiler enough	\
		 * information to avoid certain possibly-uninitialized	\
		 * warnings (regardless of the build failing).		\
		 */							\
		__noreturn extern void prefix ## suffix(void)		\
			__compiletime_error(msg);			\
		if (!(condition))					\
			prefix ## suffix();				\
	} while (0)
  #else
  # define __compiletime_assert(condition, msg, prefix, suffix) ((void)(condition))
  #endif

Note, static_assert() is different in that it's a true assertion that's resolved
early on during compilation.

 * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
 * scope, but requires the expression to be an integer constant
 * expression (i.e., it is not enough that __builtin_constant_p() is
 * true for expr).


From a previous thread related to asserts (https://lore.kernel.org/all/aFGY0KVUksf1a6xB@google.com):

 : The advantage of BUILD_BUG_ON() is that it works so long as the condition is
 : compile-time constant, whereas static_assert() requires the condition to an
 : compile-time constant, whereas static_assert() requires the condition to be an
 : condition is eventually resolved to a constant, whereas static_assert() has
 : stricter requirements.
 : 
 : E.g. the fls64() assert below is fully resolved at compile time, but isn't a
 : purely constant expression, i.e. that one *needs* to be BUILD_BUG_ON().
 : 
 : --
 : arch/x86/kvm/svm/avic.c: In function ‘avic_init_backing_page’:
 : arch/x86/kvm/svm/avic.c:293:45: error: expression in static assertion is not constant
 :   293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
 : include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
 :    78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
 :       |                                                        ^~~~
 : arch/x86/kvm/svm/avic.c:293:9: note: in expansion of macro ‘static_assert’
 :   293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
 :       |         ^~~~~~~~~~~~~
 : make[5]: *** [scripts/Makefile.build:203: arch/x86/kvm/svm/avic.o] Error 1
 : --
 : 
 : The downside of BUILD_BUG_ON() is that it can't be used at global scope, i.e.
 : needs to be called from a function.
 : 
 : As a result, when adding an assertion in a function, using BUILD_BUG_ON() is
 : slightly preferred, because it's less likely to break in the future.  E.g. if
 : X2AVIC_MAX_PHYSICAL_ID were changed to something that is a compile-time constant,
 : but for whatever reason isn't a pure integer constant.
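
To make the contrast concrete, a small sketch in kernel context (the macro name and the
10-bit limit are made up for illustration): the file-scope assert must be an integer
constant expression, while the in-function BUILD_BUG_ON() only needs something the
compiler can eventually fold to a constant, e.g. fls64() on a constant.

  #include <linux/build_bug.h>
  #include <linux/bitops.h>

  #define SKETCH_MAX_ID   511

  /* Integer constant expression: fine at file scope. */
  static_assert(SKETCH_MAX_ID < 1024, "SKETCH_MAX_ID must fit in 10 bits");

  static void sketch_asserts(void)
  {
          /*
           * fls64() is not an integer constant expression, but the result is
           * known at compile time, so BUILD_BUG_ON() can check it here.
           */
          BUILD_BUG_ON(fls64(SKETCH_MAX_ID) > 10);
  }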

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test
  2025-08-01 18:45         ` James Houghton
@ 2025-08-01 22:30           ` Sean Christopherson
  0 siblings, 0 replies; 29+ messages in thread
From: Sean Christopherson @ 2025-08-01 22:30 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Fri, Aug 01, 2025, James Houghton wrote:
> On Fri, Aug 1, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > [...]
> 
> What if vm_mem_add() just returned the VA of the added region, and
> then the test could call mlock() on it? 

I like it!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
  2025-08-01 22:00               ` Sean Christopherson
@ 2025-08-12 19:21                 ` David Matlack
  0 siblings, 0 replies; 29+ messages in thread
From: David Matlack @ 2025-08-12 19:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Paolo Bonzini, Vipin Sharma, kvm, linux-kernel

On Fri, Aug 1, 2025 at 3:00 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Aug 01, 2025, David Matlack wrote:
> > On Mon, Jul 28, 2025 at 2:49 PM James Houghton <jthoughton@google.com> wrote:
> > > > [...]
> > >
> > > LGTM! Thanks Sean.
> >
> > Is the compiler not smart enough to optimize out kvm->arch.tdp_mmu_pages_lock?
>
> [...]

Thank you so much for the detailed explanation!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately
  2025-07-07 22:47 ` [PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately James Houghton
@ 2025-08-19 17:57   ` Sean Christopherson
  0 siblings, 0 replies; 29+ messages in thread
From: Sean Christopherson @ 2025-08-19 17:57 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm, linux-kernel

On Mon, Jul 07, 2025, James Houghton wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4e06e2e89a8fa..f44d7f3acc179 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -65,9 +65,9 @@ int __read_mostly nx_huge_pages = -1;
>  static uint __read_mostly nx_huge_pages_recovery_period_ms;
>  #ifdef CONFIG_PREEMPT_RT
>  /* Recovery can cause latency spikes, disable it for PREEMPT_RT.  */
> -static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
> +unsigned int __read_mostly nx_huge_pages_recovery_ratio;
>  #else
> -static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
> +unsigned int __read_mostly nx_huge_pages_recovery_ratio = 60;

Spurious changes.

>  #endif
>  
>  static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);

...

> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index db8f33e4de624..a8fd2de13f707 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -413,7 +413,10 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>  
> -void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> -void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +				 enum kvm_mmu_type mmu_type);
> +void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +				   enum kvm_mmu_type mmu_type);
>  
> +extern unsigned int nx_huge_pages_recovery_ratio;

And here as well.  I'll fixup when applying.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock
  2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
                   ` (7 preceding siblings ...)
  2025-07-23 20:44 ` [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock Sean Christopherson
@ 2025-08-19 23:12 ` Sean Christopherson
  8 siblings, 0 replies; 29+ messages in thread
From: Sean Christopherson @ 2025-08-19 23:12 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, James Houghton
  Cc: Vipin Sharma, David Matlack, kvm, linux-kernel

On Mon, 07 Jul 2025 22:47:13 +0000, James Houghton wrote:
> I'm finishing off Vipin's NX huge page recovery optimization for the TDP
> MMU from last year. This is a respin on the series I sent a couple weeks
> ago, v4. Below is a mostly unchanged cover letter from v4.
> 
> NX huge page recovery can cause guest performance jitter, originally
> noticed with network tests in Windows guests. Please see Vipin's earlier
> performance results[1]. Below is some new data I have collected with the
> nx_huge_pages_perf_test that I've included with this series.
> 
> [...]

Applied 1-3 to kvm-x86 mmu, with the fixups described (hopefully I got 'em all)

Thanks!

[1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately
      https://github.com/kvm-x86/linux/commit/6777885605e1
[2/7] KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its purpose
      https://github.com/kvm-x86/linux/commit/62105564226e
[3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
      https://github.com/kvm-x86/linux/commit/a57750909580

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-08-19 23:16 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
2025-07-07 22:47 ` [PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately James Houghton
2025-08-19 17:57   ` Sean Christopherson
2025-07-07 22:47 ` [PATCH v5 2/7] KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its purpose James Houghton
2025-07-07 22:47 ` [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock James Houghton
2025-07-23 20:34   ` Sean Christopherson
2025-07-28 18:07     ` James Houghton
2025-07-28 18:17       ` David Matlack
2025-07-28 21:38         ` Sean Christopherson
2025-07-28 21:48           ` James Houghton
2025-08-01 18:17             ` David Matlack
2025-08-01 22:00               ` Sean Christopherson
2025-08-12 19:21                 ` David Matlack
2025-07-07 22:47 ` [PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU James Houghton
2025-07-23 20:38   ` Sean Christopherson
2025-07-28 17:51     ` James Houghton
2025-07-07 22:47 ` [PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance James Houghton
2025-07-23 20:50   ` Sean Christopherson
2025-07-29  0:18     ` James Houghton
2025-07-07 22:47 ` [PATCH v5 6/7] KVM: selftests: Provide extra mmap flags in vm_mem_add() James Houghton
2025-07-07 22:47 ` [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test James Houghton
2025-07-23 21:04   ` Sean Christopherson
2025-07-28 18:40     ` James Houghton
2025-08-01 14:11       ` Sean Christopherson
2025-08-01 18:45         ` James Houghton
2025-08-01 22:30           ` Sean Christopherson
2025-07-23 20:44 ` [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock Sean Christopherson
2025-07-29  0:19   ` James Houghton
2025-08-19 23:12 ` Sean Christopherson
