kvm.vger.kernel.org archive mirror
* [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log
@ 2024-08-05 23:31 David Matlack
  2024-08-05 23:31 ` [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs" David Matlack
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: David Matlack @ 2024-08-05 23:31 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack

Rework the TDP MMU disable-dirty-log path to batch TLB flushes and
recover huge page mappings, rather than zapping and flushing for every
potential huge page mapping.

With this series, dirty_log_perf_test shows a decrease in the time it
takes to disable dirty logging, as well as a decrease in the number of
vCPU faults:

 $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

 Before: Disabling dirty logging time: 14.334453428s (131072 flushes)
 After:  Disabling dirty logging time: 4.794969689s  (76 flushes)

 Before: 393,599      kvm:kvm_page_fault
 After:  262,575      kvm:kvm_page_fault
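
For reference, the core of the reworked recovery loop after this series
looks roughly as follows (condensed from the diffs in patches 3-4; the
dirty-logging WARN from patch 6 and the SPTE re-check from patch 7 are
omitted):

	tdp_root_for_each_pte(iter, root, start, end) {
		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
			flush = false;
			continue;
		}

		/* Only leaf SPTEs mapped below their max level are interesting. */
		if (!is_shadow_present_pte(iter.old_spte) ||
		    !is_last_spte(iter.old_spte, iter.level))
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
		if (max_mapping_level == iter.level)
			continue;

		/* Build the huge SPTE from the leaf SPTE, then step up to install it. */
		huge_spte = make_huge_spte(kvm, iter.old_spte, max_mapping_level);
		while (max_mapping_level > iter.level)
			tdp_iter_step_up(&iter);

		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
			flush = true;
	}

	if (flush)
		kvm_flush_remote_tlbs_memslot(kvm, slot);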

David Matlack (7):
  Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping
    collapsible SPTEs"
  KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level()
  KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs
  KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of
    zapping
  KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte()
  KVM: x86/mmu: WARN if huge page recovery triggered during dirty
    logging
  KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery

 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/mmu/mmu.c          |  16 ++--
 arch/x86/kvm/mmu/mmu_internal.h |   3 +-
 arch/x86/kvm/mmu/spte.c         |  40 ++++++++--
 arch/x86/kvm/mmu/spte.h         |   5 +-
 arch/x86/kvm/mmu/tdp_iter.c     |   9 +++
 arch/x86/kvm/mmu/tdp_iter.h     |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 127 ++++++++++++++------------------
 arch/x86/kvm/mmu/tdp_mmu.h      |   4 +-
 arch/x86/kvm/x86.c              |  18 ++---
 10 files changed, 121 insertions(+), 106 deletions(-)


base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
-- 
2.46.0.rc2.264.g509ed76dc8-goog


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs"
  2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
@ 2024-08-05 23:31 ` David Matlack
  2024-08-16 22:47   ` Sean Christopherson
  2024-08-05 23:31 ` [PATCH 2/7] KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level() David Matlack
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2024-08-05 23:31 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack

This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.

Bring back the logic that walks down to leafs when zapping collapsible
SPTEs. Stepping down to leafs is technically unnecessary when zapping,
but the leaf SPTE will be used in a subsequent commit to construct a
huge SPTE and recover the huge mapping in place.

Note, this revert does not revert the function comment changes above
zap_collapsible_spte_range() and kvm_tdp_mmu_zap_collapsible_sptes()
since those are still relevant.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.c |  9 +++++++
 arch/x86/kvm/mmu/tdp_iter.h |  1 +
 arch/x86/kvm/mmu/tdp_mmu.c  | 47 ++++++++++++++++++-------------------
 3 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 04c247bfe318..1279babbc72c 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -142,6 +142,15 @@ static bool try_step_up(struct tdp_iter *iter)
 	return true;
 }
 
+/*
+ * Step the iterator back up a level in the paging structure. Should only be
+ * used when the iterator is below the root level.
+ */
+void tdp_iter_step_up(struct tdp_iter *iter)
+{
+	WARN_ON(!try_step_up(iter));
+}
+
 /*
  * Step to the next SPTE in a pre-order traversal of the paging structure.
  * To get to the next SPTE, the iterator either steps down towards the goal
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 2880fd392e0c..821fde2ac7b0 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -136,5 +136,6 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
+void tdp_iter_step_up(struct tdp_iter *iter);
 
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c7dc49ee7388..ebe2ab3686c7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1628,49 +1628,48 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
-retry:
+	tdp_root_for_each_pte(iter, root, start, end) {
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
-		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
-		    !is_shadow_present_pte(iter.old_spte))
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
+		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
+							      iter.gfn, PG_LEVEL_NUM);
+
+		WARN_ON(max_mapping_level < iter.level);
+
 		/*
-		 * Don't zap leaf SPTEs, if a leaf SPTE could be replaced with
-		 * a large page size, then its parent would have been zapped
-		 * instead of stepping down.
+		 * If this page is already mapped at the highest
+		 * viable level, there's nothing more to do.
 		 */
-		if (is_last_spte(iter.old_spte, iter.level))
+		if (max_mapping_level == iter.level)
 			continue;
 
 		/*
-		 * If iter.gfn resides outside of the slot, i.e. the page for
-		 * the current level overlaps but is not contained by the slot,
-		 * then the SPTE can't be made huge.  More importantly, trying
-		 * to query that info from slot->arch.lpage_info will cause an
-		 * out-of-bounds access.
+		 * The page can be remapped at a higher level, so step
+		 * up to zap the parent SPTE.
 		 */
-		if (iter.gfn < start || iter.gfn >= end)
-			continue;
-
-		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
-							      iter.gfn, PG_LEVEL_NUM);
-		if (max_mapping_level < iter.level)
-			continue;
+		while (max_mapping_level > iter.level)
+			tdp_iter_step_up(&iter);
 
 		/* Note, a successful atomic zap also does a remote TLB flush. */
-		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
-			goto retry;
+		(void)tdp_mmu_zap_spte_atomic(kvm, &iter);
+
+		/*
+		 * If the atomic zap fails, the iter will recurse back into
+		 * the same subtree to retry.
+		 */
 	}
 
 	rcu_read_unlock();
 }
 
 /*
- * Zap non-leaf SPTEs (and free their associated page tables) which could
- * be replaced by huge pages, for GFNs within the slot.
+ * Zap non-leaf SPTEs (and free their associated page tables) which could be
+ * replaced by huge pages, for GFNs within the slot.
  */
 void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       const struct kvm_memory_slot *slot)
-- 
2.46.0.rc2.264.g509ed76dc8-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/7] KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level()
  2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
  2024-08-05 23:31 ` [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs" David Matlack
@ 2024-08-05 23:31 ` David Matlack
  2024-08-05 23:31 ` [PATCH 3/7] KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs David Matlack
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2024-08-05 23:31 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack

Drop the @max_level parameter from kvm_mmu_max_mapping_level(). All
callers pass in PG_LEVEL_NUM, so @max_level can be replaced with
PG_LEVEL_NUM in the function body.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 8 +++-----
 arch/x86/kvm/mmu/mmu_internal.h | 3 +--
 arch/x86/kvm/mmu/tdp_mmu.c      | 4 +---
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..1b4e14ac512b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3127,13 +3127,12 @@ static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
 }
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
-			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level)
+			      const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	bool is_private = kvm_slot_can_be_private(slot) &&
 			  kvm_mem_is_private(kvm, gfn);
 
-	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
+	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, PG_LEVEL_NUM, is_private);
 }
 
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6890,8 +6889,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * mapping if the indirect sp has level = 1.
 		 */
 		if (sp->role.direct &&
-		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
-							       PG_LEVEL_NUM)) {
+		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn)) {
 			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
 
 			if (kvm_available_flush_remote_tlbs_range())
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1721d97743e9..fee385e75405 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -344,8 +344,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 }
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
-			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level);
+			      const struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ebe2ab3686c7..f881e79243b3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1636,9 +1636,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
-		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
-							      iter.gfn, PG_LEVEL_NUM);
-
+		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
 		WARN_ON(max_mapping_level < iter.level);
 
 		/*
-- 
2.46.0.rc2.264.g509ed76dc8-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/7] KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs
  2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
  2024-08-05 23:31 ` [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs" David Matlack
  2024-08-05 23:31 ` [PATCH 2/7] KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level() David Matlack
@ 2024-08-05 23:31 ` David Matlack
  2024-08-05 23:31 ` [PATCH 4/7] KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping David Matlack
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2024-08-05 23:31 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack

Set SPTEs directly to SHADOW_NONPRESENT_VALUE and batch up TLB flushes
when zapping collapsible SPTEs, rather than freezing them first.

Freezing the SPTE first is not required. It is fine for another thread
holding mmu_lock for read to immediately install a present entry before
TLBs are flushed because the underlying mapping is not changing. vCPUs
that translate through the stale 4K mappings or a new huge page mapping
will still observe the same GPA->HPA translations.

KVM must only flush TLBs before dropping RCU (to avoid use-after-free of
the zapped page tables) and before dropping mmu_lock (to synchronize
with mmu_notifiers invalidating mappings).

In VMs backed with 2MiB pages, batching TLB flushes improves the time it
takes to zap collapsible SPTEs to disable dirty logging:

 $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

 Before: Disabling dirty logging time: 14.334453428s (131072 flushes)
 After:  Disabling dirty logging time: 4.794969689s  (76 flushes)

Skipping freezing SPTEs also avoids stalling vCPU threads on the frozen
SPTE for the time it takes to perform a remote TLB flush. vCPUs faulting
on the zapped mapping can now immediately install a new huge mapping and
proceed with guest execution.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 54 +++++++-------------------------------
 1 file changed, 9 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f881e79243b3..fad2912d3d4c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -591,48 +591,6 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	return 0;
 }
 
-static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm,
-						       struct tdp_iter *iter)
-{
-	int ret;
-
-	lockdep_assert_held_read(&kvm->mmu_lock);
-
-	/*
-	 * Freeze the SPTE by setting it to a special, non-present value. This
-	 * will stop other threads from immediately installing a present entry
-	 * in its place before the TLBs are flushed.
-	 *
-	 * Delay processing of the zapped SPTE until after TLBs are flushed and
-	 * the FROZEN_SPTE is replaced (see below).
-	 */
-	ret = __tdp_mmu_set_spte_atomic(iter, FROZEN_SPTE);
-	if (ret)
-		return ret;
-
-	kvm_flush_remote_tlbs_gfn(kvm, iter->gfn, iter->level);
-
-	/*
-	 * No other thread can overwrite the frozen SPTE as they must either
-	 * wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not
-	 * overwrite the special frozen SPTE value. Use the raw write helper to
-	 * avoid an unnecessary check on volatile bits.
-	 */
-	__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
-
-	/*
-	 * Process the zapped SPTE after flushing TLBs, and after replacing
-	 * FROZEN_SPTE with 0. This minimizes the amount of time vCPUs are
-	 * blocked by the FROZEN_SPTE and reduces contention on the child
-	 * SPTEs.
-	 */
-	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			    SHADOW_NONPRESENT_VALUE, iter->level, true);
-
-	return 0;
-}
-
-
 /*
  * tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
  * @kvm:	      KVM instance
@@ -1625,12 +1583,15 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 	gfn_t end = start + slot->npages;
 	struct tdp_iter iter;
 	int max_mapping_level;
+	bool flush = false;
 
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
+			flush = false;
 			continue;
+		}
 
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
@@ -1653,8 +1614,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		while (max_mapping_level > iter.level)
 			tdp_iter_step_up(&iter);
 
-		/* Note, a successful atomic zap also does a remote TLB flush. */
-		(void)tdp_mmu_zap_spte_atomic(kvm, &iter);
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
+			flush = true;
 
 		/*
 		 * If the atomic zap fails, the iter will recurse back into
@@ -1662,6 +1623,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		 */
 	}
 
+	if (flush)
+		kvm_flush_remote_tlbs_memslot(kvm, slot);
+
 	rcu_read_unlock();
 }
 
-- 
2.46.0.rc2.264.g509ed76dc8-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/7] KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping
  2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
                   ` (2 preceding siblings ...)
  2024-08-05 23:31 ` [PATCH 3/7] KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs David Matlack
@ 2024-08-05 23:31 ` David Matlack
  2024-08-05 23:31 ` [PATCH 5/7] KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte() David Matlack
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2024-08-05 23:31 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack

Recover TDP MMU huge page mappings in-place instead of zapping them when
dirty logging is disabled, and rename functions that recover huge page
mappings when dirty logging is disabled to move away from the "zap
collapsible spte" terminology.

Before KVM flushes TLBs, guest accesses may be translated through either
the (stale) small SPTE or the (new) huge SPTE. This is already possible
when KVM is doing eager page splitting (where TLB flushes are also
batched), and when vCPUs are faulting in huge mappings (where TLBs are
flushed after the new huge SPTE is installed).

Recovering huge pages reduces the number of page faults when dirty
logging is disabled:

 $ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

 Before: 393,599      kvm:kvm_page_fault
 After:  262,575      kvm:kvm_page_fault

vCPU throughput and the latency of disabling dirty logging are about
the same as with zapping, but avoiding the faults can help reduce vCPU
jitter in extreme scenarios.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/mmu/mmu.c          |  6 +++---
 arch/x86/kvm/mmu/spte.c         | 36 ++++++++++++++++++++++++++++++---
 arch/x86/kvm/mmu/spte.h         |  1 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 32 +++++++++++++++++------------
 arch/x86/kvm/mmu/tdp_mmu.h      |  4 ++--
 arch/x86/kvm/x86.c              | 18 +++++++----------
 7 files changed, 67 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..ed3b724db4d7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1952,8 +1952,8 @@ void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				  const struct kvm_memory_slot *memslot,
 				  u64 start, u64 end,
 				  int target_level);
-void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
-				   const struct kvm_memory_slot *memslot);
+void kvm_mmu_recover_huge_pages(struct kvm *kvm,
+				const struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1b4e14ac512b..34e59210d94e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6917,8 +6917,8 @@ static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
 		kvm_flush_remote_tlbs_memslot(kvm, slot);
 }
 
-void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
-				   const struct kvm_memory_slot *slot)
+void kvm_mmu_recover_huge_pages(struct kvm *kvm,
+				const struct kvm_memory_slot *slot)
 {
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
@@ -6928,7 +6928,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 
 	if (tdp_mmu_enabled) {
 		read_lock(&kvm->mmu_lock);
-		kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
+		kvm_tdp_mmu_recover_huge_pages(kvm, slot);
 		read_unlock(&kvm->mmu_lock);
 	}
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d4527965e48c..979387d4ebfa 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -268,15 +268,14 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return wrprot;
 }
 
-static u64 make_spte_executable(u64 spte)
+static u64 modify_spte_protections(u64 spte, u64 set, u64 clear)
 {
 	bool is_access_track = is_access_track_spte(spte);
 
 	if (is_access_track)
 		spte = restore_acc_track_spte(spte);
 
-	spte &= ~shadow_nx_mask;
-	spte |= shadow_x_mask;
+	spte = (spte | set) & ~clear;
 
 	if (is_access_track)
 		spte = mark_spte_for_access_track(spte);
@@ -284,6 +283,16 @@ static u64 make_spte_executable(u64 spte)
 	return spte;
 }
 
+static u64 make_spte_executable(u64 spte)
+{
+	return modify_spte_protections(spte, shadow_x_mask, shadow_nx_mask);
+}
+
+static u64 make_spte_nonexecutable(u64 spte)
+{
+	return modify_spte_protections(spte, shadow_nx_mask, shadow_x_mask);
+}
+
 /*
  * Construct an SPTE that maps a sub-page of the given huge page SPTE where
  * `index` identifies which sub-page.
@@ -320,6 +329,27 @@ u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 	return child_spte;
 }
 
+u64 make_huge_spte(struct kvm *kvm, u64 small_spte, int level)
+{
+	u64 huge_spte;
+
+	KVM_BUG_ON(!is_shadow_present_pte(small_spte), kvm);
+	KVM_BUG_ON(level == PG_LEVEL_4K, kvm);
+
+	huge_spte = small_spte | PT_PAGE_SIZE_MASK;
+
+	/*
+	 * huge_spte already has the address of the sub-page being collapsed
+	 * from small_spte, so just clear the lower address bits to create the
+	 * huge page address.
+	 */
+	huge_spte &= KVM_HPAGE_MASK(level) | ~PAGE_MASK;
+
+	if (is_nx_huge_page_enabled(kvm))
+		huge_spte = make_spte_nonexecutable(huge_spte);
+
+	return huge_spte;
+}
 
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 {
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ef793c459b05..498c30b6ba71 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -503,6 +503,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       bool host_writable, u64 *new_spte);
 u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 		      	      union kvm_mmu_page_role role, int index);
+u64 make_huge_spte(struct kvm *kvm, u64 small_spte, int level);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fad2912d3d4c..3f2d7343194e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1575,15 +1575,16 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
 }
 
-static void zap_collapsible_spte_range(struct kvm *kvm,
-				       struct kvm_mmu_page *root,
-				       const struct kvm_memory_slot *slot)
+static void recover_huge_pages_range(struct kvm *kvm,
+				     struct kvm_mmu_page *root,
+				     const struct kvm_memory_slot *slot)
 {
 	gfn_t start = slot->base_gfn;
 	gfn_t end = start + slot->npages;
 	struct tdp_iter iter;
 	int max_mapping_level;
 	bool flush = false;
+	u64 huge_spte;
 
 	rcu_read_lock();
 
@@ -1608,18 +1609,19 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		/*
-		 * The page can be remapped at a higher level, so step
-		 * up to zap the parent SPTE.
+		 * Construct the huge SPTE based on the small SPTE and then step
+		 * back up to install it.
 		 */
+		huge_spte = make_huge_spte(kvm, iter.old_spte, max_mapping_level);
 		while (max_mapping_level > iter.level)
 			tdp_iter_step_up(&iter);
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
 			flush = true;
 
 		/*
-		 * If the atomic zap fails, the iter will recurse back into
-		 * the same subtree to retry.
+		 * If the cmpxchg fails, the iter will recurse back into the
+		 * same subtree to retry.
 		 */
 	}
 
@@ -1630,17 +1632,21 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 }
 
 /*
- * Zap non-leaf SPTEs (and free their associated page tables) which could be
- * replaced by huge pages, for GFNs within the slot.
+ * Recover huge page mappings within the slot by replacing non-leaf SPTEs with
+ * huge SPTEs where possible.
+ *
+ * Note that all huge page mappings are recovered, including NX huge pages that
+ * were split by guest instruction fetches and huge pages that were split for
+ * dirty tracking.
  */
-void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
-				       const struct kvm_memory_slot *slot)
+void kvm_tdp_mmu_recover_huge_pages(struct kvm *kvm,
+				    const struct kvm_memory_slot *slot)
 {
 	struct kvm_mmu_page *root;
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
-		zap_collapsible_spte_range(kvm, root, slot);
+		recover_huge_pages_range(kvm, root, slot);
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1b74e058a81c..ddea2827d1ad 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -40,8 +40,8 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				       struct kvm_memory_slot *slot,
 				       gfn_t gfn, unsigned long mask,
 				       bool wrprot);
-void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
-				       const struct kvm_memory_slot *slot);
+void kvm_tdp_mmu_recover_huge_pages(struct kvm *kvm,
+				    const struct kvm_memory_slot *slot);
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index af6c8cf6a37a..b83bebe53840 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13056,19 +13056,15 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 
 	if (!log_dirty_pages) {
 		/*
-		 * Dirty logging tracks sptes in 4k granularity, meaning that
-		 * large sptes have to be split.  If live migration succeeds,
-		 * the guest in the source machine will be destroyed and large
-		 * sptes will be created in the destination.  However, if the
-		 * guest continues to run in the source machine (for example if
-		 * live migration fails), small sptes will remain around and
-		 * cause bad performance.
+		 * Recover huge page mappings in the slot now that dirty logging
+		 * is disabled, i.e. now that KVM does not have to track guest
+		 * writes at 4KiB granularity.
 		 *
-		 * Scan sptes if dirty logging has been stopped, dropping those
-		 * which can be collapsed into a single large-page spte.  Later
-		 * page faults will create the large-page sptes.
+		 * Dirty logging might be disabled by userspace if an ongoing VM
+		 * live migration is cancelled and the VM must continue running
+		 * on the source.
 		 */
-		kvm_mmu_zap_collapsible_sptes(kvm, new);
+		kvm_mmu_recover_huge_pages(kvm, new);
 	} else {
 		/*
 		 * Initially-all-set does not require write protecting any page,
-- 
2.46.0.rc2.264.g509ed76dc8-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/7] KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte()
  2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
                   ` (3 preceding siblings ...)
  2024-08-05 23:31 ` [PATCH 4/7] KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping David Matlack
@ 2024-08-05 23:31 ` David Matlack
  2024-08-05 23:31 ` [PATCH 6/7] KVM: x86/mmu: WARN if huge page recovery triggered during dirty logging David Matlack
  2024-08-05 23:31 ` [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery David Matlack
  6 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2024-08-05 23:31 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack

Rename make_huge_page_split_spte() to make_small_spte(). This ensures
that the usage of "small_spte" and "huge_spte" are consistent between
make_huge_spte() and make_small_spte().

This should also reduce some confusion as make_huge_page_split_spte()
almost reads like it will create a huge SPTE, when in fact it is
creating a small SPTE to split the huge SPTE.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 2 +-
 arch/x86/kvm/mmu/spte.c    | 4 ++--
 arch/x86/kvm/mmu/spte.h    | 4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 34e59210d94e..3610896cd9d6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6706,7 +6706,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
 			continue;
 		}
 
-		spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index);
+		spte = make_small_spte(kvm, huge_spte, sp->role, index);
 		mmu_spte_set(sptep, spte);
 		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
 	}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 979387d4ebfa..5b38b8c5ba51 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -300,8 +300,8 @@ static u64 make_spte_nonexecutable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
-			      union kvm_mmu_page_role role, int index)
+u64 make_small_spte(struct kvm *kvm, u64 huge_spte,
+		    union kvm_mmu_page_role role, int index)
 {
 	u64 child_spte = huge_spte;
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 498c30b6ba71..515d7e801f5e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -501,8 +501,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
-		      	      union kvm_mmu_page_role role, int index);
+u64 make_small_spte(struct kvm *kvm, u64 huge_spte,
+		    union kvm_mmu_page_role role, int index);
 u64 make_huge_spte(struct kvm *kvm, u64 small_spte, int level);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3f2d7343194e..9da319fd840e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1328,7 +1328,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < SPTE_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, sp->role, i);
+		sp->spt[i] = make_small_spte(kvm, huge_spte, sp->role, i);
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-- 
2.46.0.rc2.264.g509ed76dc8-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 6/7] KVM: x86/mmu: WARN if huge page recovery triggered during dirty logging
  2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
                   ` (4 preceding siblings ...)
  2024-08-05 23:31 ` [PATCH 5/7] KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte() David Matlack
@ 2024-08-05 23:31 ` David Matlack
  2024-08-05 23:31 ` [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery David Matlack
  6 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2024-08-05 23:31 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack

WARN and bail out of recover_huge_pages_range() if dirty logging is
enabled. KVM shouldn't be recovering huge pages during dirty logging
anyway, since KVM needs to track writes at 4KiB granularity. However,
it's not outside the realm of possibility that this changes in the
future.

If KVM wants to recover huge pages during dirty logging,
make_huge_spte() must be updated to write-protect the new huge page
mapping. Otherwise, writes through the newly recovered huge page mapping
will not be tracked.

Note that this potential risk did not exist back when KVM zapped to
recover huge page mappings, since subsequent accesses would just be
faulted in at PG_LEVEL_4K if dirty logging was enabled.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9da319fd840e..07d5363c9db7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1586,6 +1586,9 @@ static void recover_huge_pages_range(struct kvm *kvm,
 	bool flush = false;
 	u64 huge_spte;
 
+	if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
+		return;
+
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
-- 
2.46.0.rc2.264.g509ed76dc8-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery
  2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
                   ` (5 preceding siblings ...)
  2024-08-05 23:31 ` [PATCH 6/7] KVM: x86/mmu: WARN if huge page recovery triggered during dirty logging David Matlack
@ 2024-08-05 23:31 ` David Matlack
  2024-08-22 16:50   ` Sean Christopherson
  6 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2024-08-05 23:31 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack

Recheck the iter.old_spte still points to a page table when recovering
huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
another CPU in between stepping down and back up.

This avoids a useless cmpxchg (and possibly a remote TLB flush) if
another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
recovery worker, or vCPUs taking faults on the huge page region).

This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
thus can see a different value, which is not immediately obvious when
reading the code.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 07d5363c9db7..bdc7fd476721 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
 		while (max_mapping_level > iter.level)
 			tdp_iter_step_up(&iter);
 
+		/*
+		 * Re-check that iter.old_spte still points to a page table.
+		 * Since mmu_lock is held for read and tdp_iter_step_up()
+		 * re-reads iter.sptep, it's possible the SPTE was zapped or
+		 * recovered by another CPU in between stepping down and
+		 * stepping back up.
+		 */
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    is_last_spte(iter.old_spte, iter.level))
+			continue;
+
 		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
 			flush = true;
 
-- 
2.46.0.rc2.264.g509ed76dc8-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs"
  2024-08-05 23:31 ` [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs" David Matlack
@ 2024-08-16 22:47   ` Sean Christopherson
  2024-08-16 23:35     ` David Matlack
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-08-16 22:47 UTC (permalink / raw)
  To: David Matlack; +Cc: Paolo Bonzini, kvm

On Mon, Aug 05, 2024, David Matlack wrote:
> This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.
> 
> Bring back the logic that walks down to leafs when zapping collapsible
> SPTEs. Stepping down to leafs is technically unnecessary when zapping,
> but the leaf SPTE will be used in a subsequent commit to construct a
> huge SPTE and recover the huge mapping in place.

Please no, getting rid of the step-up code made me so happy. :-D

It's also suboptimal, e.g. in the worst case scenario (which is comically
unlikely, but theoretically possible), if the first present leaf SPTE in a 1GiB
region is the last SPTE in the last 2MiB range, and all non-leaf SPTEs are
somehow present (this is the super unlikely part), then KVM will read 512*512
SPTEs before encountering a shadow-present leaf SPTE.

The proposed approach will also ignore nx_huge_page_disallowed, and just always
create a huge NX page.  On the plus side, "free" NX hugepage recovery!  The
downside is that it means KVM is pretty much guaranteed to force the guest to
re-fault all of its code pages, and zap a non-trivial number of huge pages that
were just created.  IIRC, we deliberately did that for the zapping case, e.g. to
use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit
different than zap+create (if the guest is still using the region for code).

So rather than looking for a present leaf SPTE, what about "stopping" as soon as
KVM finds a SP that can be replaced with a huge SPTE, pre-checking
nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new
SPTE?  Or maybe even use kvm_tdp_map_page()?  Though it might be more work to
massage kvm_tdp_map_page() into a usable form.

By virtue of there being a present non-leaf SPTE, KVM knows the guest accessed
the region at some point.  And now that MTRR virtualization is gone, the only time
KVM zaps _only_ leafs is for mmu_notifiers and the APIC access page, i.e. the
odds of stepping down NOT finding a present SPTE somewhere in the region is very
small.  Lastly, kvm_mmu_max_mapping_level() has verified there is a valid mapping
in the primary MMU, else the max level would be PG_LEVEL_4K.  So the odds of
getting a false positive and effectively pre-faulting memory the guest isn't using
are quite small.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs"
  2024-08-16 22:47   ` Sean Christopherson
@ 2024-08-16 23:35     ` David Matlack
  2024-08-22 15:10       ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2024-08-16 23:35 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm

On Fri, Aug 16, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Aug 05, 2024, David Matlack wrote:
> > This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.
> >
> > Bring back the logic that walks down to leafs when zapping collapsible
> > SPTEs. Stepping down to leafs is technically unnecessary when zapping,
> > but the leaf SPTE will be used in a subsequent commit to construct a
> > huge SPTE and recover the huge mapping in place.
>
> Please no, getting rid of the step-up code made me so happy. :-D
>
> It's also suboptimal, e.g. in the worst case scenario (which is comically
> unlikely, but theoretically possible), if the first present leaf SPTE in a 1GiB
> region is the last SPTE in the last 2MiB range, and all non-leaf SPTEs are
> somehow present (this is the super unlikely part), then KVM will read 512*512
> SPTEs before encountering a shadow-present leaf SPTE.

Haha, I didn't consider that case. That seems extremely unlikely. And
even if it did happen, KVM is holding mmu_lock for read and is able to
drop the lock and yield. So I don't see a strong need to structure the
approach around that edge case.

>
> The proposed approach will also ignore nx_huge_page_disallowed, and just always
> create a huge NX page.  On the plus side, "free" NX hugepage recovery!  The
> downside is that it means KVM is pretty much guaranteed to force the guest to
> re-fault all of its code pages, and zap a non-trivial number of huge pages that
> were just created.  IIRC, we deliberately did that for the zapping case, e.g. to
> use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit
> different than zap+create (if the guest is still using the region for code).

I'm ok with skipping nx_huge_page_disallowed pages during disable-dirty-log.

But why is recovering in-place worse/different than zapping? They
both incur the same TLB flushes. And recovering might even result in
less vCPU faults, since vCPU faults use tdp_mmu_split_huge_page() to
install a fully populated lower level page table (vs faulting on a
zapped mapping will just install a 4K SPTE).

>
> So rather than looking for a present leaf SPTE, what about "stopping" as soon as
> KVM finds a SP that can be replaced with a huge SPTE, pre-checking
> nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new
> SPTE?  Or maybe even use kvm_tdp_map_page()?  Though it might be more work to
> massage kvm_tdp_map_page() into a usable form.

My intuition is that going through the full page fault flow would be
more expensive than just stepping down+up in 99.99999% of cases. And
will require more code churn. So it doesn't seem like a win?

>
> By virtue of there being a present non-leaf SPTE, KVM knows the guest accessed
> the region at some point.  And now that MTRR virtualization is gone, the only time
> KVM zaps _only_ leafs is for mmu_notifiers and the APIC access page, i.e. the
> odds of stepping down NOT finding a present SPTE somewhere in the region is very
> small.  Lastly, kvm_mmu_max_mapping_level() has verified there is a valid mapping
> in the primary MMU, else the max level would be PG_LEVEL_4K.  So the odds of
> getting a false positive and effectively pre-faulting memory the guest isn't using
> are quite small.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs"
  2024-08-16 23:35     ` David Matlack
@ 2024-08-22 15:10       ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2024-08-22 15:10 UTC (permalink / raw)
  To: David Matlack; +Cc: Paolo Bonzini, kvm

On Fri, Aug 16, 2024, David Matlack wrote:
> On Fri, Aug 16, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > The proposed approach will also ignore nx_huge_page_disallowed, and just always
> > create a huge NX page.  On the plus side, "free" NX hugepage recovery!  The
> > downside is that it means KVM is pretty much guaranteed to force the guest to
> > re-fault all of its code pages, and zap a non-trivial number of huge pages that
> > were just created.  IIRC, we deliberately did that for the zapping case, e.g. to
> > use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit
> > different than zap+create (if the guest is still using the region for code).
> 
> I'm ok with skipping nx_huge_page_disallowed pages during disable-dirty-log.
> 
> But why is recovering in-place worse/different than zapping? They
> both incur the same TLB flushes. And recovering might even result in
> less vCPU faults, since vCPU faults use tdp_mmu_split_huge_page() to
> install a fully populated lower level page table (vs faulting on a
> zapped mapping will just install a 4K SPTE).

Doh, never mind, I was thinking zapping collapsible SPTEs zapped leafs to induce
faults, but it does the opposite and zaps at the level KVM thinks can be huge.

> > So rather than looking for a present leaf SPTE, what about "stopping" as soon as
> > KVM finds a SP that can be replaced with a huge SPTE, pre-checking
> > nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new
> > SPTE?  Or maybe even use kvm_tdp_map_page()?  Though it might be more work to
> > massage kvm_tdp_map_page() into a usable form.
> 
> My intuition is that going through the full page fault flow would be
> more expensive than just stepping down+up in 99.99999% of cases.

Hmm, yeah, doing the full fault flow isn't the right place to hook in.  Ugh, right,
and it's the whole problem with not having a vCPU for tdp_mmu_map_handle_target_level().
But that's solvable as it's really just is_rsvd_spte(), which I would be a-ok
skipping.  Ah, no, host_writable is also problematic.  Blech.

That's solvable too, e.g. host_pfn_mapping_level() could get the protection, but
that would require checking for an ongoing mmu_notifier invalidation.  So again,
probably not worth it.  Double blech.

> And will require more code churn.

I'm not terribly concerned with code churn.  I care much more about long-term
maintenance, and especially about having multiple ways of doing the same thing
(installing a shadow-present leaf SPTE).  But I agree that trying to remedy that
last point (similar flows) is probably a fool's errand in this case, as creating
a new SPTE from scratch really is different than up-leveling an existing SPTE.

I still have concerns about the step-up code, but I'll respond to those in the
context of the patch I think is problematic.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery
  2024-08-05 23:31 ` [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery David Matlack
@ 2024-08-22 16:50   ` Sean Christopherson
  2024-08-22 18:35     ` David Matlack
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-08-22 16:50 UTC (permalink / raw)
  To: David Matlack; +Cc: Paolo Bonzini, kvm

On Mon, Aug 05, 2024, David Matlack wrote:
> Recheck the iter.old_spte still points to a page table when recovering
> huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> another CPU in between stepping down and back up.
> 
> This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> recovery worker, or vCPUs taking faults on the huge page region).
> 
> This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> thus can see a different value, which is not immediately obvious when
> reading the code.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 07d5363c9db7..bdc7fd476721 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
>  		while (max_mapping_level > iter.level)
>  			tdp_iter_step_up(&iter);
>  
> +		/*
> +		 * Re-check that iter.old_spte still points to a page table.
> +		 * Since mmu_lock is held for read and tdp_iter_step_up()
> +		 * re-reads iter.sptep, it's possible the SPTE was zapped or
> +		 * recovered by another CPU in between stepping down and
> +		 * stepping back up.
> +		 */
> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    is_last_spte(iter.old_spte, iter.level))
> +			continue;

This is the part of the step-up logic that I do not like.  Even this check doesn't
guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
only because of several very subtle mechanisms.

kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
requires knowing/remembering that KVM disallows huge pages when a gfn is write-
tracked, and relies on that never changing (which is a fairly safe bet, but the
behavior isn't fully set in stone).

And the presence of a shadow-present leaf SPTE ensures there are no in-flight
mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
return until all relevant leaf SPTEs have been zapped.

And even more subtly, recover_huge_pages_range() can install a huge SPTE while
tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
tdp_mmu_zap_leafs() yields.   That's again safe only because upon regaining
control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
the new huge SPTE.

So while I'm pretty sure this logic is functionally ok, its safety is extremely
dependent on a number of behaviors in KVM.

That said, I can't tell which option I dislike less.  E.g. we could do something
like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
primary MMU's PTE/PMD/PUD.  Ideally, KVM would use GUP, but then KVM wouldn't
be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.

I don't love this either, primarily because not using GUP makes this yet another
custom flow, i.e. isn't any less tricky than reusing a child SPTE.  It does have
the advantage of not having to find a shadow-present child though, i.e. is likely
the most performant option.  I agree that that likely doesn't matter in practice,
especially since the raw latency of disabling dirty logging isn't terribly
important.

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
		    !is_shadow_present_pte(iter.old_spte))
			continue;

		/* 
		 * TODO: this should skip to the end of the parent, because if
		 * the first SPTE can't be made huger, then no SPTEs at this
		 * level can be huger.
		 */
		if (is_last_spte(iter.old_spte, iter.level))
			continue;

		/*
		 * If iter.gfn resides outside of the slot, i.e. the page for
		 * the current level overlaps but is not contained by the slot,
		 * then the SPTE can't be made huge.  More importantly, trying
		 * to query that info from slot->arch.lpage_info will cause an
		 * out-of-bounds access.
		 */
		if (iter.gfn < start || iter.gfn >= end)
			continue;

		if (kvm_mmu_name_tbd(kvm, slot, iter.gfn, &pfn, &writable,
				     &max_mapping_level))
			continue;

		/*
		 * If the range is being invalidated, do not install a SPTE as
		 * KVM may have already zapped this specific gfn, e.g. if KVM's
		 * unmapping has run to completion, but the primary MMU hasn't
		 * zapped its PTEs.  There is no need to check for *past*
		 * invalidations, because all information is gathered while
		 * holding mmu_lock, i.e. it can't have become stale due to an
		 * entire mmu_notifier invalidation sequence completing.
		 */
		if (mmu_invalidate_retry_gfn(kvm, kvm->mmu_invalidate_seq, iter.gfn))
			continue;

		/*
		 * KVM disallows huge pages for write-protected gfns, it should
		 * impossible for make_spte() to encounter such a gfn since
		 * write-protecting a gfn requires holding mmu_lock for write.
		 */
		flush |= __tdp_mmu_map_gfn(...);
		WARN_ON_ONCE(r == RET_PF_EMULATE);
	}

	rcu_read_unlock();

Assuming you don't like the above idea (I'm not sure I do), what if instead of
doing the step-up, KVM starts a second iteration using the shadow page it wants
to replace as the root of the walk?

This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
ordering with respect to an mmu_notifier invalidation, but it at least avoids
having to reason about the correctness of re-reading a SPTE and modifying the
iteration level within the body of an iteration loop.

It should also yield smaller diffs overall, e.g. no revert, no separate commit
to recheck the SPTE, etc.  And I believe it's more performant than the step-up
approach when there are SPTE that _can't_ be huge, as KVM won't traverse down
into the leafs in that case.

An alternative to the tdp_mmu_iter_needs_resched() logic would be to pass in
&flush, but I think that ends up being more confusing and harder to follow.

static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
				  u64 *huge_spte)
{
	struct kvm_mmu_page *root = sptep_to_sp(parent->sptep);
	gfn_t start = parent->gfn;
	gfn_t end = start + ???; /* use parent's level */
	struct tdp_iter iter;

	tdp_root_for_each_leaf_pte(iter, root, start, end) 	{
		/*
		 * Use the parent iterator when checking for forward progress,
		 * so that KVM doesn't get stuck due to always yielding while
		 * walking child SPTEs.
		 */
		if (tdp_mmu_iter_needs_resched(kvm, parent))
			return -EAGAIN;

		*huge_spte = make_huge_spte(kvm, iter.old_spte);
		return 0;
	}

	return -ENOENT;
}

static void recover_huge_pages_range(struct kvm *kvm,
				     struct kvm_mmu_page *root,
				     const struct kvm_memory_slot *slot)
{
	gfn_t start = slot->base_gfn;
	gfn_t end = start + slot->npages;
	struct tdp_iter iter;
	int max_mapping_level;
	bool flush = false;
	u64 huge_spte;

	if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
		return;

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
restart:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
			flush = false;
			continue;
		}

		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
		    !is_shadow_present_pte(iter.old_spte))
			continue;

                /*
                 * If iter.gfn resides outside of the slot, i.e. the page for
                 * the current level overlaps but is not contained by the slot,
                 * then the SPTE can't be made huge.  More importantly, trying
                 * to query that info from slot->arch.lpage_info will cause an
                 * out-of-bounds access.
                 */
                if (iter.gfn < start || iter.gfn >= end)
                        continue;

                max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
                if (max_mapping_level < iter.level)
                        continue;

		r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
		if (r == -EAGAIN)
			goto restart;
		else if (r)
			continue;

		/*
		 * If setting the SPTE fails, e.g. because it has been modified
		 * by a different task, iteration will naturally continue with
		 * the next SPTE.  Don't bother retrying this SPTE, races are
		 * uncommon and odds are good the SPTE 
		 */
		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
			flush = true;
	}

	if (flush)
		kvm_flush_remote_tlbs_memslot(kvm, slot);

	rcu_read_unlock();
}

static inline bool tdp_mmu_iter_needs_resched(struct kvm *kvm,
					     struct tdp_iter *iter)
{
	/* Ensure forward progress has been made before yielding. */
	return iter->next_last_level_gfn != iter->yielded_gfn &&
	       (need_resched() || rwlock_needbreak(&kvm->mmu_lock));

}

/*
 * Yield if the MMU lock is contended or this thread needs to return control
 * to the scheduler.
 *
 * If this function should yield and flush is set, it will perform a remote
 * TLB flush before yielding.
 *
 * If this function yields, iter->yielded is set and the caller must skip to
 * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
 * over the paging structures to allow the iterator to continue its traversal
 * from the paging structure root.
 *
 * Returns true if this function yielded.
 */
static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
							  struct tdp_iter *iter,
							  bool flush, bool shared)
{
	WARN_ON_ONCE(iter->yielded);

	if (!tdp_mmu_iter_needs_resched(kvm, iter))
		return false;

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	rcu_read_unlock();

	if (shared)
		cond_resched_rwlock_read(&kvm->mmu_lock);
	else
		cond_resched_rwlock_write(&kvm->mmu_lock);

	rcu_read_lock();

	WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);

	iter->yielded = true;
	return true;
}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery
  2024-08-22 16:50   ` Sean Christopherson
@ 2024-08-22 18:35     ` David Matlack
  2024-08-22 20:11       ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2024-08-22 18:35 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm

On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Aug 05, 2024, David Matlack wrote:
> > Recheck that iter.old_spte still points to a page table when recovering
> > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > another CPU in between stepping down and back up.
> >
> > This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > recovery worker, or vCPUs taking faults on the huge page region).
> >
> > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > thus can see a different value, which is not immediately obvious when
> > reading the code.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 07d5363c9db7..bdc7fd476721 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> >               while (max_mapping_level > iter.level)
> >                       tdp_iter_step_up(&iter);
> >
> > +             /*
> > +              * Re-check that iter.old_spte still points to a page table.
> > +              * Since mmu_lock is held for read and tdp_iter_step_up()
> > +              * re-reads iter.sptep, it's possible the SPTE was zapped or
> > +              * recovered by another CPU in between stepping down and
> > +              * stepping back up.
> > +              */
> > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > +                 is_last_spte(iter.old_spte, iter.level))
> > +                     continue;
>
> This is the part of the step-up logic that I do not like.  Even this check doesn't
> guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
> could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
> only because of several very subtle mechanisms.

I'm not sure why that matters. The only thing that matters is that the
GFN->PFN and permissions cannot change, and that is guaranteed by
holding mmu_lock for read.

At the end of the day, we never actually care about the value of the
SPTE we are replacing. We only care that it's a non-leaf SPTE.
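
Concretely, the check being relied on boils down to something like the
helper below (purely an illustrative sketch on my end, not code from the
series; the helper name is made up, is_shadow_present_pte() and
is_last_spte() are the existing SPTE helpers):

	static bool spte_points_to_page_table(u64 spte, int level)
	{
		/*
		 * Huge page recovery only needs the SPTE it is replacing to
		 * still be present and still point at a lower-level page
		 * table, i.e. to not be a leaf at this level.
		 */
		return is_shadow_present_pte(spte) &&
		       !is_last_spte(spte, level);
	}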

>
> kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
> requires knowing/remembering that KVM disallows huge pages when a gfn is write-
> tracked, and relies on that never changing (which is a fairly safe bet, but the
> behavior isn't fully set in stone).
>
> And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> return until all relevant leaf SPTEs have been zapped.

As you point out in the next paragraph there could be an inflight
invalidate that yielded, no?

>
> And even more subtly, recover_huge_pages_range() can install a huge SPTE while
> tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
> SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
> tdp_mmu_zap_leafs() yields.   That's again safe only because upon regaining
> control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
> the new huge SPTE.

I agree it's subtle, but only in the sense that the TDP MMU is subtle.
Restarting at the root after dropping mmu_lock is a fundamental
concept in the TDP MMU.
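
For reference, the restart itself is handled by tdp_iter_next(): per the
comment on tdp_mmu_iter_cond_resched(), once iter->yielded is set the next
advance throws away the current position and re-walks from the root.
Roughly (paraphrasing the tdp_iter code from memory, not quoting it):

	void tdp_iter_next(struct tdp_iter *iter)
	{
		if (iter->yielded) {
			/* Restart the walk from the root at the saved gfn. */
			tdp_iter_restart(iter);
			return;
		}
		/* ... otherwise step to the next SPTE as usual ... */
	}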

>
> So while I'm pretty sure this logic is functionally ok, its safety is extremely
> dependent on a number of behaviors in KVM.
>
> That said, I can't tell which option I dislike less.  E.g. we could do something
> like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
> primary MMU's PTE/PMD/PUD.  Ideally, KVM would use GUP, but then KVM wouldn't
> be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.
>
> I don't love this either, primarily because not using GUP makes this yet another
> custom flow

Yeah. I don't like having the huge page recovery path needing its own
special way to construct SPTEs from scratch. e.g. I could see this
approach becoming a problem if KVM gains support for R/W/X GFN
attributes.

> i.e. isn't any less tricky than reusing a child SPTE.  It does have
> the advantage of not having to find a shadow-present child though, i.e. is likely
> the most performant option.  I agree that that likely doesn't matter in practice,
> especially since the raw latency of disabling dirty logging isn't terribly
> important.
>
>         rcu_read_lock();
>
>         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
> retry:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>                         continue;
>
>                 if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
>                     !is_shadow_present_pte(iter.old_spte))
>                         continue;
>
>                 /*
>                  * TODO: this should skip to the end of the parent, because if
>                  * the first SPTE can't be made huger, then no SPTEs at this
>                  * level can be huger.
>                  */
>                 if (is_last_spte(iter.old_spte, iter.level))
>                         continue;
>
>                 /*
>                  * If iter.gfn resides outside of the slot, i.e. the page for
>                  * the current level overlaps but is not contained by the slot,
>                  * then the SPTE can't be made huge.  More importantly, trying
>                  * to query that info from slot->arch.lpage_info will cause an
>                  * out-of-bounds access.
>                  */
>                 if (iter.gfn < start || iter.gfn >= end)
>                         continue;
>
>                 if (kvm_mmu_name_tbd(kvm, slot, iter.gfn, &pfn, &writable,
>                                      &max_mapping_level))
>                         continue;
>
>                 /*
>                  * If the range is being invalidated, do not install a SPTE as
>                  * KVM may have already zapped this specific gfn, e.g. if KVM's
>                  * unmapping has run to completion, but the primary MMU hasn't
>                  * zapped its PTEs.  There is no need to check for *past*
>                  * invalidations, because all information is gathered while
>                  * holding mmu_lock, i.e. it can't have become stale due to an
>                  * entire mmu_notifier invalidation sequence completing.
>                  */
>                 if (mmu_invalidate_retry_gfn(kvm, kvm->mmu_invalidate_seq, iter.gfn))
>                         continue;
>
>                 /*
>                  * KVM disallows huge pages for write-protected gfns, so it
>                  * should be impossible for make_spte() to encounter such a gfn since
>                  * write-protecting a gfn requires holding mmu_lock for write.
>                  */
>                 flush |= __tdp_mmu_map_gfn(...);
>                 WARN_ON_ONCE(r == RET_PF_EMULATE);
>         }
>
>         rcu_read_unlock();
>
> Assuming you don't like the above idea (I'm not sure I do), what if instead of
> doing the step-up, KVM starts a second iteration using the shadow page it wants
> to replace as the root of the walk?
>
> This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
> ordering with respect to an mmu_notifier invalidation, but it at least avoids
> having to reason about the correctness of re-reading a SPTE and modifying the
> iteration level within the body of an interation loop.
>
> It should also yield smaller diffs overall, e.g. no revert, no separate commit
> to recheck the SPTE, etc.  And I believe it's more performant than the step-up
> approach when there are SPTEs that _can't_ be huge, as KVM won't traverse down
> into the leafs in that case.

This approach looks good to me. I'll try it out and see if I run into
any issues.

>
> An alternative to the tdp_mmu_iter_needs_resched() logic would be to pass in
> &flush, but I think that ends up being more confusing and harder to follow.

Yeah I think that would be more complicated. If we drop mmu_lock then
we need to re-check kvm_mmu_max_mapping_level() and restart at the
root.

>
> static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
>                                   u64 *huge_spte)
> {
>         struct kvm_mmu_page *root = sptep_to_sp(parent->sptep);
>         gfn_t start = parent->gfn;
>         gfn_t end = start + ???; /* use parent's level */
>         struct tdp_iter iter;
>
>         tdp_root_for_each_leaf_pte(iter, root, start, end)      {
>                 /*
>                  * Use the parent iterator when checking for forward progress,
>                  * so that KVM doesn't get stuck due to always yielding while
>                  * walking child SPTEs.
>                  */
>                 if (tdp_mmu_iter_needs_resched(kvm, parent))
>                         return -EAGAIN;
>
>                 *huge_spte = make_huge_spte(kvm, iter.old_spte);
>                 return 0;
>         }
>
>         return -ENOENT;
> }
>
> static void recover_huge_pages_range(struct kvm *kvm,
>                                      struct kvm_mmu_page *root,
>                                      const struct kvm_memory_slot *slot)
> {
>         gfn_t start = slot->base_gfn;
>         gfn_t end = start + slot->npages;
>         struct tdp_iter iter;
>         int max_mapping_level;
>         bool flush = false;
>         u64 huge_spte;
>         int r;
>
>         if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
>                 return;
>
>         rcu_read_lock();
>
>         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
> restart:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
>                         flush = false;
>                         continue;
>                 }
>
>                 if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
>                     !is_shadow_present_pte(iter.old_spte))
>                         continue;
>
>                 /*
>                  * If iter.gfn resides outside of the slot, i.e. the page for
>                  * the current level overlaps but is not contained by the slot,
>                  * then the SPTE can't be made huge.  More importantly, trying
>                  * to query that info from slot->arch.lpage_info will cause an
>                  * out-of-bounds access.
>                  */
>                 if (iter.gfn < start || iter.gfn >= end)
>                         continue;
>
>                 max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
>                 if (max_mapping_level < iter.level)
>                         continue;
>
>                 r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
>                 if (r == -EAGAIN)
>                         goto restart;
>                 else if (r)
>                         continue;
>
>                 /*
>                  * If setting the SPTE fails, e.g. because it has been modified
>                  * by a different task, iteration will naturally continue with
>                  * the next SPTE.  Don't bother retrying this SPTE, races are
>                  * uncommon and odds are good the SPTE
>                  */
>                 if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
>                         flush = true;
>         }
>
>         if (flush)
>                 kvm_flush_remote_tlbs_memslot(kvm, slot);
>
>         rcu_read_unlock();
> }
>
> static inline bool tdp_mmu_iter_needs_resched(struct kvm *kvm,
>                                              struct tdp_iter *iter)
> {
>         /* Ensure forward progress has been made before yielding. */
>         return iter->next_last_level_gfn != iter->yielded_gfn &&
>                (need_resched() || rwlock_needbreak(&kvm->mmu_lock));
>
> }
>
> /*
>  * Yield if the MMU lock is contended or this thread needs to return control
>  * to the scheduler.
>  *
>  * If this function should yield and flush is set, it will perform a remote
>  * TLB flush before yielding.
>  *
>  * If this function yields, iter->yielded is set and the caller must skip to
>  * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
>  * over the paging structures to allow the iterator to continue its traversal
>  * from the paging structure root.
>  *
>  * Returns true if this function yielded.
>  */
> static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
>                                                           struct tdp_iter *iter,
>                                                           bool flush, bool shared)
> {
>         WARN_ON_ONCE(iter->yielded);
>
>         if (!tdp_mmu_iter_needs_resched(kvm, iter))
>                 return false;
>
>         if (flush)
>                 kvm_flush_remote_tlbs(kvm);
>
>         rcu_read_unlock();
>
>         if (shared)
>                 cond_resched_rwlock_read(&kvm->mmu_lock);
>         else
>                 cond_resched_rwlock_write(&kvm->mmu_lock);
>
>         rcu_read_lock();
>
>         WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);
>
>         iter->yielded = true;
>         return true;
> }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery
  2024-08-22 18:35     ` David Matlack
@ 2024-08-22 20:11       ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2024-08-22 20:11 UTC (permalink / raw)
  To: David Matlack; +Cc: Paolo Bonzini, kvm

On Thu, Aug 22, 2024, David Matlack wrote:
> On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Aug 05, 2024, David Matlack wrote:
> > > Recheck that iter.old_spte still points to a page table when recovering
> > > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > > another CPU in between stepping down and back up.
> > >
> > > This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > > recovery worker, or vCPUs taking faults on the huge page region).
> > >
> > > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > > thus can see a different value, which is not immediately obvious when
> > > reading the code.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 07d5363c9db7..bdc7fd476721 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> > >               while (max_mapping_level > iter.level)
> > >                       tdp_iter_step_up(&iter);
> > >
> > > +             /*
> > > +              * Re-check that iter.old_spte still points to a page table.
> > > +              * Since mmu_lock is held for read and tdp_iter_step_up()
> > > +              * re-reads iter.sptep, it's possible the SPTE was zapped or
> > > +              * recovered by another CPU in between stepping down and
> > > +              * stepping back up.
> > > +              */
> > > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > > +                 is_last_spte(iter.old_spte, iter.level))
> > > +                     continue;
> >
> > This is the part of the step-up logic that I do not like.  Even this check doesn't
> > guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> > used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
> > could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
> > only because of several very subtle mechanisms.
> 
> I'm not sure why that matters. The only thing that matters is that the
> GFN->PFN and permissions cannot change, and that is guaranteed by
> holding mmu_lock for read.

Because it introduces possible edge cases that, while benign, require the reader
to think about and reason through.  E.g. if _this_ task is trying to replace a
4KiB page with a 1GiB page, what happens if some other task replaces the parent
2MiB page?  It's not immediately obvious that looping on tdp_iter_step_up() will
happily blaze past a !PRESENT SPTE, which might not even be the actual SPTE in
the tree at this point.
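
To illustrate (approximate names, from memory, not the exact code this
series re-adds): each step up moves the iterator to the parent entry and
re-reads whatever is currently there, which is where the stale and/or
!PRESENT value can come from.

	void tdp_iter_step_up(struct tdp_iter *iter)
	{
		WARN_ON_ONCE(iter->level == iter->root_level);

		iter->level++;
		iter->gfn = gfn_round_for_level(iter->gfn, iter->level);
		/* Recompute sptep for the parent and re-read its SPTE. */
		tdp_iter_refresh_sptep(iter);
	}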

> At the end of the day, we never actually care about the value of the
> SPTE we are replacing. We only care that it's a non-leaf SPTE.
>
> > kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> > in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
> > requires knowing/remembering that KVM disallows huge pages when a gfn is write-
> > tracked, and relies on that never changing (which is a fairly safe bet, but the
> > behavior isn't fully set in stone).
> >
> > And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> > mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> > return until all relevant leaf SPTEs have been zapped.
> 
> As you point out in the next paragraph there could be an inflight invalidate
> that yielded, no?

Yeah, ignore this, I forgot to amend it after remembering that invalidation can
yield.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-05 23:31 [PATCH 0/7] KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log David Matlack
2024-08-05 23:31 ` [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs" David Matlack
2024-08-16 22:47   ` Sean Christopherson
2024-08-16 23:35     ` David Matlack
2024-08-22 15:10       ` Sean Christopherson
2024-08-05 23:31 ` [PATCH 2/7] KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level() David Matlack
2024-08-05 23:31 ` [PATCH 3/7] KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs David Matlack
2024-08-05 23:31 ` [PATCH 4/7] KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping David Matlack
2024-08-05 23:31 ` [PATCH 5/7] KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte() David Matlack
2024-08-05 23:31 ` [PATCH 6/7] KVM: x86/mmu: WARN if huge page recovery triggered during dirty logging David Matlack
2024-08-05 23:31 ` [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery David Matlack
2024-08-22 16:50   ` Sean Christopherson
2024-08-22 18:35     ` David Matlack
2024-08-22 20:11       ` Sean Christopherson
