* [PATCH v3 00/17] TDX MMU prep series part 1
@ 2024-06-19 22:35 Rick Edgecombe
2024-06-19 22:35 ` [PATCH v3 01/17] KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE Rick Edgecombe
` (16 more replies)
0 siblings, 17 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:35 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe
Hi,
This is v3 of the TDX MMU prep series, split out of the giant 130 patch
TDX base enabling series [0]. It focuses on the changes to the x86 MMU
needed to support TDX's separation of private/shared EPT into separate
roots. A future breakout series will include the changes to actually
interact with the TDX module to map private memory. The purpose of
sending out a smaller series is to focus review, and hopefully iterate
rapidly. We would like the series to go into kvm-coco-queue when it is
ready.
I think the patches are in pretty good shape at this point.
There is a larger team working on TDX KVM base enabling. The patches were
originally authored by Sean Christopherson and Isaku Yamahata, but the
series especially represents the work of Isaku, Yan Y Zhao, and myself.
The series has been tested as part of a development branch for the TDX base
series [1]. The testing of this series consists of TDX kvm-unit-tests [2],
regular KVM selftests, and booting a Linux TD. Rebasing the TDX selftests
is still in progress.
Updates from v2
===============
For v2, Paolo did an extensive review. Most of that feedback was
non-functional, but internally we found two issues with the TDP MMU
iterator changes implemented in v2:
- Shared bits not applied in try_step_side(), which affects the math in
4-level paging
- Shared bit passed into tdp_iter_start() for fast page fault path
Besides those fixes, some other changes of note:
- "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" is
pushed out of this series for later.
- New patch "KVM: x86/tdp_mmu: Take a GFN in
kvm_tdp_mmu_fast_pf_get_last_sptep()" to fix one of the previously
mentioned bugs.
- New patch "KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE", which
Paolo mentioned might be a candidate for 6.11
With respect to other series in progress:
- The "Introduce a quirk to control memslot zap behavior" [3] series will
need a change in order to work with this. The attr_filter field
(previously process) added in this series will need to be set in that
series' kvm_mmu_zap_memslot_leafs() function.
I will post an additional patch that goes on top of that series and
can be added when they are combined. Any branch with either of those
series should be fine functionally until we can create VMs of the TDX
type.
- The series to make max gfn configurable is still outstanding, but not
required functionally.
This series is on top of kvm-coco-queue and two fixes to that branch. The
only critical one is the patch at [4].
Private/shared memory in TDX
============================
Confidential computing solutions have concepts of private and shared
memory. Often the guest accesses either private or shared memory via a bit
in the PTE. Solutions like SEV treat this bit more like a permission bit,
whereas solutions like TDX and ARM CCA treat it more like a GPA bit. In the
latter case, the host maps private memory in one half of the address space
and shared in another. For TDX these two halves are mapped by different
EPT roots. The private half (also called Secure EPT in Intel
documentation) gets managed by the privileged TDX Module. The shared half
is managed by the untrusted part of the VMM (KVM).
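The "GPA bit" style of split described above can be sketched in a few
lines of C. This is an illustration only, not KVM code; the bit position
below is an assumption for the example (in TDX the shared bit depends on
the guest's GPA width), and the helper names are invented here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed shared-bit position, for illustration only. */
#define SHARED_BIT_POS	51ULL
#define SHARED_BIT	(1ULL << SHARED_BIT_POS)

/* A GPA with the shared bit set lands in the shared half. */
static inline bool gpa_is_shared(uint64_t gpa)
{
	return (gpa & SHARED_BIT) != 0;
}

/* Both halves alias the same GFN once the shared bit is masked off. */
static inline uint64_t gpa_to_gfn(uint64_t gpa)
{
	return (gpa & ~SHARED_BIT) >> 12;
}
```

The point of the helpers is that the two aliases differ only in the one
bit, so the same GFN-indexed bookkeeping can serve both halves.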
In addition to the separate roots for private and shared, there are
limitations on what operations can be done on the private side. TDX wants
to protect against protected memory being reset or otherwise scrambled by
the host. In order to prevent this, the guest has to take specific action
to “accept” memory after changes are made by the VMM to the private EPT.
This prevents the VMM from performing many of the usual memory management
operations that involve zapping and refaulting memory. The private memory
also is always RWX and cannot have VMM-specified cache attributes
applied.
TDX KVM MMU Design For Private Memory
=====================================
Private/shared split
--------------------
The operations that actually change the private half of the EPT are
limited and relatively slow compared to reading a PTE. For this reason the
design for KVM is to keep a “mirrored” copy of the private EPT in KVM’s
memory. This will allow KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
To clarify the definitions of the three EPT trees at this point:
external EPT - Protected by the TDX module, modified via TDX module
calls.
mirror EPT - Bookkeeping tree used as an optimization by KVM, not
mapped.
shared EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
It’s worth noting that we are making an effort to remove optimizations
that have complexity for the base enabling. Although keeping a mirrored
copy of the private page tables kind of fits into that category, it has
been so fundamental to the design for so long, dropping it would be too
disruptive.
Mirror EPT
----------
The mirror EPT needs to keep a mirrored version of the private EPT
maintained in the TDX module in order to be able to find out whether a
GPA's mid-level page tables have already been installed. So this mirrored
copy has the same structure as the private EPT: a page table is present
in the mirror EPT for every GPA range and level where a page table is
present in the private EPT. The private page tables also cannot be zapped
while the range has anything mapped, so the mirror/private page tables
need to be protected from KVM operations that zap any non-leaf PTEs, for
example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
Modifications to the mirror page tables also need to perform the same
operations on the private page tables. The actual TDX module calls to do
this are not covered in this prep series.
SPs for private page tables are tracked with a role bit for convenience.
(Note to reviewers: please consider whether this is really needed.)
Zapping Changes
---------------
For normal VMs, guest memory is zapped for several reasons: user memory
getting paged out by the host, memslots getting deleted, or
virtualization operations like MTRR updates and attachment of
non-coherent DMA.
For TDX (and SNP) there is also zapping associated with the conversion of
memory between shared and private. These operations need to take care to
do two things:
1. Not zap any private memory that is in use by the guest.
2. Not zap any memory alias unnecessarily (i.e. don't zap anything more
than needed). The purpose of this is to avoid unnecessary behavior that
userspace could grow to rely on.
For 1, this is possible because the zapping that is out of the control of
KVM/userspace (paging out of userspace memory) will only apply to shared
memory. guest_memfd operations are protected from MMU notifier
operations. During TD runtime, zapping of private memory will only be from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
For 2, KVM needs to be taught which operations will operate on which
aliases. An enum-based scheme is introduced such that operations can
target specific aliases like:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
MTRRs, etc - Zapping will be avoided altogether
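The alias targeting above can be sketched as follows. The KVM_FILTER_*
names mirror the enum added later in this series, but zap_range() is a
hypothetical stand-in for the real zapping paths, used only to show how
the disallowed default value (0) catches callers that forgot to set a
filter.

```c
#include <stdbool.h>

enum kvm_gfn_range_filter {
	KVM_FILTER_SHARED	= 1 << 0,
	KVM_FILTER_PRIVATE	= 1 << 1,
};

/*
 * Hypothetical zap dispatch: report which aliases a given filter would
 * touch. Returns false for the disallowed default value (0), where KVM
 * would instead WARN_ON_ONCE() to catch an uninitialized filter.
 */
static bool zap_range(enum kvm_gfn_range_filter filter,
		      bool *zap_shared, bool *zap_private)
{
	if (!filter)
		return false;
	*zap_shared = filter & KVM_FILTER_SHARED;
	*zap_private = filter & KVM_FILTER_PRIVATE;
	return true;
}
```

For example, memslot deletion would pass KVM_FILTER_SHARED |
KVM_FILTER_PRIVATE, while an MMU notifier zap would pass only
KVM_FILTER_SHARED.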
For zapping arising from other virtualization based operations, there are
four scenarios:
1. MTRR update
2. CR0.CD update
3. APICv update
4. Non-coherent DMA status update
KVM TDX will not support 1-3; future changes (after this series) will
disable these features for TDX. For 4, there isn't an easy way to not
support the feature, as the notification is just passed to KVM and it
has to act accordingly. However, other proposed changes [5] will avoid
the need for zapping on non-coherent DMA notification for selfsnoop
CPUs, so KVM can follow this logic and just always honor guest PAT for
shared memory.
Atomically updating private EPT
-------------------------------
Although this prep series does not interact with the TDX module at all to
actually configure the private EPT, it does lay the groundwork for doing
so. In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module, but there is one
tricky property worth elaborating on: how to handle that the TDP MMU
allows modification of PTEs with the mmu_lock held only for read, and
uses the PTEs themselves to perform synchronization. While operating on a
single PTE can be done atomically, operating on both the mirror and
private PTEs at the same time requires an additional solution. To handle
this situation, FROZEN_SPTE (renamed from REMOVED_SPTE in this series) is
used to prevent concurrent operations while a call to the TDX module
updates the private EPT.
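The freeze protocol can be sketched with C11 atomics standing in for
KVM's SPTE cmpxchg helpers. This is a simplified illustration:
tdx_module_update() is a hypothetical placeholder for the real TDX module
call, and SHADOW_NONPRESENT_VALUE is reduced to zero for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SHADOW_NONPRESENT_VALUE	0ULL	/* simplified for the example */
#define FROZEN_SPTE		(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)

/* Placeholder for the call that updates the external (private) EPT. */
static void tdx_module_update(void) { }

/*
 * Freeze the SPTE, update the external EPT, then publish the final
 * value. Fails (and the caller retries) if another thread modified or
 * froze the SPTE first.
 */
static bool freeze_and_set_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
				uint64_t new_spte)
{
	/* Other threads that see FROZEN_SPTE must not modify the SPTE. */
	if (!atomic_compare_exchange_strong(sptep, &old_spte, FROZEN_SPTE))
		return false;

	tdx_module_update();

	/*
	 * No other thread will overwrite FROZEN_SPTE, so a plain store
	 * is sufficient to unfreeze with the final value.
	 */
	atomic_store(sptep, new_spte);
	return true;
}
```

This mirrors the shape of tdp_mmu_set_spte_atomic() in the TDP MMU: the
CAS both detects racing writers and blocks them for the duration of the
multi-part update.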
The series is based on kvm-coco-queue.
[0] https://lore.kernel.org/kvm/cover.1708933498.git.isaku.yamahata@intel.com/
[1] https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-06-19
[2] https://lore.kernel.org/kvm/20231218072247.2573516-1-qian.wen@intel.com/
[3] https://lore.kernel.org/kvm/20240613060708.11761-1-yan.y.zhao@intel.com/
[4] https://lore.kernel.org/lkml/20240518000430.1118488-2-seanjc@google.com/
[5] https://lore.kernel.org/kvm/20240309010929.1403984-6-seanjc@google.com/
Isaku Yamahata (13):
KVM: Add member to struct kvm_gfn_range for target alias
KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page
KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_role
KVM: x86/tdp_mmu: Take struct kvm in iter loops
KVM: x86/mmu: Support GFN direct bits
KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()
KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table
type
KVM: x86/tdp_mmu: Take root in tdp_mmu_for_each_pte()
KVM: x86/tdp_mmu: Support mirror root for TDP MMU
KVM: x86/tdp_mmu: Propagate attr_filter to MMU notifier callbacks
KVM: x86/tdp_mmu: Propagate building mirror page tables
KVM: x86/tdp_mmu: Propagate tearing down mirror page tables
KVM: x86/tdp_mmu: Take root types for
kvm_tdp_mmu_invalidate_all_roots()
Rick Edgecombe (4):
KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE
KVM: x86: Add a VM type define for TDX
KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void
KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep()
arch/x86/include/asm/kvm-x86-ops.h | 4 +
arch/x86/include/asm/kvm_host.h | 26 ++-
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/mmu.h | 17 ++
arch/x86/kvm/mmu/mmu.c | 41 +++-
arch/x86/kvm/mmu/mmu_internal.h | 64 +++++-
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 15 +-
arch/x86/kvm/mmu/tdp_iter.c | 10 +-
arch/x86/kvm/mmu/tdp_iter.h | 16 +-
arch/x86/kvm/mmu/tdp_mmu.c | 328 +++++++++++++++++++++--------
arch/x86/kvm/mmu/tdp_mmu.h | 51 ++++-
include/linux/kvm_host.h | 6 +
virt/kvm/guest_memfd.c | 2 +
virt/kvm/kvm_main.c | 14 ++
15 files changed, 481 insertions(+), 116 deletions(-)
base-commit: 698ca1e403579ca00e16a5b28ae4d576d9f1b20e
prerequisite-patch-id: c10f666d6a21c1485a32f393a50f38aa69c25caa
prerequisite-patch-id: a525ee37f0c8f41e74909343517d57a553724152
--
2.34.1
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH v3 01/17] KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
@ 2024-06-19 22:35 ` Rick Edgecombe
2024-06-19 22:35 ` [PATCH v3 02/17] KVM: Add member to struct kvm_gfn_range for target alias Rick Edgecombe
` (15 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:35 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe
Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other
multi-part operations.
REMOVED_SPTE is used as a non-present intermediate value for multi-part
operations that can happen when a thread doesn't have an MMU write lock.
Today, these operations are limited to removing PTEs.
However, future changes will want to use the same concept for setting a
PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it
to FROZEN_SPTE so it can be used for both types of operations.
Also rename the relevant helpers and comments that refer to "removed"
within the context of the SPTE value. Take care to not update naming
referring to the "remove" operations, which are still distinct.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 10 ++++-----
arch/x86/kvm/mmu/tdp_mmu.c | 42 +++++++++++++++++++-------------------
4 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 61982da8c8b2..828c70ead96f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3455,7 +3455,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* available as the vCPU holds a reference to its root(s).
*/
if (WARN_ON_ONCE(!sptep))
- spte = REMOVED_SPTE;
+ spte = FROZEN_SPTE;
if (!is_shadow_present_pte(spte))
break;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index a5e014d7bc62..59cac37615b6 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -383,7 +383,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
* not set any RWX bits.
*/
if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
- WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
+ WARN_ON(mmio_value && (FROZEN_SPTE & mmio_mask) == mmio_value))
mmio_value = 0;
if (!mmio_value)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 5dd5405fa07a..86e5259aa824 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -200,7 +200,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
/*
* If a thread running without exclusive control of the MMU lock must perform a
- * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
+ * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a
* non-present intermediate value. Other threads which encounter this value
* should not modify the SPTE.
*
@@ -210,14 +210,14 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
*
* Only used by the TDP MMU.
*/
-#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
+#define FROZEN_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
-static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
-static inline bool is_removed_spte(u64 spte)
+static inline bool is_frozen_spte(u64 spte)
{
- return spte == REMOVED_SPTE;
+ return spte == FROZEN_SPTE;
}
/* Get an SPTE's index into its parent's page table (and the spt array). */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 36539c1b36cd..16b54208e8d7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -365,8 +365,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* value to the removed SPTE value.
*/
for (;;) {
- old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, REMOVED_SPTE);
- if (!is_removed_spte(old_spte))
+ old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, FROZEN_SPTE);
+ if (!is_frozen_spte(old_spte))
break;
cpu_relax();
}
@@ -397,11 +397,11 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* No retry is needed in the atomic update path as the
* sole concern is dropping a Dirty bit, i.e. no other
* task can zap/remove the SPTE as mmu_lock is held for
- * write. Marking the SPTE as a removed SPTE is not
+ * write. Marking the SPTE as a frozen SPTE is not
* strictly necessary for the same reason, but using
- * the remove SPTE value keeps the shared/exclusive
+ * the frozen SPTE value keeps the shared/exclusive
* paths consistent and allows the handle_changed_spte()
- * call below to hardcode the new value to REMOVED_SPTE.
+ * call below to hardcode the new value to FROZEN_SPTE.
*
* Note, even though dropping a Dirty bit is the only
* scenario where a non-atomic update could result in a
@@ -413,10 +413,10 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* it here.
*/
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
- REMOVED_SPTE, level);
+ FROZEN_SPTE, level);
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
- old_spte, REMOVED_SPTE, level, shared);
+ old_spte, FROZEN_SPTE, level, shared);
}
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
@@ -490,19 +490,19 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
*/
if (!was_present && !is_present) {
/*
- * If this change does not involve a MMIO SPTE or removed SPTE,
+ * If this change does not involve a MMIO SPTE or frozen SPTE,
* it is unexpected. Log the change, though it should not
* impact the guest since both the former and current SPTEs
* are nonpresent.
*/
if (WARN_ON_ONCE(!is_mmio_spte(kvm, old_spte) &&
!is_mmio_spte(kvm, new_spte) &&
- !is_removed_spte(new_spte)))
+ !is_frozen_spte(new_spte)))
pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
"should not be replaced with another,\n"
"different nonpresent SPTE, unless one or both\n"
"are MMIO SPTEs, or the new SPTE is\n"
- "a temporary removed SPTE.\n"
+ "a temporary frozen SPTE.\n"
"as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
as_id, gfn, old_spte, new_spte, level);
return;
@@ -540,7 +540,7 @@ static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
* and pre-checking before inserting a new SPTE is advantageous as it
* avoids unnecessary work.
*/
- WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
+ WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
/*
* Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
@@ -603,26 +603,26 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* in its place before the TLBs are flushed.
*
* Delay processing of the zapped SPTE until after TLBs are flushed and
- * the REMOVED_SPTE is replaced (see below).
+ * the FROZEN_SPTE is replaced (see below).
*/
- ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE);
+ ret = __tdp_mmu_set_spte_atomic(iter, FROZEN_SPTE);
if (ret)
return ret;
kvm_flush_remote_tlbs_gfn(kvm, iter->gfn, iter->level);
/*
- * No other thread can overwrite the removed SPTE as they must either
+ * No other thread can overwrite the frozen SPTE as they must either
* wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not
- * overwrite the special removed SPTE value. Use the raw write helper to
+ * overwrite the special frozen SPTE value. Use the raw write helper to
* avoid an unnecessary check on volatile bits.
*/
__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
/*
* Process the zapped SPTE after flushing TLBs, and after replacing
- * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
- * blocked by the REMOVED_SPTE and reduces contention on the child
+ * FROZEN_SPTE with 0. This minimizes the amount of time vCPUs are
+ * blocked by the FROZEN_SPTE and reduces contention on the child
* SPTEs.
*/
handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
@@ -652,12 +652,12 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
/*
* No thread should be using this function to set SPTEs to or from the
- * temporary removed SPTE value.
+ * temporary frozen SPTE value.
* If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
* should be used. If operating under the MMU lock in write mode, the
- * use of the removed SPTE should not be necessary.
+ * use of the frozen SPTE should not be necessary.
*/
- WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
+ WARN_ON_ONCE(is_frozen_spte(old_spte) || is_frozen_spte(new_spte));
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
@@ -1126,7 +1126,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* If SPTE has been frozen by another thread, just give up and
* retry, avoiding unnecessary page table allocation and free.
*/
- if (is_removed_spte(iter.old_spte))
+ if (is_frozen_spte(iter.old_spte))
goto retry;
if (iter.level == fault->goal_level)
--
2.34.1
* [PATCH v3 02/17] KVM: Add member to struct kvm_gfn_range for target alias
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
2024-06-19 22:35 ` [PATCH v3 01/17] KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE Rick Edgecombe
@ 2024-06-19 22:35 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 03/17] KVM: x86: Add a VM type define for TDX Rick Edgecombe
` (14 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:35 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add a new member to struct kvm_gfn_range to indicate which mapping
(private-vs-shared) to operate on: enum kvm_gfn_range_filter
attr_filter. Update the core zapping operations to set it appropriately.
TDX utilizes two GPA aliases for the same memslots, one for private
memory and one for shared. For private memory, KVM
cannot always perform the same operations it does on memory for default
VMs, such as zapping pages and having them be faulted back in, as this
requires guest coordination. However, some operations such as guest driven
conversion of memory between private and shared should zap private memory.
Internally to the MMU, private and shared mappings are tracked on separate
roots. Mapping and zapping operations will operate on the respective GFN
alias for each root (private or shared). So zapping operations will by
default zap both aliases. Add a field to struct kvm_gfn_range to allow
callers to specify which aliases to process, so they can target only the
aliases appropriate for their specific operation.
There was feedback that target aliases should be specified such that the
default value (0) is to operate on both aliases. Several options were
considered, including several variations of separate bools defined such
that the default behavior was to process both aliases; these either
allowed nonsensical configurations or were confusing for the caller. A
simple enum was also explored and was close, but was hard to process in
the caller. Instead, use an enum with the default value (0) reserved as a
disallowed value, and catch ranges that didn't have the target aliases
specified by looking for that specific value.
Set target alias with enum appropriately for these MMU operations:
- For KVM's mmu notifier callbacks, zap shared pages only because private
pages won't have a userspace mapping
- For setting memory attributes, kvm_arch_pre_set_memory_attributes()
chooses the aliases based on the attribute.
- For guest_memfd invalidations, zap private only.
Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Fix typo in comment (Paolo)
- Remove KVM_PROCESS_PRIVATE_AND_SHARED (Paolo)
- Remove outdated reference to exclude_{private,shared} (Paolo)
- Set process member in new kvm_mmu_zap_memslot_leafs() function
- Rename process -> filter (Paolo)
TDX MMU Prep:
- Replaced KVM_PROCESS_BASED_ON_ARG with BUGGY_KVM_INVALIDATION to follow
the original suggestion and not populte kvm_handle_gfn_range(). And add
WARN_ON_ONCE().
- Move attribute specific logic into kvm_vm_set_mem_attributes()
- Drop Sean's suggested-by tag as the solution has changed
- Re-write commit log
v18:
- rebased to kvm-next
v3:
- Drop the KVM_GFN_RANGE flags
- Updated struct kvm_gfn_range
- Change kvm_arch_set_memory_attributes() to return bool for flush
- Added set_memory_attributes x86 op for vendor backends
- Refined commit message to describe TDX care concretely
---
arch/x86/kvm/mmu/mmu.c | 6 ++++++
include/linux/kvm_host.h | 6 ++++++
virt/kvm/guest_memfd.c | 2 ++
virt/kvm/kvm_main.c | 14 ++++++++++++++
4 files changed, 28 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 828c70ead96f..f41c498fcdb5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7451,6 +7451,12 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
return false;
+ /* Unmap the old attribute page. */
+ if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+ range->attr_filter = KVM_FILTER_SHARED;
+ else
+ range->attr_filter = KVM_FILTER_PRIVATE;
+
return kvm_unmap_gfn_range(kvm, range);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c3c922bf077f..8dce85962583 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,11 +260,17 @@ union kvm_mmu_notifier_arg {
unsigned long attributes;
};
+enum kvm_gfn_range_filter {
+ KVM_FILTER_SHARED = BIT(0),
+ KVM_FILTER_PRIVATE = BIT(1),
+};
+
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+ enum kvm_gfn_range_filter attr_filter;
bool may_block;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 9714add38852..86aaf26c1144 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -109,6 +109,8 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
.slot = slot,
.may_block = true,
+ /* guest memfd is relevant to only private mappings. */
+ .attr_filter = KVM_FILTER_PRIVATE,
};
if (!found_memslot) {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 81b90bf03f2f..93c7b227aae0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,11 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
*/
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;
+ /*
+ * HVA-based notifications aren't relevant to private
+ * mappings as they don't have a userspace mapping.
+ */
+ gfn_range.attr_filter = KVM_FILTER_SHARED;
/*
* {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -2450,6 +2455,14 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;
+ /*
+ * If/when KVM supports more attributes beyond private .vs shared, this
+ * _could_ set KVM_FILTER_{SHARED,PRIVATE} appropriately if the entire target
+ * range already has the desired private vs. shared state (it's unclear
+ * if that is a net win). For now, KVM reaches this point if and only
+ * if the private flag is being toggled, i.e. all mappings are in play.
+ */
+
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
@@ -2506,6 +2519,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
struct kvm_mmu_notifier_range pre_set_range = {
.start = start,
.end = end,
+ .arg.attributes = attributes,
.handler = kvm_pre_set_memory_attributes,
.on_lock = kvm_mmu_invalidate_begin,
.flush_on_ret = true,
--
2.34.1
* [PATCH v3 03/17] KVM: x86: Add a VM type define for TDX
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
2024-06-19 22:35 ` [PATCH v3 01/17] KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE Rick Edgecombe
2024-06-19 22:35 ` [PATCH v3 02/17] KVM: Add member to struct kvm_gfn_range for target alias Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 04/17] KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page Rick Edgecombe
` (13 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe
Add a VM type define for TDX.
Future changes will need to lay the ground work for TDX support by
making some behavior conditional on the VM being a TDX guest.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep:
- New patch, split from main series
---
arch/x86/include/uapi/asm/kvm.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 988b5204d636..4dea0cfeee51 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -922,5 +922,6 @@ struct kvm_hyperv_eventfd {
#define KVM_X86_SEV_VM 2
#define KVM_X86_SEV_ES_VM 3
#define KVM_X86_SNP_VM 4
+#define KVM_X86_TDX_VM 5
#endif /* _ASM_X86_KVM_H */
--
2.34.1
* [PATCH v3 04/17] KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (2 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 03/17] KVM: x86: Add a VM type define for TDX Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-07-03 7:03 ` Yan Zhao
2024-06-19 22:36 ` [PATCH v3 05/17] KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_role Rick Edgecombe
` (12 subsequent siblings)
16 siblings, 1 reply; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata, Binbin Wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add an external pointer to struct kvm_mmu_page for TDX's private page table
and add helper functions to allocate/initialize/free a private page table
page. TDX will only be supported with the TDP MMU. Because KVM TDP MMU
doesn't use unsync_children and write_flooding_count, pack them to have
room for a pointer and use a union to avoid memory overhead.
For a private GPA, the CPU refers to a private page table whose contents
are encrypted. Dedicated APIs must be used to operate on it (e.g. to
update/read its PTEs), and they are expensive.
When KVM resolves a KVM page fault, it walks the page tables. To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table, allocate two sets of page tables for the private
half of the GPA space.
For the page tables that KVM will walk, allocate them like normal and refer
to them as mirror page tables. Additionally allocate one more page for the
page tables the CPU will walk, and call them external page tables. Resolve
the KVM page fault with the existing code, and do additional operations
necessary for modifying the external page table in future patches.
The relationship of the types of page tables in this scheme is depicted
below:
              KVM page fault                     |
                     |                           |
                     V                           |
        -------------+----------                 |
        |                      |                 |
        V                      V                 |
    shared GPA            private GPA            |
        |                      |                 |
        V                      V                 |
  shared PT root        mirror PT root           |    private PT root
        |                      |                 |           |
        V                      V                 |           V
    shared PT             mirror PT --propagate--> external PT
        |                      |                 |           |
        |                      \-----------------+------\    |
        |                                        |      |    |
        V                                        |      V    V
  shared guest page                              |   private guest page
                                                 |
  non-encrypted memory                           |    encrypted memory
                                                 |
PT - Page table
Shared PT - Visible to KVM, and the CPU uses it for shared mappings.
External PT - The CPU uses it, but it is invisible to KVM. TDX module
updates this table to map private guest pages.
Mirror PT - It is visible to KVM, but the CPU doesn't use it. KVM uses
it to propagate PT change to the actual private PT.
Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it
to the TDX vm type.
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX MMU Prep v3:
- mirrored->external rename (Paolo)
- Remove accidentally included kvm_mmu_alloc_private_spt() (Paolo)
- Those -> These (Paolo)
- Change log updates to make external/mirrored naming more clear
TDX MMU Prep v2:
- Rename private->mirror
- Don't trigger off of shared mask
TDX MMU Prep:
- Rename terminology, dummy PT => mirror PT, and updated the commit message.
By Rick and Kai.
- Added a comment on union of private_spt by Rick.
- Don't handle the root case in kvm_mmu_alloc_private_spt(), it will not
be needed in future patches. (Rick)
- Update comments (Yan)
- Remove kvm_mmu_init_private_spt(), open code it in later patches (Yan)
v19:
- typo in the comment in kvm_mmu_alloc_private_spt()
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/kvm/mmu.h | 5 +++++
arch/x86/kvm/mmu/mmu.c | 7 +++++++
arch/x86/kvm/mmu/mmu_internal.h | 31 +++++++++++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.c | 1 +
5 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aabf1648a56a..9e35fe32f500 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -817,6 +817,11 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * This cache is to allocate private page table. E.g. private EPT used
+ * by the TDX module.
+ */
+ struct kvm_mmu_memory_cache mmu_external_spt_cache;
/*
* QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index dc80e72e4848..0c3bf89cf7db 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -318,4 +318,9 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
return gpa;
return translate_nested_gpa(vcpu, gpa, access, exception);
}
+
+static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f41c498fcdb5..8023cebeefaa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
+ if (kvm_has_mirrored_tdp(vcpu->kvm)) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_external_spt_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
@@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_external_spt_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 706f0ce8784c..d2837f796f34 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -101,7 +101,22 @@ struct kvm_mmu_page {
int root_count;
refcount_t tdp_mmu_root_count;
};
- unsigned int unsync_children;
+ union {
+ /* These two members aren't used for TDP MMU */
+ struct {
+ unsigned int unsync_children;
+ /*
+ * Number of writes since the last time traversal
+ * visited this page.
+ */
+ atomic_t write_flooding_count;
+ };
+ /*
+ * Page table page of private PT.
+ * Passed to TDX module, not accessed by KVM.
+ */
+ void *external_spt;
+ };
union {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
@@ -124,9 +139,6 @@ struct kvm_mmu_page {
int clear_spte_count;
#endif
- /* Number of writes since the last time traversal visited this page. */
- atomic_t write_flooding_count;
-
#ifdef CONFIG_X86_64
/* Used for freeing the page asynchronously if it is a TDP MMU page. */
struct rcu_head rcu_head;
@@ -145,6 +157,17 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}
+static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ /*
+ * external_spt is allocated for TDX module to hold private EPT mappings,
+ * TDX module will initialize the page by itself.
+ * Therefore, KVM does not need to initialize or access external_spt.
+ * KVM only interacts with sp->spt for external EPT operations.
+ */
+ sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 16b54208e8d7..35249555b585 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
+ free_page((unsigned long)sp->external_spt);
free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 05/17] KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_role
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (3 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 04/17] KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 06/17] KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void Rick Edgecombe
` (11 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Introduce an "is_mirror" member to the kvm_mmu_page_role union to identify
SPTEs associated with the mirrored EPT.
The TDX module maintains the private half of the EPT mapped in the TD in
its protected memory. KVM keeps a copy of the private GPAs in a mirrored
EPT tree within host memory. This "is_mirror" attribute enables vCPUs to
find and get the root page of the mirrored EPT from the MMU root list for a
guest TD. It also allows KVM MMU code to detect changes in the mirrored EPT
according to the "is_mirror" mmu page role and propagate the changes to
the private EPT managed by the TDX module.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Rename role mirror_pt -> is_mirror (Paolo)
- Remove unnessary helpers that just access a member (Paolo)
TDX MMU Prep v2:
- Rename private -> mirrored
TDX MMU Prep:
- Remove warning and NULL check in is_private_sptep() (Rick)
- Update commit log (Yan)
v19:
- Fix is_private_sptep() when NULL case.
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu/mmu_internal.h | 5 +++++
arch/x86/kvm/mmu/spte.h | 5 +++++
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9e35fe32f500..6c59b129f382 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -351,7 +351,8 @@ union kvm_mmu_page_role {
unsigned ad_disabled:1;
unsigned guest_mode:1;
unsigned passthrough:1;
- unsigned :5;
+ unsigned is_mirror:1;
+ unsigned :4;
/*
* This is left at the top of the word so that
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d2837f796f34..5a2c9be23627 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -157,6 +157,11 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}
+static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
+{
+ return sp->role.is_mirror;
+}
+
static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 86e5259aa824..4883d139761b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -265,6 +265,11 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
return spte_to_child_sp(root);
}
+static inline bool is_mirror_sptep(u64 *sptep)
+{
+ return is_mirror_sp(sptep_to_sp(sptep));
+}
+
static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 06/17] KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (4 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 05/17] KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_role Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 07/17] KVM: x86/tdp_mmu: Take struct kvm in iter loops Rick Edgecombe
` (10 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe
The kvm_tdp_mmu_alloc_root() function currently always returns 0. This
allows for the caller, mmu_alloc_direct_roots(), to call
kvm_tdp_mmu_alloc_root() and also return 0 in one line:
return kvm_tdp_mmu_alloc_root(vcpu);
So it is useful even though the return value of kvm_tdp_mmu_alloc_root()
is always the same. However, in future changes, kvm_tdp_mmu_alloc_root()
will be called twice in mmu_alloc_direct_roots(). This will force the
first call to either awkwardly handle the return value that will always
be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void.
Do it in a separate change so the future change will be cleaner.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU Prep v3:
- Add Paolo's reviewed-by
TDX MMU Prep:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 6 ++++--
arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8023cebeefaa..138e7bbcda1e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3700,8 +3700,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
unsigned i;
int r;
- if (tdp_mmu_enabled)
- return kvm_tdp_mmu_alloc_root(vcpu);
+ if (tdp_mmu_enabled) {
+ kvm_tdp_mmu_alloc_root(vcpu);
+ return 0;
+ }
write_lock(&vcpu->kvm->mmu_lock);
r = make_mmu_pages_available(vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 35249555b585..07b0b884e246 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -224,7 +224,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}
-int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
union kvm_mmu_page_role role = mmu->root_role;
@@ -285,7 +285,6 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
*/
mmu->root.hpa = __pa(root->spt);
mmu->root.pgd = 0;
- return 0;
}
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 58b55e61bd33..437ddd4937a9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
-int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 07/17] KVM: x86/tdp_mmu: Take struct kvm in iter loops
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (5 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 06/17] KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 08/17] KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep() Rick Edgecombe
` (9 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add a struct kvm argument to the TDP MMU iterators.
Future changes will want to change how the iterator behaves based on a
member of struct kvm. Change the signature and callers of the iterator
loop helpers in a separate patch to make the future one easier to review.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Split from "KVM: x86/mmu: Support GFN direct mask" (Paolo)
---
arch/x86/kvm/mmu/tdp_iter.h | 6 +++---
arch/x86/kvm/mmu/tdp_mmu.c | 36 ++++++++++++++++++------------------
2 files changed, 21 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..62c9ca32d922 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -120,13 +120,13 @@ struct tdp_iter {
* Iterates over every SPTE mapping the GFN range [start, end) in a
* preorder traversal.
*/
-#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
+#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
for (tdp_iter_start(&iter, root, min_level, start); \
iter.valid && iter.gfn < end; \
tdp_iter_next(&iter))
-#define for_each_tdp_pte(iter, root, start, end) \
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
+#define for_each_tdp_pte(iter, kvm, root, start, end) \
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)
tdp_ptep_t spte_to_child_pt(u64 pte, int level);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 07b0b884e246..4a7518c9ba7e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -674,18 +674,18 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
iter->gfn, iter->level);
}
-#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
- for_each_tdp_pte(_iter, _root, _start, _end)
+#define tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
-#define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end) \
- tdp_root_for_each_pte(_iter, _root, _start, _end) \
+#define tdp_root_for_each_leaf_pte(_iter, _kvm, _root, _start, _end) \
+ tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
if (!is_shadow_present_pte(_iter.old_spte) || \
!is_last_spte(_iter.old_spte, _iter.level)) \
continue; \
else
-#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
- for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _kvm, _mmu, _start, _end) \
+ for_each_tdp_pte(_iter, _kvm, root_to_sp(_mmu->root.hpa), _start, _end)
/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -751,7 +751,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t end = tdp_mmu_max_gfn_exclusive();
gfn_t start = 0;
- for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -855,7 +855,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
@@ -1116,7 +1116,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
rcu_read_lock();
- tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+ tdp_mmu_for_each_pte(iter, kvm, mmu, fault->gfn, fault->gfn + 1) {
int r;
if (fault->nx_huge_page_workaround_enabled)
@@ -1214,7 +1214,7 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
rcu_read_lock();
- tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+ tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end)
ret |= handler(kvm, &iter, range);
rcu_read_unlock();
@@ -1297,7 +1297,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
- for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
@@ -1460,7 +1460,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
* level above the target level (e.g. splitting a 1GB to 512 2MB pages,
* and then splitting each of those to 512 4KB pages).
*/
- for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -1545,7 +1545,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- tdp_root_for_each_pte(iter, root, start, end) {
+ tdp_root_for_each_pte(iter, kvm, root, start, end) {
retry:
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
@@ -1600,7 +1600,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
+ tdp_root_for_each_leaf_pte(iter, kvm, root, gfn + __ffs(mask),
gfn + BITS_PER_LONG) {
if (!mask)
break;
@@ -1657,7 +1657,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_2M, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
@@ -1727,7 +1727,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, gfn, gfn + 1) {
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
@@ -1782,7 +1782,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
*root_level = vcpu->arch.mmu->root_role.level;
- tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, mmu, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
}
@@ -1809,7 +1809,7 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;
- tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, mmu, gfn, gfn + 1) {
*spte = iter.old_spte;
sptep = iter.sptep;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 08/17] KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep()
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (6 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 07/17] KVM: x86/tdp_mmu: Take struct kvm in iter loops Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 09/17] KVM: x86/mmu: Support GFN direct bits Rick Edgecombe
` (8 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe
Pass fault->gfn into kvm_tdp_mmu_fast_pf_get_last_sptep(), instead of
passing fault->addr and then converting it to a GFN.
Future changes will make fault->addr and fault->gfn differ when running
TDX guests. The GFN will be conceptually the same as it is for normal VMs,
but fault->addr may contain a TDX specific bit that differentiates between
"shared" and "private" memory. This bit will be used to direct faults to
be handled on different roots, either the normal "direct" root or a new
type of root that handles private memory. The TDP iterators will process
the traditional GFN concept and apply the required TDX specifics depending
on the root type. For this reason, it needs to operate on regular GFN and
not the addr, which may contain these special TDX specific bits.
Today kvm_tdp_mmu_fast_pf_get_last_sptep() takes fault->addr and then
immediately converts it to a GFN with a bit shift. However, this would
unfortunately retain the TDX specific bits in what is supposed to be a
traditional GFN. Excluding TDX's needs, it is also is unnecessary to pass
fault->addr and convert it to a GFN when the GFN is already on hand.
So instead just pass the GFN into kvm_tdp_mmu_fast_pf_get_last_sptep() and
use it directly.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
3 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 138e7bbcda1e..e9c1783a8743 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3452,7 +3452,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
u64 new_spte;
if (tdp_mmu_enabled)
- sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
+ sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->gfn, &spte);
else
sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4a7518c9ba7e..067249dbbb5e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1801,12 +1801,11 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
*
* WARNING: This function is only intended to be called during fast_page_fault.
*/
-u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
u64 *spte)
{
struct tdp_iter iter;
struct kvm_mmu *mmu = vcpu->arch.mmu;
- gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;
tdp_mmu_for_each_pte(iter, vcpu->kvm, mmu, gfn, gfn + 1) {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 437ddd4937a9..1ba84487f3b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -64,7 +64,7 @@ static inline void kvm_tdp_mmu_walk_lockless_end(void)
int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level);
-u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
u64 *spte);
#ifdef CONFIG_X86_64
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 09/17] KVM: x86/mmu: Support GFN direct bits
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (7 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 08/17] KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep() Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 10/17] KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root() Rick Edgecombe
` (7 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Teach the MMU to map guest GFNs at a massaged position on the TDP, to aid
in implementing TDX shared memory.
Like other CoCo technologies, TDX has the concept of private and shared
memory. For TDX, the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.
For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot lookups and zapping
operations will be provided with a GFN without the shared bit set.
So KVM will either need to apply or strip the shared bit before mapping or
zapping the shared EPT. Having GFNs sometimes have the shared bit and
sometimes not would make the code confusing.
So instead, arrange the code such that GFNs never have the shared bit set.
Create a concept of "direct bits", that is stripped from the fault
address when setting fault->gfn, and applied within the TDP MMU iterator.
Calling code will behave as if it is operating on the PTE mapping the GFN
(without shared bits), but within the iterator the actual mappings will be
shifted using bits specific to the root. SPs will have the GFN set
without the shared bit. In the end the TDP MMU will behave like it is
mapping things at the GFN without the shared bit, but with a strange page
table format where everything is offset by the shared bit.
Since TDX only needs to shift the mapping like this for the shared bit,
which is mapped as the normal TDP root, add a "gfn_direct_bits" field to
the kvm_arch structure for each VM with a default value of 0. It will
have the bit set at the position of the GPA shared bit in GFN through TD
specific initialization code. Keep TDX specific concepts out of the MMU
code by not naming it "shared".
Ranged TLB flushes (i.e. flush_remote_tlbs_range()) target specific GFN
ranges. In the convention established above, these would need to target the
shifted GFN range. It won't matter functionally, since the actual
implementation will always result in a full flush for the only planned
user (TDX). For correctness reasons, future changes can provide a TDX
x86_ops.flush_remote_tlbs_range implementation to return -EOPNOTSUPP and
force the full flush for TDs.
This leaves one drawback. Some operations use a concept of max gfn (i.e.
kvm_mmu_max_gfn()), to iterate over the whole TDP range. These would then
exceed the range actually covered by each root. It should only result in a
bit of extra iterating, and not cause functional problems. This will be
addressed in a future change.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Add comment for kvm_gfn_root_mask() (Paolo)
- Change names mask -> bits (Paolo)
- Add comment in struct definition for fault->gfn not containing shared
bit. (Paolo)
- Drop special handling in kvm_arch_flush_remote_tlbs_range(),
implement kvm_x86_ops.flush_remote_tlbs_range in a future patch.
(Paolo)
- Do addition of kvm arg to iterator in previous patch (Paolo)
- OR gfn_bits in try_step_side() too, because of issue seen with 4
level EPT
- Add warning for GFN bits in wrong arg in tdp_iter_start()
TDX MMU Prep v2:
- Rename from "KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA"
- Dropped Binbin's reviewed-by tag because of the extent of the changes
- Rename gfn_shared_mask to gfn_direct_mask.
- Don't include shared bits in GFNs, hide the existence in the TDP MMU
iterator.
- Don't do range flushes if a gfn_direct_mask is present.
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu.h | 5 +++++
arch/x86/kvm/mmu/mmu_internal.h | 28 ++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_iter.c | 10 ++++++----
arch/x86/kvm/mmu/tdp_iter.h | 10 ++++++----
5 files changed, 45 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6c59b129f382..6e07eff06a58 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1525,6 +1525,8 @@ struct kvm_arch {
*/
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+ gfn_t gfn_direct_bits;
};
struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0c3bf89cf7db..63179a4fba7b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -323,4 +323,9 @@ static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
{
return kvm->arch.vm_type == KVM_X86_TDX_VM;
}
+
+static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
+{
+ return kvm->arch.gfn_direct_bits;
+}
#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 5a2c9be23627..a19a4a05566a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -6,6 +6,8 @@
#include <linux/kvm_host.h>
#include <asm/kvm_host.h>
+#include "mmu.h"
+
#ifdef CONFIG_KVM_PROVE_MMU
#define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
#else
@@ -173,6 +175,18 @@ static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_
sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
}
+static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm, const struct kvm_mmu_page *root)
+{
+ /*
+ * Since mirror SPs are used only for TDX, which maps private memory
+ * at its "natural" GFN, no mask needs to be applied to them - and, dually,
+ * we expect that the bits is only used for the shared PT.
+ */
+ if (is_mirror_sp(root))
+ return 0;
+ return kvm_gfn_direct_bits(kvm);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
@@ -257,7 +271,12 @@ struct kvm_page_fault {
*/
u8 goal_level;
- /* Shifted addr, or result of guest page table walk if addr is a gva. */
+ /*
+ * Shifted addr, or result of guest page table walk if addr is a gva. In
+ * the case of VM where memslot's can be mapped at multiple GPA aliases
+ * (i.e. TDX), the gfn field does not contain the bit that selects between
+ * the aliases (i.e. the shared bit for TDX).
+ */
gfn_t gfn;
/* The memslot containing gfn. May be NULL. */
@@ -343,7 +362,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
int r;
if (vcpu->arch.mmu->root_role.direct) {
- fault.gfn = fault.addr >> PAGE_SHIFT;
+ /*
+ * Things like memslots don't understand the concept of a shared
+ * bit. Strip it so that the GFN can be used like normal, and the
+ * fault.addr can be used when the shared bit is needed.
+ */
+ fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_direct_bits(vcpu->kvm);
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 04c247bfe318..9e17bfa80901 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@
static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
- SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
+ SPTE_INDEX((iter->gfn | iter->gfn_bits) << PAGE_SHIFT, iter->level);
iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
}
@@ -37,15 +37,17 @@ void tdp_iter_restart(struct tdp_iter *iter)
* rooted at root_pt, starting with the walk to translate next_last_level_gfn.
*/
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn)
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits)
{
if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
- (root->role.level > PT64_ROOT_MAX_LEVEL))) {
+ (root->role.level > PT64_ROOT_MAX_LEVEL) ||
+ (gfn_bits && next_last_level_gfn >= gfn_bits))) {
iter->valid = false;
return;
}
iter->next_last_level_gfn = next_last_level_gfn;
+ iter->gfn_bits = gfn_bits;
iter->root_level = root->role.level;
iter->min_level = min_level;
iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
@@ -113,7 +115,7 @@ static bool try_step_side(struct tdp_iter *iter)
* Check if the iterator is already at the end of the current page
* table.
*/
- if (SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
+ if (SPTE_INDEX((iter->gfn | iter->gfn_bits) << PAGE_SHIFT, iter->level) ==
(SPTE_ENT_PER_PAGE - 1))
return false;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 62c9ca32d922..e9b4dff7c129 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -91,8 +91,10 @@ struct tdp_iter {
tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
/* A pointer to the current SPTE */
tdp_ptep_t sptep;
- /* The lowest GFN mapped by the current SPTE */
+ /* The lowest GFN (mask bits excluded) mapped by the current SPTE */
gfn_t gfn;
+ /* Mask applied to convert the GFN to the mapping GPA */
+ gfn_t gfn_bits;
/* The level of the root page given to the iterator */
int root_level;
/* The lowest level the iterator should traverse to */
@@ -121,8 +123,8 @@ struct tdp_iter {
* preorder traversal.
*/
#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
- for (tdp_iter_start(&iter, root, min_level, start); \
- iter.valid && iter.gfn < end; \
+ for (tdp_iter_start(&iter, root, min_level, start, kvm_gfn_root_bits(kvm, root)); \
+ iter.valid && iter.gfn < end; \
tdp_iter_next(&iter))
#define for_each_tdp_pte(iter, kvm, root, start, end) \
@@ -131,7 +133,7 @@ struct tdp_iter {
tdp_ptep_t spte_to_child_pt(u64 pte, int level);
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn);
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits);
void tdp_iter_next(struct tdp_iter *iter);
void tdp_iter_restart(struct tdp_iter *iter);
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 10/17] KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (8 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 09/17] KVM: x86/mmu: Support GFN direct bits Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 11/17] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type Rick Edgecombe
` (6 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Extract tdp_mmu_root_match() to check whether a root matches the given
types, and use it in the root page table iterator. For now it only checks
only_valid.
TDX KVM operates on the shared page table only (Shared-EPT), the mirrored
page table only (Secure-EPT), or both, depending on the operation. KVM MMU
notifier operations act only on the shared page table, KVM guest_memfd
invalidation operations act only on the mirrored page table, and so on.
Introduce a centralized matching function instead of open coding the
matching logic in the iterator.
The next step is to extend the function to check whether the page table is
shared or private.
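The shape of the centralized predicate can be sketched outside of KVM with a simplified stand-in; `mock_root` and `mock_root_match` below are hypothetical names standing in for `struct kvm_mmu_page` and `tdp_mmu_root_match()`, not the kernel code itself:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the relevant bit of struct kvm_mmu_page. */
struct mock_root {
	bool invalid;
};

/*
 * Centralized match check, in the spirit of tdp_mmu_root_match(): when
 * the caller asks for valid roots only, invalid roots are filtered out;
 * everything else matches.
 */
static bool mock_root_match(const struct mock_root *root, bool only_valid)
{
	if (only_valid && root->invalid)
		return false;

	return true;
}
```

The iterator macros then call one predicate instead of repeating the `only_valid && invalid` logic at each use site, which is what makes later extensions (shared vs private) a one-place change.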
Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep:
- New patch
---
arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 067249dbbb5e..cecc25947001 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -92,6 +92,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
+static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid)
+{
+ if (only_valid && root->role.invalid)
+ return false;
+
+ return true;
+}
+
/*
* Returns the next root after @prev_root (or the first root if @prev_root is
* NULL). A reference to the returned root is acquired, and the reference to
@@ -125,7 +133,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);
while (next_root) {
- if ((!only_valid || !next_root->role.invalid) &&
+ if (tdp_mmu_root_match(next_root, only_valid) &&
kvm_tdp_mmu_get_root(next_root))
break;
@@ -176,7 +184,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
- ((_only_valid) && (_root)->role.invalid))) { \
+ !tdp_mmu_root_match((_root), (_only_valid)))) { \
} else
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 11/17] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (9 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 10/17] KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root() Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 12/17] KVM: x86/tdp_mmu: Take root in tdp_mmu_for_each_pte() Rick Edgecombe
` (5 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Define an enum kvm_tdp_mmu_root_types to specify the KVM MMU root type [1]
so that the iterator on the root page table can consistently filter the
root page table type instead of only_valid.
TDX KVM will operate on KVM page tables with specified types: shared page
table, private page table, or both. Introduce an enum instead of the bool
only_valid so that the page table types can easily be extended to cover
shared, private, or both, in addition to valid or not. Replace
only_valid=false with KVM_ALL_ROOTS and only_valid=true with
KVM_VALID_ROOTS. Keeping the types as enum bits avoids further code churn
when the direct vs mirror root concepts are introduced in future patches.
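The bit-flag layout can be illustrated with a small standalone sketch; the `MOCK_*` names are stand-ins for the `KVM_*_ROOTS` values in the diff below, and the predicate shows the intermediate state of this patch (any valid root still matches unconditionally):

```c
#include <assert.h>
#include <stdbool.h>

/* Bit-flag root types, mirroring the layout introduced by this patch. */
enum mock_root_types {
	MOCK_INVALID_ROOTS = 1 << 0,

	MOCK_VALID_ROOTS   = 1 << 1,
	MOCK_ALL_ROOTS     = MOCK_VALID_ROOTS | MOCK_INVALID_ROOTS,
};

/*
 * An invalid root matches only if the caller asked for invalid roots;
 * at this stage of the series, any other root matches.
 */
static bool mock_types_match(bool invalid, enum mock_root_types types)
{
	if (invalid)
		return types & MOCK_INVALID_ROOTS;

	return true;
}
```

Because the types are independent bits, later patches can carve `MOCK_VALID_ROOTS` into direct and mirror bits without touching any caller that passes `MOCK_ALL_ROOTS`.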
Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/ [1]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Drop KVM_ANY_ROOTS, KVM_ANY_VALID_ROOTS and switch to KVM_VALID_ROOTS
and KVM_ALL_ROOTS. (Paolo)
TDX MMU Prep:
- Newly introduced.
---
arch/x86/kvm/mmu/tdp_mmu.c | 41 +++++++++++++++++++-------------------
arch/x86/kvm/mmu/tdp_mmu.h | 7 +++++++
2 files changed, 28 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cecc25947001..c8e5e779967e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -92,27 +92,28 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
-static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid)
+static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
+ enum kvm_tdp_mmu_root_types types)
{
- if (only_valid && root->role.invalid)
- return false;
+ if (root->role.invalid)
+ return types & KVM_INVALID_ROOTS;
return true;
}
/*
* Returns the next root after @prev_root (or the first root if @prev_root is
- * NULL). A reference to the returned root is acquired, and the reference to
- * @prev_root is released (the caller obviously must hold a reference to
- * @prev_root if it's non-NULL).
+ * NULL) that matches with @types. A reference to the returned root is
+ * acquired, and the reference to @prev_root is released (the caller obviously
+ * must hold a reference to @prev_root if it's non-NULL).
*
- * If @only_valid is true, invalid roots are skipped.
+ * Roots that don't match @types are skipped.
*
* Returns NULL if the end of tdp_mmu_roots was reached.
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
struct kvm_mmu_page *prev_root,
- bool only_valid)
+ enum kvm_tdp_mmu_root_types types)
{
struct kvm_mmu_page *next_root;
@@ -133,7 +134,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);
while (next_root) {
- if (tdp_mmu_root_match(next_root, only_valid) &&
+ if (tdp_mmu_root_match(next_root, types) &&
kvm_tdp_mmu_get_root(next_root))
break;
@@ -158,20 +159,20 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* If shared is set, this function is operating under the MMU lock in read
* mode.
*/
-#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid) \
- for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \
+#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _types) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _types); \
({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
- _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \
+ _root = tdp_mmu_next_root(_kvm, _root, _types)) \
if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
} else
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, true)
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, KVM_VALID_ROOTS)
#define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \
- for (_root = tdp_mmu_next_root(_kvm, NULL, false); \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, KVM_ALL_ROOTS); \
({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
- _root = tdp_mmu_next_root(_kvm, _root, false))
+ _root = tdp_mmu_next_root(_kvm, _root, KVM_ALL_ROOTS))
/*
* Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
@@ -180,18 +181,18 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* Holding mmu_lock for write obviates the need for RCU protection as the list
* is guaranteed to be stable.
*/
-#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _only_valid) \
+#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _types) \
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
- !tdp_mmu_root_match((_root), (_only_valid)))) { \
+ !tdp_mmu_root_match((_root), (_types)))) { \
} else
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_ALL_ROOTS)
#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root(_kvm, _root, _as_id, true)
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_VALID_ROOTS)
static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
@@ -1197,7 +1198,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
{
struct kvm_mmu_page *root;
- __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, KVM_ALL_ROOTS)
flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
range->may_block, flush);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1ba84487f3b7..b887c225ff24 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -19,6 +19,13 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
+enum kvm_tdp_mmu_root_types {
+ KVM_INVALID_ROOTS = BIT(0),
+
+ KVM_VALID_ROOTS = BIT(1),
+ KVM_ALL_ROOTS = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
+};
+
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 12/17] KVM: x86/tdp_mmu: Take root in tdp_mmu_for_each_pte()
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (10 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 11/17] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU Rick Edgecombe
` (4 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Take the root as an argument of tdp_mmu_for_each_pte() instead of looking
it up in the mmu. With no other purpose for passing the mmu, drop it.
Future changes will want to vary which root is used based on the context
of the MMU operation. So, as a preparatory step, change the callers to
pass in the root currently used, mmu->root.hpa, to make the later patch
smaller and easier to review.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Split from "KVM: x86/mmu: Support GFN direct mask" (Paolo)
---
arch/x86/kvm/mmu/tdp_mmu.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c8e5e779967e..2200bdc7681f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -693,8 +693,8 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
continue; \
else
-#define tdp_mmu_for_each_pte(_iter, _kvm, _mmu, _start, _end) \
- for_each_tdp_pte(_iter, _kvm, root_to_sp(_mmu->root.hpa), _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -1113,8 +1113,8 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
*/
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
struct kvm *kvm = vcpu->kvm;
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret = RET_PF_RETRY;
@@ -1125,7 +1125,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
rcu_read_lock();
- tdp_mmu_for_each_pte(iter, kvm, mmu, fault->gfn, fault->gfn + 1) {
+ tdp_mmu_for_each_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
int r;
if (fault->nx_huge_page_workaround_enabled)
@@ -1784,14 +1784,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level)
{
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
struct tdp_iter iter;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;
*root_level = vcpu->arch.mmu->root_role.level;
- tdp_mmu_for_each_pte(iter, vcpu->kvm, mmu, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
}
@@ -1813,11 +1813,11 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
u64 *spte)
{
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
struct tdp_iter iter;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
tdp_ptep_t sptep = NULL;
- tdp_mmu_for_each_pte(iter, vcpu->kvm, mmu, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
*spte = iter.old_spte;
sptep = iter.sptep;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (11 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 12/17] KVM: x86/tdp_mmu: Take root in tdp_mmu_for_each_pte() Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-24 8:30 ` Yan Zhao
2024-07-04 8:51 ` Yan Zhao
2024-06-19 22:36 ` [PATCH v3 14/17] KVM: x86/tdp_mmu: Propagate attr_filter to MMU notifier callbacks Rick Edgecombe
` (3 subsequent siblings)
16 siblings, 2 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add the ability for the TDP MMU to maintain a mirror of a separate
mapping.
Like other CoCo technologies, TDX has the concept of private and shared
memory. For TDX, the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.
In order to handle both shared and private memory, KVM needs to learn to
handle faults and other operations on the correct root for the operation.
KVM could learn the concept of private roots, and operate on them by
calling out to operations that call into the TDX module. But there are two
problems with that:
1. Calls into the TDX module are relatively slow compared to the simple
accesses required to read a PTE managed directly by KVM.
2. Other CoCo technologies deal with private memory completely differently,
and the code would be confusing when read from their perspective. Special
operations added for TDX that set or zap private memory will have
nothing to do with these other private memory technologies (SEV, etc.).
To handle these, instead teach the TDP MMU about a new concept, "mirror
roots". Such roots maintain page tables that are not actually mapped,
and are just used to traverse quickly to determine if the mid-level page
tables need to be installed. When the memory being mirrored needs to
actually be changed, calls can be made via x86_ops.
private KVM page fault |
| |
V |
private GPA | CPU protected EPTP
| | |
V | V
mirror PT root | private PT root
| | |
V | V
mirror PT --hook to propagate-->private PT
| | |
\--------------------+------\ |
| | |
| V V
| private guest page
|
|
non-encrypted memory | encrypted memory
|
Leave calling out to actually update the mirrored private page tables for
later changes. For now, just implement the handling of MMU operations on
mirrored roots.
In order to direct operations to the correct root, add root types
KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct
roots to private/shared with conditionals. It could also be implemented by
making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line
up such that the conversion could be a direct assignment with a cast. Don't
do this, because the mapping of private to mirrored is confusing enough
already; it is worth not hiding the logic in type casting.
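The explicit, non-cast conversion argued for above can be sketched as follows; the `MOCK_*` names and the `has_mirrored_tdp` flag are simplified stand-ins for `kvm_gfn_range_filter_to_root_types()` and `kvm_has_mirrored_tdp()` in the diff below:

```c
#include <assert.h>
#include <stdbool.h>

enum mock_root_types {
	MOCK_INVALID_ROOTS = 1 << 0,
	MOCK_DIRECT_ROOTS  = 1 << 1,
	MOCK_MIRROR_ROOTS  = 1 << 2,
};

enum mock_filter {
	MOCK_FILTER_SHARED  = 1 << 0,
	MOCK_FILTER_PRIVATE = 1 << 1,
};

/*
 * Explicit private->mirror and shared->direct mapping, rather than
 * relying on the two enums' bit layouts lining up for a cast.
 */
static enum mock_root_types filter_to_root_types(bool has_mirrored_tdp,
						 enum mock_filter filter)
{
	enum mock_root_types ret = 0;

	/* VMs without mirrored TDP only ever have direct roots. */
	if (!has_mirrored_tdp)
		return MOCK_DIRECT_ROOTS;

	if (filter & MOCK_FILTER_PRIVATE)
		ret |= MOCK_MIRROR_ROOTS;
	if (filter & MOCK_FILTER_SHARED)
		ret |= MOCK_DIRECT_ROOTS;

	return ret;
}
```

The two conditionals cost a couple of branches but keep the private-to-mirrored mapping readable at the one place where it happens.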
Clean up the mirror root in kvm_mmu_destroy() instead of the normal place
in kvm_mmu_free_roots(), because the private root that is being mirrored
cannot be rebuilt like a normal root. It needs to persist for the lifetime
of the VM.
The TDX module will also need to be provided with page tables to use for
the actual mapping being mirrored by the mirrored page tables. Allocate
these in the mapping path using the recently added
kvm_mmu_alloc_external_spt().
Don't support 2M pages for now. This is avoided by forcing 4K pages in the
fault. Add a KVM_BUG_ON() to verify.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Remove handle_changed_spte() changes
- Drop kvm_on_mirror()/kvm_on_direct(), open code (Paolo)
- Rename tdp_mmu_get_fault_root_type() -> tdp_mmu_get_root_for_fault() (Paolo)
- Use fault->addr (in helper) to determine direct vs mirror root (Paolo)
- Rename process -> filter (Paolo)
TDX MMU Prep v2:
- Rename private->mirror
- Split apart from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP
MMU"
- Update log
- Sprinkle a few comments
- Use kvm_on_*() helpers to direct iterator to proper root
- Drop BUGGY_KVM_ROOTS because the translation between the process enum
is no longer automatic, and the warn already happens elsewhere.
TDX MMU Prep:
- Remove unnecessary gfn, access twist in
tdp_mmu_map_handle_target_level(). (Chao Gao)
- Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
tdp_mmu_alloc_sp()
- Update comment in set_private_spte_present() (Yan)
- Open code call to kvm_mmu_init_private_spt() (Yan)
- Add comments on TDX MMU hooks (Yan)
- Fix various whitespace alignment (Yan)
- Remove pointless warnings and conditionals in
handle_removed_private_spte() (Yan)
- Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
- Remove incorrect comment in handle_changed_spte() (Yan)
- Remove unneeded kvm_pfn_to_refcounted_page() and
is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
- Do kvm_gfn_for_root() branchless (Rick)
- Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
- Add comment for stripping shared bit for fault.gfn (Chao)
v19:
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.h | 7 ++++++
arch/x86/kvm/mmu/mmu.c | 11 ++++++++-
arch/x86/kvm/mmu/tdp_mmu.c | 34 ++++++++++++++++++++------
arch/x86/kvm/mmu/tdp_mmu.h | 43 ++++++++++++++++++++++++++++++---
5 files changed, 84 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6e07eff06a58..d67e88a69fc4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -460,6 +460,7 @@ struct kvm_mmu {
int (*sync_spte)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp, int i);
struct kvm_mmu_root_info root;
+ hpa_t mirror_root_hpa;
union kvm_cpu_role cpu_role;
union kvm_mmu_page_role root_role;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 63179a4fba7b..7b12ba761c51 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -328,4 +328,11 @@ static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
{
return kvm->arch.gfn_direct_bits;
}
+
+static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
+{
+ gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));
+
+ return !gpa_direct_bits || (gpa & gpa_direct_bits);
+}
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e9c1783a8743..287dcc2685e4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
int r;
if (tdp_mmu_enabled) {
- kvm_tdp_mmu_alloc_root(vcpu);
+ if (kvm_has_mirrored_tdp(vcpu->kvm))
+ kvm_tdp_mmu_alloc_root(vcpu, true);
+ kvm_tdp_mmu_alloc_root(vcpu, false);
return 0;
}
@@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
mmu->root.hpa = INVALID_PAGE;
mmu->root.pgd = 0;
+ mmu->mirror_root_hpa = INVALID_PAGE;
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
@@ -7220,6 +7223,12 @@ int kvm_mmu_vendor_module_init(void)
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
kvm_mmu_unload(vcpu);
+ if (tdp_mmu_enabled) {
+ read_lock(&vcpu->kvm->mmu_lock);
+ mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->mirror_root_hpa,
+ NULL);
+ read_unlock(&vcpu->kvm->mmu_lock);
+ }
free_mmu_pages(&vcpu->arch.root_mmu);
free_mmu_pages(&vcpu->arch.guest_mmu);
mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2200bdc7681f..a0010c62425f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -95,10 +95,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
enum kvm_tdp_mmu_root_types types)
{
+ if (WARN_ON_ONCE(!(types & KVM_VALID_ROOTS)))
+ return false;
+
if (root->role.invalid)
return types & KVM_INVALID_ROOTS;
+ if (likely(!is_mirror_sp(root)))
+ return types & KVM_DIRECT_ROOTS;
- return true;
+ return types & KVM_MIRROR_ROOTS;
}
/*
@@ -233,7 +238,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}
-void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
union kvm_mmu_page_role role = mmu->root_role;
@@ -241,6 +246,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;
+ if (mirror)
+ role.is_mirror = 1;
+
/*
* Check for an existing root before acquiring the pages lock to avoid
* unnecessary serialization if multiple vCPUs are loading a new root.
@@ -292,8 +300,12 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
* and actually consuming the root if it's invalidated after dropping
* mmu_lock, and the root can't be freed as this vCPU holds a reference.
*/
- mmu->root.hpa = __pa(root->spt);
- mmu->root.pgd = 0;
+ if (mirror) {
+ mmu->mirror_root_hpa = __pa(root->spt);
+ } else {
+ mmu->root.hpa = __pa(root->spt);
+ mmu->root.pgd = 0;
+ }
}
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
@@ -1113,8 +1125,8 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
*/
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
+ struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
struct kvm *kvm = vcpu->kvm;
- struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret = RET_PF_RETRY;
@@ -1152,13 +1164,18 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
*/
sp = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_child_sp(sp, &iter);
+ if (is_mirror_sp(sp))
+ kvm_mmu_alloc_external_spt(vcpu, sp);
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
- if (is_shadow_present_pte(iter.old_spte))
+ if (is_shadow_present_pte(iter.old_spte)) {
+ /* Don't support large page for mirrored roots (TDX) */
+ KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
- else
+ } else {
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
+ }
/*
* Force the guest to retry if installing an upper level SPTE
@@ -1813,7 +1830,8 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
u64 *spte)
{
- struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+ /* Fast pf is not supported for mirrored roots */
+ struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, KVM_DIRECT_ROOTS);
struct tdp_iter iter;
tdp_ptep_t sptep = NULL;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index b887c225ff24..2903f03a34be 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
-void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private);
__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
@@ -21,11 +21,48 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
enum kvm_tdp_mmu_root_types {
KVM_INVALID_ROOTS = BIT(0),
-
- KVM_VALID_ROOTS = BIT(1),
+ KVM_DIRECT_ROOTS = BIT(1),
+ KVM_MIRROR_ROOTS = BIT(2),
+ KVM_VALID_ROOTS = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
KVM_ALL_ROOTS = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
};
+static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(struct kvm *kvm,
+ enum kvm_gfn_range_filter process)
+{
+ enum kvm_tdp_mmu_root_types ret = 0;
+
+ if (!kvm_has_mirrored_tdp(kvm))
+ return KVM_DIRECT_ROOTS;
+
+ if (process & KVM_FILTER_PRIVATE)
+ ret |= KVM_MIRROR_ROOTS;
+ if (process & KVM_FILTER_SHARED)
+ ret |= KVM_DIRECT_ROOTS;
+
+ WARN_ON_ONCE(!ret);
+
+ return ret;
+}
+
+static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
+ return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
+
+ return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
+
+static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
+ enum kvm_tdp_mmu_root_types type)
+{
+ if (unlikely(type == KVM_MIRROR_ROOTS))
+ return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
+
+ return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
+
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 14/17] KVM: x86/tdp_mmu: Propagate attr_filter to MMU notifier callbacks
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (12 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 15/17] KVM: x86/tdp_mmu: Propagate building mirror page tables Rick Edgecombe
` (2 subsequent siblings)
16 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Teach the MMU notifier callbacks how to check kvm_gfn_range.attr_filter to
filter which KVM MMU root types to operate on.
The private GPAs are backed by guest memfd. Such memory is not subject
to MMU notifier callbacks because it can't be mapped into the host user
address space. Now kvm_gfn_range conveys info about which root to operate
on. Enhance the callbacks to filter on the root page table type.
The KVM MMU notifier comes down to two functions.
kvm_tdp_mmu_unmap_gfn_range() and kvm_tdp_mmu_handle_gfn().
For VMs without a private/shared split in the EPT, all operations
should target the normal (direct) root.
invalidate_range_start() comes into kvm_tdp_mmu_unmap_gfn_range().
invalidate_range_end() doesn't come into arch code.
With the switch from for_each_tdp_mmu_root() to
__for_each_tdp_mmu_root() in kvm_tdp_mmu_handle_gfn(), there are no
longer any users of for_each_tdp_mmu_root(). Remove it.
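The filtering described above can be modeled with a short userspace sketch.
All names here (filter_to_root_types(), the enums) are illustrative
stand-ins, not KVM's actual types:

```c
#include <assert.h>

/* Illustrative stand-ins for KVM's root-type and filter enums. */
enum root_types  { DIRECT_ROOT = 1 << 0, MIRROR_ROOT = 1 << 1 };
enum attr_filter { FILTER_SHARED = 1 << 0, FILTER_PRIVATE = 1 << 1 };

/*
 * Map an MMU notifier's attribute filter to the root types the
 * operation should visit.  VMs without a private/shared EPT split
 * only ever have direct roots.
 */
static int filter_to_root_types(int has_mirror_root, int filter)
{
	int types = 0;

	if (!has_mirror_root)
		return DIRECT_ROOT;

	if (filter & FILTER_SHARED)
		types |= DIRECT_ROOT;	/* shared mappings: direct root */
	if (filter & FILTER_PRIVATE)
		types |= MIRROR_ROOT;	/* private mappings: mirror root */

	return types;
}
```

A notifier handler would then walk only the roots whose type bit is in the
returned mask, which is the effect of switching kvm_tdp_mmu_handle_gfn() to
__for_each_tdp_mmu_root().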
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Change subject from "Make mmu notifier callbacks to check
kvm_process" to "Propagate attr_filter to MMU notifier callbacks"
(Paolo)
- Remove no longer used for_each_tdp_mmu_root() (Binbin)
TDX MMU Prep v2:
- Use newly added kvm_process_to_root_types()
TDX MMU Prep:
- Remove warning (Rick)
- Remove confusing mention of mapping flags (Chao)
- Re-write coverletter
v19:
- type: test_gfn() => test_young()
---
arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a0010c62425f..582e5a045bb7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -193,9 +193,6 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
!tdp_mmu_root_match((_root), (_types)))) { \
} else
-#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_ALL_ROOTS)
-
#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \
__for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_VALID_ROOTS)
@@ -1210,12 +1207,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return ret;
}
+/* Used by mmu notifier via kvm_unmap_gfn_range() */
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
+ enum kvm_tdp_mmu_root_types types;
struct kvm_mmu_page *root;
- __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, KVM_ALL_ROOTS)
+ types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types)
flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
range->may_block, flush);
@@ -1229,15 +1230,18 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
struct kvm_gfn_range *range,
tdp_handler_t handler)
{
+ enum kvm_tdp_mmu_root_types types;
struct kvm_mmu_page *root;
struct tdp_iter iter;
bool ret = false;
+ types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
/*
* Don't support rescheduling, none of the MMU notifiers that funnel
* into this helper allow blocking; it'd be dead, wasteful code.
*/
- for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+ __for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
rcu_read_lock();
tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end)
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 15/17] KVM: x86/tdp_mmu: Propagate building mirror page tables
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (13 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 14/17] KVM: x86/tdp_mmu: Propagate attr_filter to MMU notifier callbacks Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-20 5:15 ` Binbin Wu
2024-06-19 22:36 ` [PATCH v3 16/17] KVM: x86/tdp_mmu: Propagate tearing down " Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots() Rick Edgecombe
16 siblings, 1 reply; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Integrate hooks for mirroring page table operations for cases where TDX
will set PTEs or link page tables.
Like other CoCo technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.
Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from any actual guest
mapping. Now add plumbing to propagate changes to the "external" page
tables being mirrored. Just create the x86_ops for now, leave plumbing the
operations into the TDX module for future patches.
Add two operations for setting up external page tables, one for linking
new page tables and one for setting leaf PTEs. Don't add any op for
configuring the root PFN, as TDX handles this itself. Don't provide a
way to set permissions on the PTEs also, as TDX doesn't support it.
This results in MMU "mirroring" support that is very targeted towards TDX.
Since it is likely there will be no other user, the main benefit of making
the support generic is to keep TDX specific *looking* code outside of the
MMU. As a generic feature it will still make enough sense from TDX's
perspective. For developers unfamiliar with the TDX architecture, it
expresses the general concepts such that they can continue to work in the
code.
TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For setting/linking, only
hook tdp_mmu_set_spte_atomic() which is used for mapping and linking
PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for
setting PTEs in operations unsupported by TDX: splitting huge pages and
write protecting. Sprinkle KVM_BUG_ON()s to document, as code, that these
paths are not supported for mirrored page tables. For zapping operations,
leave those for near future changes.
Many operations in the TDP MMU depend on atomicity of the PTE update.
While the mirror PTE on KVM's side can be updated atomically, the update
that happens inside the external operations (S-EPT updates via TDX module
call) can't happen atomically with the mirror update. The following race
could result during two vCPU's populating private memory:
* vcpu 1: atomically update 2M level mirror EPT entry to be present
* vcpu 2: read 2M level EPT entry that is present
* vcpu 2: walk down into 4K level EPT
* vcpu 2: atomically update 4K level mirror EPT entry to be present
* vcpu 2: set_external_spte() to update 4K secure EPT entry => error
because the 2M secure EPT entry is not populated yet
* vcpu 1: link_external_spt() to update 2M secure EPT entry
Prevent this by setting the mirror PTE to FROZEN_SPTE while the external
operations are being performed. Only write the actual mirror PTE value once
the external operations have completed. When trying to set a PTE to present
and encountering a frozen SPTE, retry the fault.
By doing this the race is prevented as follows:
* vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
* vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
* vcpu 2: find that the EPT entry is frozen
abandon page table walk to resume guest execution
* vcpu 1: link_external_spt() to update 2M secure EPT entry
* vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
* vcpu 2: resume guest execution
Depending on vcpu 1 state, vcpu 2 may result in EPT violation
again or make progress on guest execution
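The freeze/unfreeze sequence can be sketched as a small userspace model.
FROZEN, set_external(), and the return values below are illustrative
stand-ins (the kernel uses FROZEN_SPTE, the external x86 ops, and -EBUSY),
not actual KVM code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NOT_PRESENT	((uint64_t)0)
#define FROZEN		((uint64_t)1)	/* stand-in for FROZEN_SPTE */

/* Stand-in for the slow, non-atomic external (TDX module) update. */
static int set_external(uint64_t new_val)
{
	(void)new_val;
	return 0;
}

/*
 * Returns 0 on success, -1 (-EBUSY in the kernel) if another walker
 * changed or froze the PTE first, in which case the fault is retried.
 */
static int set_spte_present(_Atomic uint64_t *sptep, uint64_t old_val,
			    uint64_t new_val)
{
	/* 1. Freeze: lock out concurrent atomic updaters. */
	if (!atomic_compare_exchange_strong(sptep, &old_val, FROZEN))
		return -1;

	/* 2. Perform the non-atomic external update while frozen. */
	if (set_external(new_val)) {
		atomic_store(sptep, old_val);	/* roll back on error */
		return -1;
	}

	/* 3. Unfreeze by publishing the real mirror PTE value. */
	atomic_store(sptep, new_val);
	return 0;
}
```

In the race above, vcpu 2's cmpxchg against the frozen entry fails, so it
abandons the walk instead of calling into the TDX module with a
half-populated parent.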
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Rename mirrored->external (Paolo)
- Better comment on logic that bugs if doing tdp_mmu_set_spte() on
present PTE. (Paolo)
- Move zapping KVM_BUG_ON() to proper patch
- Use spte_to_child_sp() (Paolo)
- Drop unnecessary comment in __tdp_mmu_set_spte_atomic() (Paolo)
- Rename pfn->pfn_for_gfn to match remove_external_pte in next patch.
- Rename REMOVED_SPTE to FROZEN_SPTE (Paolo)
TDX MMU Prep v2:
- Split from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU"
- Rename x86_ops from "private" to "reflect"
- In response to "sp->mirrored_spt" rename helpers to "mirrored"
- Drop unused old_pfn and new_pfn in handle_changed_spte()
- Drop redundant is_shadow_present_pte() check in __tdp_mmu_set_spte_atomic
- Adjust some warnings and KVM_BUG_ONs
---
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 7 +++
arch/x86/kvm/mmu/tdp_mmu.c | 98 ++++++++++++++++++++++++++----
3 files changed, 94 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 566d19b02483..3ef19fcb5e42 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,6 +95,8 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(link_external_spt)
+KVM_X86_OP_OPTIONAL(set_external_spte)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d67e88a69fc4..12ff04135a0e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1738,6 +1738,13 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
+ /* Update external mapping with page table link */
+ int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *mirrored_spt);
+ /* Update the external page table from spte getting set */
+ int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn_for_gfn);
+
bool (*has_wbinvd_exit)(void);
u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 582e5a045bb7..bc1ad127046d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -244,7 +244,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
struct kvm_mmu_page *root;
if (mirror)
- role.is_mirror = 1;
+ role.is_mirror = true;
/*
* Check for an existing root before acquiring the pages lock to avoid
@@ -440,6 +440,59 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
+static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
+{
+ if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+ struct kvm_mmu_page *sp = spte_to_child_sp(new_spte);
+
+ WARN_ON_ONCE(sp->role.level + 1 != level);
+ WARN_ON_ONCE(sp->gfn != gfn);
+ return sp->external_spt;
+ }
+
+ return NULL;
+}
+
+static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+ gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool is_present = is_shadow_present_pte(new_spte);
+ bool is_leaf = is_present && is_last_spte(new_spte, level);
+ kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+ int ret = 0;
+
+ KVM_BUG_ON(was_present, kvm);
+
+ lockdep_assert_held(&kvm->mmu_lock);
+ /*
+ * We need to lock out other updates to the SPTE until the external
+ * page table has been modified. Use FROZEN_SPTE similar to
+ * the zapping case.
+ */
+ if (!try_cmpxchg64(sptep, &old_spte, FROZEN_SPTE))
+ return -EBUSY;
+
+ /*
+ * Use a different op to either link a middle level external
+ * page table, or set a leaf PTE.
+ */
+ if (is_leaf) {
+ ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
+ } else {
+ void *mirrored_spt = get_external_spt(gfn, new_spte, level);
+
+ KVM_BUG_ON(!mirrored_spt, kvm);
+ ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, mirrored_spt);
+ }
+ if (ret)
+ __kvm_tdp_mmu_write_spte(sptep, old_spte);
+ else
+ __kvm_tdp_mmu_write_spte(sptep, new_spte);
+ return ret;
+}
+
/**
* handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
@@ -548,7 +601,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
}
-static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
+static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte)
{
u64 *sptep = rcu_dereference(iter->sptep);
@@ -560,15 +613,25 @@ static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
*/
WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
- /*
- * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
- * does not hold the mmu_lock. On failure, i.e. if a different logical
- * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
- * the current value, so the caller operates on fresh data, e.g. if it
- * retries tdp_mmu_set_spte_atomic()
- */
- if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
- return -EBUSY;
+ if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
+ int ret;
+
+ ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
+ iter->old_spte, new_spte, iter->level);
+ if (ret)
+ return ret;
+ } else {
+ /*
+ * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
+ * and does not hold the mmu_lock. On failure, i.e. if a
+ * different logical CPU modified the SPTE, try_cmpxchg64()
+ * updates iter->old_spte with the current value, so the caller
+ * operates on fresh data, e.g. if it retries
+ * tdp_mmu_set_spte_atomic()
+ */
+ if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+ return -EBUSY;
+ }
return 0;
}
@@ -598,7 +661,7 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
lockdep_assert_held_read(&kvm->mmu_lock);
- ret = __tdp_mmu_set_spte_atomic(iter, new_spte);
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
if (ret)
return ret;
@@ -623,7 +686,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* Delay processing of the zapped SPTE until after TLBs are flushed and
* the FROZEN_SPTE is replaced (see below).
*/
- ret = __tdp_mmu_set_spte_atomic(iter, FROZEN_SPTE);
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
+
if (ret)
return ret;
@@ -680,6 +744,14 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+
+ /*
+ * Users that do non-atomic setting of PTEs don't operate on mirror
+ * roots, so don't handle it and bug the VM if it's seen.
+ */
+ if (is_mirror_sptep(sptep))
+ KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
+
return old_spte;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 16/17] KVM: x86/tdp_mmu: Propagate tearing down mirror page tables
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (14 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 15/17] KVM: x86/tdp_mmu: Propagate building mirror page tables Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-20 8:44 ` Binbin Wu
2024-06-19 22:36 ` [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots() Rick Edgecombe
16 siblings, 1 reply; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata
From: Isaku Yamahata <isaku.yamahata@intel.com>
Integrate hooks for mirroring page table operations for cases where TDX
will zap PTEs or free page tables.
Like other CoCo technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.
Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from the guest. Now add
plumbing to propagate changes to the "external" page tables being
mirrored. Just create the x86_ops for now, leave plumbing the operations
into the TDX module for future patches.
Add two operations for tearing down page tables, one for freeing page
tables (free_external_spt) and one for zapping PTEs (remove_external_spte).
Define them such that remove_external_spte will perform a TLB flush as
well (in TDX terms, "ensure there are no active translations").
TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For zapping/freeing, only
hook tdp_mmu_iter_set_spte(), which is used for zapping PTEs and tearing
down PTs. Don't bother hooking tdp_mmu_set_spte_atomic() as it is only used for
zapping PTEs in operations unsupported by TDX: zapping collapsible PTEs and
kvm_mmu_zap_all_fast().
In previous changes to address races around concurrent populating using
tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set
FROZEN_SPTE in the mirrored page tables while performing the external
operations. Such a solution is not needed for the tear down paths in TDX
as these will always be performed with the mmu_lock held for write.
Sprinkle some KVM_BUG_ON()s to reflect this.
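The ordering constraint above (zap leaf SPTEs first, then free the external
page table, and intentionally leak it on failure) can be sketched as a
userspace model. The op names, the toy 4-entry table, and
teardown_mirror_sp() are illustrative stand-ins, not KVM code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the TDX-backed external x86 ops. */
static int remove_external_spte_op(size_t idx) { (void)idx; return 0; }
static int free_external_spt_op(void *spt)     { (void)spt; return 0; }

struct mirror_sp {
	void *external_spt;	/* page backing the external (S-EPT) table */
	int present[4];		/* toy 4-entry table of leaf SPTEs */
};

static void teardown_mirror_sp(struct mirror_sp *sp)
{
	size_t i;

	/*
	 * Zap leaves first; each successful removal also flushes the
	 * TLB per the op's contract ("no active translations").
	 */
	for (i = 0; i < 4; i++) {
		if (!sp->present[i])
			continue;
		if (remove_external_spte_op(i) == 0)
			sp->present[i] = 0;
	}

	/* Only then free the page table page itself. */
	if (free_external_spt_op(sp->external_spt) == 0)
		free(sp->external_spt);
	/*
	 * On failure the page is intentionally leaked so the kernel
	 * never touches encrypted memory; either way drop the pointer.
	 */
	sp->external_spt = NULL;
}
```

The kernel versions of these paths additionally assert that the mmu_lock is
held for write, which is why no freezing is needed here.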
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Rename mirrored->external (Paolo)
- Drop new_spte arg from reflect_removed_spte() (Paolo)
- ...and drop was_present and is_present bools (Paolo)
- Use base_gfn instead of sp->gfn (Paolo)
- Better comment on logic that bugs if doing tdp_mmu_set_spte() on
present PTE. (Paolo)
- Move comment around KVM_BUG_ON() in __tdp_mmu_set_spte_atomic() to this
patch, and add better comment. (Paolo)
- In remove_external_spte(), remove was_leaf bool, skip duplicates
present check and add comment.
- Rename REMOVED_SPTE to FROZEN_SPTE (Paolo)
TDX MMU Prep v2:
- Split from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU"
- Rename x86_ops from "private" to "reflect"
- In response to "sp->mirrored_spt" rename helpers to "mirrored"
- Remove unused present mirroring support in tdp_mmu_set_spte()
- Merge reflect_zap_spte() into reflect_remove_spte()
- Move mirror zapping logic out of handle_changed_spte()
- Add some KVM_BUG_ONs
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 8 +++++
arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++++++-
3 files changed, 60 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 3ef19fcb5e42..18a83b211c90 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -97,6 +97,8 @@ KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP_OPTIONAL(link_external_spt)
KVM_X86_OP_OPTIONAL(set_external_spte)
+KVM_X86_OP_OPTIONAL(free_external_spt)
+KVM_X86_OP_OPTIONAL(remove_external_spte)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 12ff04135a0e..dca623ffa903 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1745,6 +1745,14 @@ struct kvm_x86_ops {
int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
kvm_pfn_t pfn_for_gfn);
+ /* Update external page tables for page table about to be freed */
+ int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *external_spt);
+
+ /* Update external page table from spte getting removed, and flush TLB */
+ int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn_for_gfn);
+
bool (*has_wbinvd_exit)(void);
u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bc1ad127046d..630e6b6d4bf2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -340,6 +340,29 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}
+static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+ int level)
+{
+ kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+ int ret;
+
+ /*
+ * External (TDX) SPTEs are limited to PG_LEVEL_4K, and external
+ * PTs are removed in a special order, involving free_external_spt().
+ * But remove_external_spte() will be called on non-leaf PTEs via
+ * __tdp_mmu_zap_root(), so avoid the error the former would return
+ * in this case.
+ */
+ if (!is_last_spte(old_spte, level))
+ return;
+
+ /* Zapping leaf spte is allowed only when write lock is held. */
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ /* Because the write lock is held, the operation should succeed. */
+ ret = static_call(kvm_x86_remove_external_spte)(kvm, gfn, level, old_pfn);
+ KVM_BUG_ON(ret, kvm);
+}
+
/**
* handle_removed_pt() - handle a page table removed from the TDP structure
*
@@ -435,6 +458,23 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
old_spte, FROZEN_SPTE, level, shared);
+
+ if (is_mirror_sp(sp)) {
+ KVM_BUG_ON(shared, kvm);
+ remove_external_spte(kvm, gfn, old_spte, level);
+ }
+ }
+
+ if (is_mirror_sp(sp) &&
+ WARN_ON(static_call(kvm_x86_free_external_spt)(kvm, base_gfn, sp->role.level,
+ sp->external_spt))) {
+ /*
+ * Failed to free the external page table page backing the
+ * mirror page table, and there is nothing further to do.
+ * Intentionally leak the page to prevent the kernel from
+ * accessing the encrypted page.
+ */
+ sp->external_spt = NULL;
}
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
@@ -616,6 +656,13 @@ static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *it
if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
int ret;
+ /*
+ * Users of atomic zapping don't operate on mirror roots,
+ * so don't handle it and bug the VM if it's seen.
+ */
+ if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
+ return -EBUSY;
+
ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
iter->old_spte, new_spte, iter->level);
if (ret)
@@ -749,8 +796,10 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
* Users that do non-atomic setting of PTEs don't operate on mirror
* roots, so don't handle it and bug the VM if it's seen.
*/
- if (is_mirror_sptep(sptep))
+ if (is_mirror_sptep(sptep)) {
KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
+ remove_external_spte(kvm, gfn, old_spte, level);
+ }
return old_spte;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
` (15 preceding siblings ...)
2024-06-19 22:36 ` [PATCH v3 16/17] KVM: x86/tdp_mmu: Propagate tearing down " Rick Edgecombe
@ 2024-06-19 22:36 ` Rick Edgecombe
2024-06-21 7:10 ` Yan Zhao
16 siblings, 1 reply; 47+ messages in thread
From: Rick Edgecombe @ 2024-06-19 22:36 UTC (permalink / raw)
To: seanjc, pbonzini, kvm
Cc: kai.huang, dmatlack, erdemaktas, isaku.yamahata, linux-kernel,
sagis, yan.y.zhao, rick.p.edgecombe, Isaku Yamahata, Chao Gao
From: Isaku Yamahata <isaku.yamahata@intel.com>
Rename kvm_tdp_mmu_invalidate_all_roots() to
kvm_tdp_mmu_invalidate_roots(), and make it take an
enum kvm_tdp_mmu_root_types argument.
kvm_tdp_mmu_invalidate_roots() is called with different root types. For
kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
down a VM it needs to invalidate all roots. Have the callers only
invalidate the required roots instead of all roots.
Within kvm_tdp_mmu_invalidate_roots(), respect the root types
passed by checking the root type in the root iterator.
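The per-type invalidation can be sketched with a userspace model. The
struct, enum, and invalidate_roots() below are illustrative stand-ins for
KVM's root bookkeeping, not the actual implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for KVM's root-type bits. */
enum root_types {
	DIRECT_ROOT  = 1 << 0,
	MIRROR_ROOT  = 1 << 1,
	INVALID_ROOT = 1 << 2,
};

struct root {
	int type;
	int invalid;
};

/* Invalidate only the roots whose type is in the requested mask. */
static int invalidate_roots(struct root *roots, size_t nr, int root_types)
{
	int invalidated = 0;
	size_t i;

	/* Invalidating already-invalid roots makes no sense; mask it off. */
	root_types &= ~INVALID_ROOT;

	for (i = 0; i < nr; i++) {
		if (!(roots[i].type & root_types))
			continue;	/* skip non-matching root types */
		roots[i].invalid = 1;
		invalidated++;
	}
	return invalidated;
}
```

With this shape, kvm_mmu_zap_all_fast() can pass only the direct-root type
while VM teardown passes all valid root types.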
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
- Use root enum instead of process enum (Paolo)
- Squash with "KVM: x86/tdp_mmu: Invalidate correct roots" (Paolo)
- Update comment in kvm_mmu_zap_all_fast() (Paolo)
- Add warning for attempting to invalidate invalid roots (Paolo)
TDX MMU Prep v2:
- Use process enum instead of root
TDX MMU Prep:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 9 +++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 3 ++-
3 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 287dcc2685e4..c5255aa62146 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6414,8 +6414,13 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* write and in the same critical section as making the reload request,
* e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
*/
- if (tdp_mmu_enabled)
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ if (tdp_mmu_enabled) {
+ /*
+ * External page tables don't support fast zapping, therefore
+ * their mirrors must be invalidated separately by the caller.
+ */
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_DIRECT_ROOTS);
+ }
/*
* Notify all vcpus to reload its shadow page table and flush TLB.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 630e6b6d4bf2..a1ab67a4f41f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
* for zapping and thus puts the TDP MMU's reference to each root, i.e.
* ultimately frees all roots.
*/
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
kvm_tdp_mmu_zap_invalidated_roots(kvm);
WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
@@ -1110,10 +1110,18 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
* See kvm_tdp_mmu_alloc_root().
*/
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+ enum kvm_tdp_mmu_root_types root_types)
{
struct kvm_mmu_page *root;
+ /*
+ * Invalidating invalid roots doesn't make sense; prevent developers
+ * from having to think about it.
+ */
+ if (WARN_ON_ONCE(root_types & KVM_INVALID_ROOTS))
+ root_types &= ~KVM_INVALID_ROOTS;
+
/*
* mmu_lock must be held for write to ensure that a root doesn't become
* invalid while there are active readers (invalidating a root while
@@ -1135,6 +1143,9 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
* or get/put references to roots.
*/
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!tdp_mmu_root_match(root, root_types))
+ continue;
+
/*
* Note, invalid roots can outlive a memslot update! Invalid
* roots must be *zapped* before the memslot update completes,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 2903f03a34be..56741d31048a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -66,7 +66,8 @@ static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+ enum kvm_tdp_mmu_root_types root_types);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH v3 15/17] KVM: x86/tdp_mmu: Propagate building mirror page tables
2024-06-19 22:36 ` [PATCH v3 15/17] KVM: x86/tdp_mmu: Propagate building mirror page tables Rick Edgecombe
@ 2024-06-20 5:15 ` Binbin Wu
2024-06-24 23:52 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Binbin Wu @ 2024-06-20 5:15 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, erdemaktas,
isaku.yamahata, linux-kernel, sagis, yan.y.zhao, Isaku Yamahata
On 6/20/2024 6:36 AM, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Integrate hooks for mirroring page table operations for cases where TDX
> will set PTEs or link page tables.
>
> Like other Coco technologies, TDX has the concept of private and shared
> memory. For TDX the private and shared mappings are managed on separate
> EPT roots. The private half is managed indirectly though calls into a
> protected runtime environment called the TDX module, where the shared half
> is managed within KVM in normal page tables.
>
> Since calls into the TDX module are relatively slow, walking private page
> tables by making calls into the TDX module would not be efficient. Because
> of this, previous changes have taught the TDP MMU to keep a mirror root,
> which is separate, unmapped TDP root that private operations can be
> directed to. Currently this root is disconnected from any actual guest
> mapping. Now add plumbing to propagate changes to the "external" page
> tables being mirrored. Just create the x86_ops for now, leave plumbing the
> operations into the TDX module for future patches.
>
> Add two operations for setting up external page tables, one for linking
> new page tables and one for setting leaf PTEs. Don't add any op for
> configuring the root PFN, as TDX handles this itself. Don't provide a
> way to set permissions on the PTEs also, as TDX doesn't support it.
>
> This results is MMU "mirroring" support that is very targeted towards TDX.
^
extra "is"?
> Since it is likely there will be no other user, the main benefit of making
> the support generic is to keep TDX specific *looking* code outside of the
> MMU. As a generic feature it will make enough sense from TDX's
> perspective. For developers unfamiliar with TDX arch it can express the
> general concepts such that they can continue to work in the code.
>
> TDX MMU support will exclude certain MMU operations, so only plug in the
> mirroring x86 ops where they will be needed. For setting/linking, only
> hook tdp_mmu_set_spte_atomic() which is use used for mapping and linking
> PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for
> setting PTEs in operations unsupported by TDX: splitting huge pages and
> write protecting. Sprinkle a KVM_BUG_ON()s to document as code that these
^
extra "a"
> paths are not supported for mirrored page tables. For zapping operations,
> leave those for near future changes.
>
> Many operations in the TDP MMU depend on atomicity of the PTE update.
> While the mirror PTE on KVM's side can be updated atomically, the update
> that happens inside the external operations (S-EPT updates via TDX module
> call) can't happen atomically with the mirror update. The following race
> could result during two vCPU's populating private memory:
>
> * vcpu 1: atomically update 2M level mirror EPT entry to be present
> * vcpu 2: read 2M level EPT entry that is present
> * vcpu 2: walk down into 4K level EPT
> * vcpu 2: atomically update 4K level mirror EPT entry to be present
> * vcpu 2: set_external_spte() to update 4K secure EPT entry => error
> because 2M secure EPT entry is not populated yet
> * vcpu 1: link_external_spt() to update 2M secure EPT entry
>
> Prevent this by setting the mirror PTE to FROZEN_SPTE while the reflect
> operations are performed. Only write the actual mirror PTE value once the
^
maybe "are being"
> reflect operations have completed. When trying to set a PTE to present and
> encountering a removed SPTE, retry the fault.
^
frozen
>
> By doing this the race is prevented as follows:
> * vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
> * vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
> * vcpu 2: find that the EPT entry is frozen
> abandon page table walk to resume guest execution
> * vcpu 1: link_external_spt() to update 2M secure EPT entry
> * vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
> * vcpu 2: resume guest execution
> Depending on vcpu 1's state, vcpu 2 may hit the EPT violation
> again or make progress in guest execution
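The freeze protocol above can be sketched in isolation. This is a simplified, hypothetical model, not the actual tdp_mmu code: C11 atomics stand in for KVM's cmpxchg helpers, `set_external_spte_stub()` stands in for the slow TDX module call, and the FROZEN_SPTE value is arbitrary:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FROZEN_SPTE ((uint64_t)1)  /* stand-in for KVM's FROZEN_SPTE marker */

/* Hypothetical stand-in for the external (S-EPT) update; it cannot happen
 * atomically with the mirror update, which is why the freeze is needed. */
static int set_external_spte_stub(uint64_t new_spte)
{
	(void)new_spte;
	return 0; /* pretend the TDX module call succeeded */
}

/*
 * Sketch of the pattern used for mirror roots:
 * 1) cmpxchg the mirror PTE to FROZEN_SPTE,
 * 2) perform the slow external update while the PTE is frozen,
 * 3) publish the real value to unfreeze.
 * A concurrent walker that reads FROZEN_SPTE abandons the walk and
 * retries the fault, which closes the vcpu1/vcpu2 race above.
 */
static int set_mirror_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
			   uint64_t new_spte)
{
	/* Step 1: freeze. Fails if another vCPU raced in first. */
	if (!atomic_compare_exchange_strong(sptep, &old_spte, FROZEN_SPTE))
		return -1; /* caller retries the fault */

	/* Step 2: external update happens while the PTE is frozen. */
	if (set_external_spte_stub(new_spte)) {
		atomic_store(sptep, old_spte); /* roll back on error */
		return -2;
	}

	/* Step 3: unfreeze by publishing the real PTE value. */
	atomic_store(sptep, new_spte);
	return 0;
}
```

A racing vCPU that loses the cmpxchg in step 1 simply sees the freeze fail (or reads FROZEN_SPTE on its walk) and retries, rather than issuing an external update against a half-built S-EPT.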
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Prep v3:
> - Rename mirrored->external (Paolo)
> - Better comment on logic that bugs if doing tdp_mmu_set_spte() on
> present PTE. (Paolo)
> - Move zapping KVM_BUG_ON() to proper patch
> - Use spte_to_child_sp() (Paolo)
> - Drop unnecessary comment in __tdp_mmu_set_spte_atomic() (Paolo)
> - Rename pfn->pfn_for_gfn to match remove_external_pte in next patch.
> - Rename REMOVED_SPTE to FROZEN_SPTE (Paolo)
>
> TDX MMU Prep v2:
> - Split from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU"
> - Rename x86_ops from "private" to "reflect"
> - In response to "sp->mirrored_spt" rename helpers to "mirrored"
> - Drop unused old_pfn and new_pfn in handle_changed_spte()
> - Drop redundant is_shadow_present_pte() check in __tdp_mmu_set_spte_atomic
> - Adjust some warnings and KVM_BUG_ONs
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 2 +
> arch/x86/include/asm/kvm_host.h | 7 +++
> arch/x86/kvm/mmu/tdp_mmu.c | 98 ++++++++++++++++++++++++++----
> 3 files changed, 94 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 566d19b02483..3ef19fcb5e42 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -95,6 +95,8 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP_OPTIONAL(link_external_spt)
> +KVM_X86_OP_OPTIONAL(set_external_spte)
> KVM_X86_OP(has_wbinvd_exit)
> KVM_X86_OP(get_l2_tsc_offset)
> KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d67e88a69fc4..12ff04135a0e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1738,6 +1738,13 @@ struct kvm_x86_ops {
> void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int root_level);
>
> + /* Update external mapping with page table link */
Nit: Add "." at the end of the sentence.
> + int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *mirrored_spt);
> + /* Update the external page table from spte getting set */
Ditto.
> + int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn_for_gfn);
> +
> bool (*has_wbinvd_exit)(void);
>
> u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 582e5a045bb7..bc1ad127046d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -244,7 +244,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> struct kvm_mmu_page *root;
>
> if (mirror)
> - role.is_mirror = 1;
> + role.is_mirror = true;
If this is needed, it's better to be done in "KVM: x86/tdp_mmu: Support
mirror root for TDP MMU".
>
> /*
> * Check for an existing root before acquiring the pages lock to avoid
> @@ -440,6 +440,59 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
>
[...]
* Re: [PATCH v3 16/17] KVM: x86/tdp_mmu: Propagate tearing down mirror page tables
2024-06-19 22:36 ` [PATCH v3 16/17] KVM: x86/tdp_mmu: Propagate tearing down " Rick Edgecombe
@ 2024-06-20 8:44 ` Binbin Wu
2024-06-24 23:55 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Binbin Wu @ 2024-06-20 8:44 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, erdemaktas,
isaku.yamahata, linux-kernel, sagis, yan.y.zhao, Isaku Yamahata
On 6/20/2024 6:36 AM, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Integrate hooks for mirroring page table operations for cases where TDX
> will zap PTEs or free page tables.
>
> Like other CoCo technologies, TDX has the concept of private and shared
> memory. For TDX the private and shared mappings are managed on separate
> EPT roots. The private half is managed indirectly through calls into a
> protected runtime environment called the TDX module, while the shared half
> is managed within KVM in normal page tables.
>
> Since calls into the TDX module are relatively slow, walking private page
> tables by making calls into the TDX module would not be efficient. Because
> of this, previous changes have taught the TDP MMU to keep a mirror root,
> which is a separate, unmapped TDP root that private operations can be
> directed to. Currently this root is disconnected from the guest. Now add
> plumbing to propagate changes to the "external" page tables being
> mirrored. Just create the x86_ops for now, leave plumbing the operations
> into the TDX module for future patches.
>
> Add two operations for tearing down page tables, one for freeing page
> tables (free_external_spt) and one for zapping PTEs (remove_external_spte).
> Define them such that remove_external_spte will perform a TLB flush as
> well (in TDX terms, "ensure there are no active translations").
>
> TDX MMU support will exclude certain MMU operations, so only plug in the
> mirroring x86 ops where they will be needed. For zapping/freeing, only
> hook tdp_mmu_iter_set_spte() which is use used for mapping and linking
^
extra "use"
Also, this sentence is a bit confusing about "used for mapping and linking".
> PTs. Don't bother hooking tdp_mmu_set_spte_atomic() as it is only used for
> zapping PTEs in operations unsupported by TDX: zapping collapsible PTEs and
> kvm_mmu_zap_all_fast().
>
> In previous changes to address races around concurrent populating using
> tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set
> REMOVED_SPTE in the mirrored page tables while performing the external
^
FROZEN_SPTE
> operations. Such a solution is not needed for the tear down paths in TDX
> as these will always be performed with the mmu_lock held for write.
> Sprinkle some KVM_BUG_ON()s to reflect this.
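The locking invariant those KVM_BUG_ON()s encode can be sketched as follows. This is a hypothetical, self-contained model: the stub names and the bool standing in for mmu_lock are illustrative, not the real KVM code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two teardown x86 ops; the real ones call
 * into the TDX module, and remove_external_spte() also flushes the TLB. */
static bool mmu_lock_held_for_write;
static int removed_entries;
static bool table_freed;

static int remove_external_spte_stub(int idx)
{
	(void)idx;
	/* Teardown never needs the FROZEN_SPTE dance because populate
	 * can't race: the caller must hold mmu_lock for write. This is
	 * the invariant the KVM_BUG_ON()s document in code. */
	assert(mmu_lock_held_for_write);
	removed_entries++;
	return 0;
}

static int free_external_spt_stub(void)
{
	assert(mmu_lock_held_for_write);
	/* Safe only once every leaf below this page table is zapped. */
	table_freed = true;
	return 0;
}

/* Sketch of the ordering: under the write lock, zap all leaf SPTEs
 * (each with its TLB flush), then free the external page table. */
static void teardown_external_pt(int nr_entries)
{
	mmu_lock_held_for_write = true;	/* write_lock(&kvm->mmu_lock) */
	for (int i = 0; i < nr_entries; i++)
		remove_external_spte_stub(i);
	free_external_spt_stub();
	mmu_lock_held_for_write = false;
}
```

Because the write lock excludes the atomic populate path entirely, there is no window in which another vCPU could observe a half-torn-down table, so no temporary frozen value is required.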
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Prep v3:
> - Rename mirrored->external (Paolo)
> - Drop new_spte arg from reflect_removed_spte() (Paolo)
> - ...and drop was_present and is_present bools (Paolo)
> - Use base_gfn instead of sp->gfn (Paolo)
> - Better comment on logic that bugs if doing tdp_mmu_set_spte() on
> present PTE. (Paolo)
> - Move comment around KVM_BUG_ON() in __tdp_mmu_set_spte_atomic() to this
> patch, and add better comment. (Paolo)
> - In remove_external_spte(), remove was_leaf bool, skip duplicates
> present check and add comment.
> - Rename REMOVED_SPTE to FROZEN_SPTE (Paolo)
>
> TDX MMU Prep v2:
> - Split from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU"
> - Rename x86_ops from "private" to "reflect"
> - In response to "sp->mirrored_spt" rename helpers to "mirrored"
> - Remove unused present mirroring support in tdp_mmu_set_spte()
> - Merge reflect_zap_spte() into reflect_remove_spte()
> - Move mirror zapping logic out of handle_changed_spte()
> - Add some KVM_BUG_ONs
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 2 ++
> arch/x86/include/asm/kvm_host.h | 8 +++++
> arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++++++-
> 3 files changed, 60 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 3ef19fcb5e42..18a83b211c90 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -97,6 +97,8 @@ KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> KVM_X86_OP(load_mmu_pgd)
> KVM_X86_OP_OPTIONAL(link_external_spt)
> KVM_X86_OP_OPTIONAL(set_external_spte)
> +KVM_X86_OP_OPTIONAL(free_external_spt)
> +KVM_X86_OP_OPTIONAL(remove_external_spte)
> KVM_X86_OP(has_wbinvd_exit)
> KVM_X86_OP(get_l2_tsc_offset)
> KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 12ff04135a0e..dca623ffa903 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1745,6 +1745,14 @@ struct kvm_x86_ops {
> int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> kvm_pfn_t pfn_for_gfn);
>
> + /* Update external page tables for page table about to be freed */
Nit: Add "." at the end of the sentence.
> + int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *external_spt);
> +
> + /* Update external page table from spte getting removed, and flush TLB */
Ditto
> + int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn_for_gfn);
> +
> bool (*has_wbinvd_exit)(void);
>
> u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
[...]
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-19 22:36 ` [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots() Rick Edgecombe
@ 2024-06-21 7:10 ` Yan Zhao
2024-06-21 19:08 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-06-21 7:10 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, erdemaktas,
isaku.yamahata, linux-kernel, sagis, Isaku Yamahata, Chao Gao
On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 630e6b6d4bf2..a1ab67a4f41f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> * for zapping and thus puts the TDP MMU's reference to each root, i.e.
> * ultimately frees all roots.
> */
> - kvm_tdp_mmu_invalidate_all_roots(kvm);
> + kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
all roots (mirror + direct) are invalidated here.
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
kvm_tdp_mmu_zap_invalidated_roots() will zap the invalidated mirror root with
mmu_lock held for read, which should trigger the KVM_BUG_ON() in
__tdp_mmu_set_spte_atomic(), which assumes "atomic zapping doesn't operate on
mirror roots".
But so far, the KVM_BUG_ON() is not triggered because
kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in the
call trace below), and kvm_arch_flush_shadow_all() in kvm_mmu_notifier_release()
has zapped all mirror SPTEs before kvm_mmu_uninit_vm() is called in
kvm_destroy_vm().
kvm_mmu_notifier_release
kvm_flush_shadow_all
kvm_arch_flush_shadow_all
static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
kvm_mmu_zap_all ==>hold mmu_lock for write
kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write
kvm_destroy_vm
kvm_arch_destroy_vm
kvm_mmu_uninit_vm
kvm_mmu_uninit_tdp_mmu
kvm_tdp_mmu_invalidate_roots ==>invalid all KVM_VALID_ROOTS
kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held for read
A question: why does kvm_mmu_notifier_release(), as a callback of the primary
MMU notifier, zap the mirrored TDP when all the other callbacks are filtered
with KVM_FILTER_SHARED?
Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
kvm_mmu_notifier_release() and move the mirrored TDP related stuff from
kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
held for write?
>
> WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-21 7:10 ` Yan Zhao
@ 2024-06-21 19:08 ` Edgecombe, Rick P
2024-06-24 8:29 ` Yan Zhao
2024-07-18 15:28 ` Isaku Yamahata
0 siblings, 2 replies; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-06-21 19:08 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Fri, 2024-06-21 at 15:10 +0800, Yan Zhao wrote:
> On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 630e6b6d4bf2..a1ab67a4f41f 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> > * for zapping and thus puts the TDP MMU's reference to each root,
> > i.e.
> > * ultimately frees all roots.
> > */
> > - kvm_tdp_mmu_invalidate_all_roots(kvm);
> > + kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> all roots (mirror + direct) are invalidated here.
Right.
>
> > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> kvm_tdp_mmu_zap_invalidated_roots() will zap the invalidated mirror root with
> mmu_lock held for read, which should trigger the KVM_BUG_ON() in
> __tdp_mmu_set_spte_atomic(), which assumes "atomic zapping doesn't operate on
> mirror roots".
>
> But up to now, the KVM_BUG_ON() is not triggered because
> kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in
> below
> call trace), and kvm_arch_flush_shadow_all() in kvm_mmu_notifier_release() has
> zapped all mirror SPTEs before kvm_mmu_uninit_vm() called in kvm_destroy_vm().
>
>
> kvm_mmu_notifier_release
> kvm_flush_shadow_all
> kvm_arch_flush_shadow_all
> static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> kvm_mmu_zap_all ==>hold mmu_lock for write
> kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write
>
> kvm_destroy_vm
> kvm_arch_destroy_vm
> kvm_mmu_uninit_vm
> kvm_mmu_uninit_tdp_mmu
> kvm_tdp_mmu_invalidate_roots ==>invalid all KVM_VALID_ROOTS
> kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held
> for read
>
>
> A question: why does kvm_mmu_notifier_release(), as a callback of the primary
> MMU notifier, zap the mirrored TDP when all the other callbacks are filtered
> with KVM_FILTER_SHARED?
>
> Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
> kvm_mmu_notifier_release() and move the mirrored TDP related stuff from
> kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
> held for write?
Sigh, thanks for flagging this. I agree it seems weird to free private memory
from an MMU notifier callback. I also found this old thread where Sean NAKed the
current approach (free hkid in mmu release):
https://lore.kernel.org/kvm/ZN+1QHGa6ltpQxZn@google.com/#t
One challenge is that flush_shadow_all_private() needs to be done before
kvm_destroy_vcpus(), where it gets into tdx_vcpu_free(). So kvm_mmu_uninit_vm()
is too late. Perhaps this is why it was shoved into mmu notifier release (which
happens long before as you noted). Isaku, do you recall any other reasons?
But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus, so we
could maybe actually just do the tdx_mmu_release_hkid() part there. Then drop
the flush_shadow_all_private x86 op. See the (not thoroughly checked) diff at
the bottom of this mail.
But most of what is being discussed is in future patches where it starts to get
into the TDX module interaction. So I wonder if we should drop this patch 17
from "part 1" and include it with the next series so it can all be considered
together.
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-
ops.h
index 2adf36b74910..3927731aa947 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP_OPTIONAL(vm_enable_cap)
KVM_X86_OP(vm_init)
-KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
KVM_X86_OP_OPTIONAL(vm_destroy)
KVM_X86_OP_OPTIONAL(vm_free)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8a72e5873808..8b2b79b39d0f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1647,7 +1647,6 @@ struct kvm_x86_ops {
unsigned int vm_size;
int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
int (*vm_init)(struct kvm *kvm);
- void (*flush_shadow_all_private)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
void (*vm_free)(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1299eb03e63..4deeeac14324 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6446,7 +6446,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* lead to use-after-free.
*/
if (tdp_mmu_enabled)
- kvm_tdp_mmu_zap_invalidated_roots(kvm);
+ kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
}
static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
@@ -6977,13 +6977,6 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- /*
- * kvm_mmu_zap_all() zaps both private and shared page tables. Before
- * tearing down private page tables, TDX requires some TD resources to
- * be destroyed (i.e. keyID must have been reclaimed, etc). Invoke
- * kvm_x86_flush_shadow_all_private() for this.
- */
- static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
kvm_mmu_zap_all(kvm);
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 68dfcdb46ab7..9e8b012aa8cc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -38,7 +38,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
* ultimately frees all roots.
*/
kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
- kvm_tdp_mmu_zap_invalidated_roots(kvm);
+ kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
@@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
*/
lockdep_assert_held_write(&kvm->mmu_lock);
- for_each_tdp_mmu_root_yield_safe(kvm, root)
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
tdp_mmu_zap_root(kvm, root, false);
}
@@ -1065,11 +1065,14 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
* Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
* zap" completes.
*/
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
{
struct kvm_mmu_page *root;
- read_lock(&kvm->mmu_lock);
+ if (shared)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
for_each_tdp_mmu_root_yield_safe(kvm, root) {
if (!root->tdp_mmu_scheduled_root_to_zap)
@@ -1087,7 +1090,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* that may be zapped, as such entries are associated with the
* ASID on both VMX and SVM.
*/
- tdp_mmu_zap_root(kvm, root, true);
+ tdp_mmu_zap_root(kvm, root, shared);
/*
* The referenced needs to be put *after* zapping the root, as
@@ -1097,7 +1100,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
kvm_tdp_mmu_put_root(kvm, root);
}
- read_unlock(&kvm->mmu_lock);
+ if (shared)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
}
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 56741d31048a..7927fa4a96e0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -68,7 +68,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page
*sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
enum kvm_tdp_mmu_root_types root_types);
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index b6828e35eb17..3f9bfcd3e152 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -98,16 +98,12 @@ static int vt_vm_init(struct kvm *kvm)
return vmx_vm_init(kvm);
}
-static void vt_flush_shadow_all_private(struct kvm *kvm)
-{
- if (is_td(kvm))
- tdx_mmu_release_hkid(kvm);
-}
-
static void vt_vm_destroy(struct kvm *kvm)
{
- if (is_td(kvm))
+ if (is_td(kvm)) {
+ tdx_mmu_release_hkid(kvm);
return;
+ }
vmx_vm_destroy(kvm);
}
@@ -980,7 +976,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vm_size = sizeof(struct kvm_vmx),
.vm_enable_cap = vt_vm_enable_cap,
.vm_init = vt_vm_init,
- .flush_shadow_all_private = vt_flush_shadow_all_private,
.vm_destroy = vt_vm_destroy,
.vm_free = vt_vm_free,
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-21 19:08 ` Edgecombe, Rick P
@ 2024-06-24 8:29 ` Yan Zhao
2024-06-24 23:15 ` Edgecombe, Rick P
2024-07-18 15:28 ` Isaku Yamahata
1 sibling, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-06-24 8:29 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Sat, Jun 22, 2024 at 03:08:22AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2024-06-21 at 15:10 +0800, Yan Zhao wrote:
> > On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 630e6b6d4bf2..a1ab67a4f41f 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> > > * for zapping and thus puts the TDP MMU's reference to each root,
> > > i.e.
> > > * ultimately frees all roots.
> > > */
> > > - kvm_tdp_mmu_invalidate_all_roots(kvm);
> > > + kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> > all roots (mirror + direct) are invalidated here.
>
> Right.
> >
> > > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > kvm_tdp_mmu_zap_invalidated_roots() will zap the invalidated mirror root with
> > mmu_lock held for read, which should trigger the KVM_BUG_ON() in
> > __tdp_mmu_set_spte_atomic(), which assumes "atomic zapping doesn't operate on
> > mirror roots".
> >
> > But up to now, the KVM_BUG_ON() is not triggered because
> > kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in
> > below
> > call trace), and kvm_arch_flush_shadow_all() in kvm_mmu_notifier_release() has
> > zapped all mirror SPTEs before kvm_mmu_uninit_vm() called in kvm_destroy_vm().
> >
> >
> > kvm_mmu_notifier_release
> > kvm_flush_shadow_all
> > kvm_arch_flush_shadow_all
> > static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> > kvm_mmu_zap_all ==>hold mmu_lock for write
> > kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write
> >
> > kvm_destroy_vm
> > kvm_arch_destroy_vm
> > kvm_mmu_uninit_vm
> > kvm_mmu_uninit_tdp_mmu
> > kvm_tdp_mmu_invalidate_roots ==>invalid all KVM_VALID_ROOTS
> > kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held
> > for read
> >
> >
> > A question: why does kvm_mmu_notifier_release(), as a callback of the primary
> > MMU notifier, zap the mirrored TDP when all the other callbacks are filtered
> > with KVM_FILTER_SHARED?
> >
> > Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
> > kvm_mmu_notifier_release() and move the mirrored TDP related stuff from
> > kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
> > held for write?
>
> Sigh, thanks for flagging this. I agree it seems weird to free private memory
> from an MMU notifier callback. I also found this old thread where Sean NAKed the
> current approach (free hkid in mmu release):
> https://lore.kernel.org/kvm/ZN+1QHGa6ltpQxZn@google.com/#t
>
> One challenge is that flush_shadow_all_private() needs to be done before
> kvm_destroy_vcpus(), where it gets into tdx_vcpu_free(). So kvm_mmu_uninit_vm()
> is too late. Perhaps this is why it was shoved into mmu notifier release (which
> happens long before as you noted). Isaku, do you recall any other reasons?
>
> But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus, so we
> could maybe actually just do the tdx_mmu_release_hkid() part there. Then drop
> the flush_shadow_all_private x86 op. See the (not thoroughly checked) diff at
> the bottom of this mail.
It looks good to me.
>
> But most of what is being discussed is in future patches where it starts to get
> into the TDX module interaction. So I wonder if we should drop this patch 17
> from "part 1" and include it with the next series so it can all be considered
> together.
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-
> ops.h
> index 2adf36b74910..3927731aa947 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
> KVM_X86_OP(vcpu_after_set_cpuid)
> KVM_X86_OP_OPTIONAL(vm_enable_cap)
> KVM_X86_OP(vm_init)
> -KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
> KVM_X86_OP_OPTIONAL(vm_destroy)
> KVM_X86_OP_OPTIONAL(vm_free)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8a72e5873808..8b2b79b39d0f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1647,7 +1647,6 @@ struct kvm_x86_ops {
> unsigned int vm_size;
> int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
> int (*vm_init)(struct kvm *kvm);
> - void (*flush_shadow_all_private)(struct kvm *kvm);
> void (*vm_destroy)(struct kvm *kvm);
> void (*vm_free)(struct kvm *kvm);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1299eb03e63..4deeeac14324 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6446,7 +6446,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> * lead to use-after-free.
> */
> if (tdp_mmu_enabled)
> - kvm_tdp_mmu_zap_invalidated_roots(kvm);
> + kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
> }
>
> static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> @@ -6977,13 +6977,6 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
>
> void kvm_arch_flush_shadow_all(struct kvm *kvm)
> {
> - /*
> - * kvm_mmu_zap_all() zaps both private and shared page tables. Before
> - * tearing down private page tables, TDX requires some TD resources to
> - * be destroyed (i.e. keyID must have been reclaimed, etc). Invoke
> - * kvm_x86_flush_shadow_all_private() for this.
> - */
> - static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> kvm_mmu_zap_all(kvm);
> }
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 68dfcdb46ab7..9e8b012aa8cc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -38,7 +38,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> * ultimately frees all roots.
> */
> kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> - kvm_tdp_mmu_zap_invalidated_roots(kvm);
> + kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
>
> WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
> */
> lockdep_assert_held_write(&kvm->mmu_lock);
> - for_each_tdp_mmu_root_yield_safe(kvm, root)
> + __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
nit: update the comment of kvm_tdp_mmu_zap_all() and explain why it's
KVM_DIRECT_ROOTS, not KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS.
> tdp_mmu_zap_root(kvm, root, false);
> }
>
> @@ -1065,11 +1065,14 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
> * zap" completes.
> */
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
> {
> struct kvm_mmu_page *root;
>
> - read_lock(&kvm->mmu_lock);
> + if (shared)
> + read_lock(&kvm->mmu_lock);
> + else
> + write_lock(&kvm->mmu_lock);
>
> for_each_tdp_mmu_root_yield_safe(kvm, root) {
> if (!root->tdp_mmu_scheduled_root_to_zap)
> @@ -1087,7 +1090,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> * that may be zapped, as such entries are associated with the
> * ASID on both VMX and SVM.
> */
> - tdp_mmu_zap_root(kvm, root, true);
> + tdp_mmu_zap_root(kvm, root, shared);
>
> /*
> * The referenced needs to be put *after* zapping the root, as
> @@ -1097,7 +1100,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> kvm_tdp_mmu_put_root(kvm, root);
> }
>
> - read_unlock(&kvm->mmu_lock);
> + if (shared)
> + read_unlock(&kvm->mmu_lock);
> + else
> + write_unlock(&kvm->mmu_lock);
> }
>
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 56741d31048a..7927fa4a96e0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -68,7 +68,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page
> *sp);
> void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
> enum kvm_tdp_mmu_root_types root_types);
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
>
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b6828e35eb17..3f9bfcd3e152 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -98,16 +98,12 @@ static int vt_vm_init(struct kvm *kvm)
> return vmx_vm_init(kvm);
> }
>
> -static void vt_flush_shadow_all_private(struct kvm *kvm)
> -{
> - if (is_td(kvm))
> - tdx_mmu_release_hkid(kvm);
> -}
> -
> static void vt_vm_destroy(struct kvm *kvm)
> {
> - if (is_td(kvm))
> + if (is_td(kvm)) {
> + tdx_mmu_release_hkid(kvm);
> return;
> + }
>
> vmx_vm_destroy(kvm);
> }
> @@ -980,7 +976,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vm_size = sizeof(struct kvm_vmx),
> .vm_enable_cap = vt_vm_enable_cap,
> .vm_init = vt_vm_init,
> - .flush_shadow_all_private = vt_flush_shadow_all_private,
> .vm_destroy = vt_vm_destroy,
> .vm_free = vt_vm_free,
>
>
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-06-19 22:36 ` [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU Rick Edgecombe
@ 2024-06-24 8:30 ` Yan Zhao
2024-06-25 0:51 ` Edgecombe, Rick P
2024-07-04 8:51 ` Yan Zhao
1 sibling, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-06-24 8:30 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, erdemaktas,
isaku.yamahata, linux-kernel, sagis, Isaku Yamahata
On Wed, Jun 19, 2024 at 03:36:10PM -0700, Rick Edgecombe wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e9c1783a8743..287dcc2685e4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> int r;
>
> if (tdp_mmu_enabled) {
> - kvm_tdp_mmu_alloc_root(vcpu);
> + if (kvm_has_mirrored_tdp(vcpu->kvm))
> + kvm_tdp_mmu_alloc_root(vcpu, true);
> + kvm_tdp_mmu_alloc_root(vcpu, false);
> return 0;
> }
mmu_alloc_direct_roots() is called when vcpu->arch.mmu->root.hpa is INVALID_PAGE
in kvm_mmu_reload(), which could happen after direct root is invalidated.
> -void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> union kvm_mmu_page_role role = mmu->root_role;
> @@ -241,6 +246,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> struct kvm *kvm = vcpu->kvm;
> struct kvm_mmu_page *root;
>
> + if (mirror)
> + role.is_mirror = 1;
> +
Could we add a validity check of mirror_root_hpa to prevent an incorrect ref
count increment of the mirror root?
+ if (mirror) {
+ if (mmu->mirror_root_hpa != INVALID_PAGE)
+ return;
+
role.is_mirror = true;
+ }
> /*
> * Check for an existing root before acquiring the pages lock to avoid
> * unnecessary serialization if multiple vCPUs are loading a new root.
> @@ -292,8 +300,12 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> * and actually consuming the root if it's invalidated after dropping
> * mmu_lock, and the root can't be freed as this vCPU holds a reference.
> */
> - mmu->root.hpa = __pa(root->spt);
> - mmu->root.pgd = 0;
> + if (mirror) {
> + mmu->mirror_root_hpa = __pa(root->spt);
> + } else {
> + mmu->root.hpa = __pa(root->spt);
> + mmu->root.pgd = 0;
> + }
> }
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-24 8:29 ` Yan Zhao
@ 2024-06-24 23:15 ` Edgecombe, Rick P
2024-06-25 6:14 ` Yan Zhao
0 siblings, 1 reply; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-06-24 23:15 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Mon, 2024-06-24 at 16:29 +0800, Yan Zhao wrote:
> > @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> > * KVM_RUN is unreachable, i.e. no vCPUs will ever service the
> > request.
> > */
> > lockdep_assert_held_write(&kvm->mmu_lock);
> > - for_each_tdp_mmu_root_yield_safe(kvm, root)
> > + __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
> nit: update the comment of kvm_tdp_mmu_zap_all()
Yea, good idea. It definitely needs some explanation, considering the function
is called "zap_all". A bit unfortunate actually.
> and explain why it's
> KVM_DIRECT_ROOTS, not KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS.
Explain why not to zap invalid mirrored roots?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 15/17] KVM: x86/tdp_mmu: Propagate building mirror page tables
2024-06-20 5:15 ` Binbin Wu
@ 2024-06-24 23:52 ` Edgecombe, Rick P
0 siblings, 0 replies; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-06-24 23:52 UTC (permalink / raw)
To: binbin.wu@linux.intel.com
Cc: seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com, Zhao, Yan Y
On Thu, 2024-06-20 at 13:15 +0800, Binbin Wu wrote:
> > - role.is_mirror = 1;
> > + role.is_mirror = true;
> If this is needed, it's better to be done in "KVM: x86/tdp_mmu: Support
> mirror root for TDP MMU".
Whoops, and agreed on the textual issues. Thanks.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 16/17] KVM: x86/tdp_mmu: Propagate tearing down mirror page tables
2024-06-20 8:44 ` Binbin Wu
@ 2024-06-24 23:55 ` Edgecombe, Rick P
0 siblings, 0 replies; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-06-24 23:55 UTC (permalink / raw)
To: binbin.wu@linux.intel.com
Cc: seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com, Zhao, Yan Y
On Thu, 2024-06-20 at 16:44 +0800, Binbin Wu wrote:
> > TDX MMU support will exclude certain MMU operations, so only plug in the
> > mirroring x86 ops where they will be needed. For zapping/freeing, only
> > hook tdp_mmu_iter_set_spte() which is use used for mapping and linking
> ^
> extra "use"
>
> Also, this sentence is a bit confusing about "used for mapping and linking".
Yes. Is this more clear? "...tdp_mmu_iter_set_spte(), which is used for
setting leaf PTEs and linking non-leaf PTEs."
>
> > PTs. Don't bother hooking tdp_mmu_set_spte_atomic() as it is only used for
> > zapping PTEs in operations unsupported by TDX: zapping collapsible PTEs and
> > kvm_mmu_zap_all_fast().
> >
> > In previous changes to address races around concurrent populating using
> > tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set
> > REMOVED_SPTE in the mirrored page tables while performing the external
> ^
> FROZEN_SPTE
Oops. And agreed on the other nits. Thanks.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-06-24 8:30 ` Yan Zhao
@ 2024-06-25 0:51 ` Edgecombe, Rick P
2024-06-25 5:43 ` Yan Zhao
0 siblings, 1 reply; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-06-25 0:51 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Mon, 2024-06-24 at 16:30 +0800, Yan Zhao wrote:
> On Wed, Jun 19, 2024 at 03:36:10PM -0700, Rick Edgecombe wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e9c1783a8743..287dcc2685e4 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu
> > *vcpu)
> > int r;
> >
> > if (tdp_mmu_enabled) {
> > - kvm_tdp_mmu_alloc_root(vcpu);
> > + if (kvm_has_mirrored_tdp(vcpu->kvm))
> > + kvm_tdp_mmu_alloc_root(vcpu, true);
> > + kvm_tdp_mmu_alloc_root(vcpu, false);
> > return 0;
> > }
> mmu_alloc_direct_roots() is called when vcpu->arch.mmu->root.hpa is
> INVALID_PAGE
> in kvm_mmu_reload(), which could happen after direct root is invalidated.
>
> > -void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> > +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> > {
> > struct kvm_mmu *mmu = vcpu->arch.mmu;
> > union kvm_mmu_page_role role = mmu->root_role;
> > @@ -241,6 +246,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> > struct kvm *kvm = vcpu->kvm;
> > struct kvm_mmu_page *root;
> >
> > + if (mirror)
> > + role.is_mirror = 1;
> > +
> Could we add a validity check of mirror_root_hpa to prevent an incorrect ref
> count increment of the mirror root?
I was originally suspicious of the asymmetry of the tear down of mirror and
direct roots vs the allocation. Do you see a concrete problem, or just
advocating for safety?
>
> + if (mirror) {
> + if (mmu->mirror_root_hpa != INVALID_PAGE)
> + return;
> +
> role.is_mirror = true;
> + }
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-06-25 0:51 ` Edgecombe, Rick P
@ 2024-06-25 5:43 ` Yan Zhao
2024-06-25 20:33 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-06-25 5:43 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Tue, Jun 25, 2024 at 08:51:33AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2024-06-24 at 16:30 +0800, Yan Zhao wrote:
> > On Wed, Jun 19, 2024 at 03:36:10PM -0700, Rick Edgecombe wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index e9c1783a8743..287dcc2685e4 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu
> > > *vcpu)
> > > int r;
> > >
> > > if (tdp_mmu_enabled) {
> > > - kvm_tdp_mmu_alloc_root(vcpu);
> > > + if (kvm_has_mirrored_tdp(vcpu->kvm))
> > > + kvm_tdp_mmu_alloc_root(vcpu, true);
> > > + kvm_tdp_mmu_alloc_root(vcpu, false);
> > > return 0;
> > > }
> > mmu_alloc_direct_roots() is called when vcpu->arch.mmu->root.hpa is
> > INVALID_PAGE
> > in kvm_mmu_reload(), which could happen after direct root is invalidated.
> >
> > > -void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> > > +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> > > {
> > > struct kvm_mmu *mmu = vcpu->arch.mmu;
> > > union kvm_mmu_page_role role = mmu->root_role;
> > > @@ -241,6 +246,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> > > struct kvm *kvm = vcpu->kvm;
> > > struct kvm_mmu_page *root;
> > >
> > > + if (mirror)
> > > + role.is_mirror = 1;
> > > +
> > Could we add a validity check of mirror_root_hpa to prevent an incorrect ref
> > count increment of the mirror root?
>
> I was originally suspicious of the asymmetry of the tear down of mirror and
> direct roots vs the allocation. Do you see a concrete problem, or just
> advocating for safety?
IMO it's a concrete problem, though rare up to now. e.g.
After repeatedly hot-plugging and hot-unplugging memory, which increases the
memslots generation, kvm_mmu_zap_all_fast() will be called to invalidate direct
roots when the memslots generation wraps around.
>
> >
> > + if (mirror) {
> > + if (mmu->mirror_root_hpa != INVALID_PAGE)
> > + return;
> > +
> > role.is_mirror = true;
> > + }
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-24 23:15 ` Edgecombe, Rick P
@ 2024-06-25 6:14 ` Yan Zhao
2024-06-25 20:56 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-06-25 6:14 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Tue, Jun 25, 2024 at 07:15:09AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2024-06-24 at 16:29 +0800, Yan Zhao wrote:
> > > @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> > > * KVM_RUN is unreachable, i.e. no vCPUs will ever service the
> > > request.
> > > */
> > > lockdep_assert_held_write(&kvm->mmu_lock);
> > > - for_each_tdp_mmu_root_yield_safe(kvm, root)
> > > + __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
> > nit: update the comment of kvm_tdp_mmu_zap_all()
>
> Yea, good idea. It definitely needs some explanation, considering the function
> is called "zap_all". A bit unfortunate actually.
>
> > and explain why it's
> > KVM_DIRECT_ROOTS, not KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS.
>
> Explain why not to zap invalid mirrored roots?
No. Explain why not to zap invalid direct roots.
Just passing KVM_DIRECT_ROOTS will zap only valid direct roots, right?
The original kvm_tdp_mmu_zap_all() "Zap all roots, including invalid roots".
It might be better to explain why not to zap invalid direct roots here.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-06-25 5:43 ` Yan Zhao
@ 2024-06-25 20:33 ` Edgecombe, Rick P
2024-06-26 5:05 ` Yan Zhao
0 siblings, 1 reply; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-06-25 20:33 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: pbonzini@redhat.com, seanjc@google.com, Huang, Kai,
sagis@google.com, isaku.yamahata@gmail.com, Aktas, Erdem,
kvm@vger.kernel.org, dmatlack@google.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Tue, 2024-06-25 at 13:43 +0800, Yan Zhao wrote:
> > > > I was originally suspicious of the asymmetry of the tear down of mirror
> > > > and
> > > > direct roots vs the allocation. Do you see a concrete problem, or just
> > > > advocating for safety?
> > IMO it's a concrete problem, though rare up to now. e.g.
> >
> > After repeatedly hot-plugging and hot-unplugging memory, which increases the
> > memslots generation, kvm_mmu_zap_all_fast() will be called to invalidate
> > direct roots when the memslots generation wraps around.
Hmm, yes. I'm not sure about putting the check there though. It adds even more
confusion to the lifecycle.
- mirror_root_hpa != INVALID_PAGE check in a different place than
root.hpa != INVALID_PAGE check.
- they get allocated in the same place
- they are torn down in different places.
Can you think of a clearer fix for it? Maybe we can just move the mirror root
allocation such that it's not subjected to the reload path? Like something that
matches the tear down in kvm_mmu_destroy()?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-25 6:14 ` Yan Zhao
@ 2024-06-25 20:56 ` Edgecombe, Rick P
2024-06-26 2:25 ` Yan Zhao
0 siblings, 1 reply; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-06-25 20:56 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Tue, 2024-06-25 at 14:14 +0800, Yan Zhao wrote:
> > Explain why not to zap invalid mirrored roots?
> No. Explain why not to zap invalid direct roots.
> Just passing KVM_DIRECT_ROOTS will zap only valid direct roots, right?
> The original kvm_tdp_mmu_zap_all() "Zap all roots, including invalid roots".
> It might be better to explain why not to zap invalid direct roots here.
Hmm, right. We don't actually have an enum value to iterate all direct roots (valid
and invalid). We could change tdp_mmu_root_match() to:
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8320b093fef9..bcd771a8835f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -98,8 +98,8 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
if (WARN_ON_ONCE(!(types & KVM_VALID_ROOTS)))
return false;
- if (root->role.invalid)
- return types & KVM_INVALID_ROOTS;
+ if (root->role.invalid && !(types & KVM_INVALID_ROOTS))
+ return false;
if (likely(!is_mirror_sp(root)))
return types & KVM_DIRECT_ROOTS;
Maybe better than a comment...? Need to put the whole thing together and test it
still.
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-25 20:56 ` Edgecombe, Rick P
@ 2024-06-26 2:25 ` Yan Zhao
2024-07-03 20:00 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-06-26 2:25 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Wed, Jun 26, 2024 at 04:56:33AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2024-06-25 at 14:14 +0800, Yan Zhao wrote:
> > > Explain why not to zap invalid mirrored roots?
> > No. Explain why not to zap invalid direct roots.
> > Just passing KVM_DIRECT_ROOTS will zap only valid direct roots, right?
> > The original kvm_tdp_mmu_zap_all() "Zap all roots, including invalid roots".
> > It might be better to explain why not to zap invalid direct roots here.
>
> Hmm, right. We don't actually have an enum value to iterate all direct roots (valid
> and invalid). We could change tdp_mmu_root_match() to:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 8320b093fef9..bcd771a8835f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -98,8 +98,8 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
> if (WARN_ON_ONCE(!(types & KVM_VALID_ROOTS)))
> return false;
>
> - if (root->role.invalid)
> - return types & KVM_INVALID_ROOTS;
> + if (root->role.invalid && !(types & KVM_INVALID_ROOTS))
> + return false;
> if (likely(!is_mirror_sp(root)))
> return types & KVM_DIRECT_ROOTS;
>
> Maybe better than a comment...? Need to put the whole thing together and test it
> still.
Hmm, I think passing KVM_DIRECT_ROOTS only in kvm_tdp_mmu_zap_all() is ok,
because kvm_mmu_uninit_tdp_mmu() will zap all invalidated roots eventually,
though KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS would match the function name
zap_all better.
I'm not convinced that the change in tdp_mmu_root_match() above is needed.
Once a root is marked invalid, it won't be restored later. Distinguishing
between invalid direct roots and invalid mirror roots will only result in more
unused roots lingering unnecessarily.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-06-25 20:33 ` Edgecombe, Rick P
@ 2024-06-26 5:05 ` Yan Zhao
2024-07-03 19:40 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-06-26 5:05 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: pbonzini@redhat.com, seanjc@google.com, Huang, Kai,
sagis@google.com, isaku.yamahata@gmail.com, Aktas, Erdem,
kvm@vger.kernel.org, dmatlack@google.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Wed, Jun 26, 2024 at 04:33:20AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2024-06-25 at 13:43 +0800, Yan Zhao wrote:
> > > > > I was originally suspicious of the asymmetry of the tear down of mirror
> > > > > and
> > > > > direct roots vs the allocation. Do you see a concrete problem, or just
> > > > > advocating for safety?
> > > IMO it's a concrete problem, though rare up to now. e.g.
> > >
> > > After repeatedly hot-plugging and hot-unplugging memory, which increases the
> > > memslots generation, kvm_mmu_zap_all_fast() will be called to invalidate
> > > direct roots when the memslots generation wraps around.
>
> Hmm, yes. I'm not sure about putting the check there though. It adds even more
> confusion to the lifecycle.
> - mirror_root_hpa != INVALID_PAGE check in a different placed than
> root.hpa != INVALID_PAGE check.
> - they get allocated in the same place
> - they are torn down in the different places.
>
> Can you think of clearer fix for it. Maybe we can just move the mirror root
> allocation such that it's not subjected to the reload path? Like something that
> matches the tear down in kvm_mmu_destroy()?
But we still need the reload path to have each vCPU hold a ref count of the
mirror root.
What about below fix?
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 026e8edfb0bd..4decd13457ec 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -127,9 +127,28 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
int bytes);
+static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_mmu_root_hpa_is_invalid(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.mmu->root.hpa == INVALID_PAGE;
+}
+
+static inline bool kvm_mmu_mirror_root_hpa_is_invalid(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_has_mirrored_tdp(vcpu->kvm))
+ return false;
+
+ return vcpu->arch.mmu->mirror_root_hpa == INVALID_PAGE;
+}
+
static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
- if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
+ if (!kvm_mmu_root_hpa_is_invalid(vcpu) &&
+ !kvm_mmu_mirror_root_hpa_is_invalid(vcpu))
return 0;
return kvm_mmu_load(vcpu);
@@ -322,11 +341,6 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
return translate_nested_gpa(vcpu, gpa, access, exception);
}
-static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
-{
- return kvm->arch.vm_type == KVM_X86_TDX_VM;
-}
-
static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
{
return kvm->arch.gfn_direct_bits;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1299eb03e63..5e7d92074f70 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3702,9 +3702,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
int r;
if (tdp_mmu_enabled) {
- if (kvm_has_mirrored_tdp(vcpu->kvm))
+ if (kvm_mmu_mirror_root_hpa_is_invalid(vcpu))
kvm_tdp_mmu_alloc_root(vcpu, true);
- kvm_tdp_mmu_alloc_root(vcpu, false);
+
+ if (kvm_mmu_root_hpa_is_invalid(vcpu))
+ kvm_tdp_mmu_alloc_root(vcpu, false);
return 0;
}
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH v3 04/17] KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page
2024-06-19 22:36 ` [PATCH v3 04/17] KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page Rick Edgecombe
@ 2024-07-03 7:03 ` Yan Zhao
0 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2024-07-03 7:03 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, erdemaktas,
isaku.yamahata, linux-kernel, sagis, Isaku Yamahata, Binbin Wu
On Wed, Jun 19, 2024 at 03:36:01PM -0700, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Add an external pointer to struct kvm_mmu_page for TDX's private page table
> and add helper functions to allocate/initialize/free a private page table
> page. TDX will only be supported with the TDP MMU. Because KVM TDP MMU
> doesn't use unsync_children and write_flooding_count, pack them to have
> room for a pointer and use a union to avoid memory overhead.
>
> For private GPA, CPU refers to a private page table whose contents are
> encrypted. The dedicated APIs to operate on it (e.g. updating/reading its
> PTE entry) are used, and their cost is expensive.
>
> When KVM resolves the KVM page fault, it walks the page tables. To reuse
> the existing KVM MMU code and mitigate the heavy cost of directly walking
> the private page table allocate two sets of page tables for the private
> half of the GPA space.
>
> For the page tables that KVM will walk, allocate them like normal and refer
> to them as mirror page tables. Additionally allocate one more page for the
> page tables the CPU will walk, and call them external page tables. Resolve
> the KVM page fault with the existing code, and do additional operations
> necessary for modifying the external page table in future patches.
There should be only one mirror page table and one external page table.
What about the below change?
For the page table that KVM will walk, allocate it like normal and refer to
it as mirror page table. Additionally allocate one more page table that the
CPU will walk, and call it external page table. Resolve the KVM page fault
with the existing code, and do additional operations necessary for
modifying the external page table in future patches.
> The relationship of the types of page tables in this scheme is depicted
> below:
>
> KVM page fault |
> | |
> V |
> -------------+---------- |
> | | |
> V V |
> shared GPA private GPA |
> | | |
> V V |
> shared PT root mirror PT root | private PT root
^
Should we use "private PT" or "external PT" here?
"private PT", which is the term TDX uses, looks more appropriate, but a legend
for it is lacking below, e.g.
Private PT - TDX's terminology for what KVM refers to as the external PT.
Or just use "private EPT"?
But "private EPT" and "private page table" already co-exist in lots of the
code comments below. Not sure if it matters.
> | | | |
> V V | V
> shared PT mirror PT --propagate--> external PT
> | | | |
> | \-----------------+------\ |
> | | | |
> V | V V
> shared guest page | private guest page
> |
> non-encrypted memory | encrypted memory
> |
> PT - Page table
> Shared PT - Visible to KVM, and the CPU uses it for shared mappings.
> External PT - The CPU uses it, but it is invisible to KVM. TDX module
> updates this table to map private guest pages.
> Mirror PT - It is visible to KVM, but the CPU doesn't use it. KVM uses
> it to propagate PT change to the actual private PT.
>
> Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it
> to the TDX vm type.
>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> TDX MMU Prep v3:
> - mirrored->external rename (Paolo)
> - Remove accidentally included kvm_mmu_alloc_private_spt() (Paolo)
> - Those -> These (Paolo)
> - Change log updates to make external/mirrored naming more clear
>
> TDX MMU Prep v2:
> - Rename private->mirror
> - Don't trigger off of shared mask
>
> TDX MMU Prep:
> - Rename terminology, dummy PT => mirror PT. and updated the commit message
> By Rick and Kai.
> - Added a comment on union of private_spt by Rick.
> - Don't handle the root case in kvm_mmu_alloc_private_spt(), it will not
> be needed in future patches. (Rick)
> - Update comments (Yan)
> - Remove kvm_mmu_init_private_spt(), open code it in later patches (Yan)
>
> v19:
> - typo in the comment in kvm_mmu_alloc_private_spt()
> - drop CONFIG_KVM_MMU_PRIVATE
> ---
> arch/x86/include/asm/kvm_host.h | 5 +++++
> arch/x86/kvm/mmu.h | 5 +++++
> arch/x86/kvm/mmu/mmu.c | 7 +++++++
> arch/x86/kvm/mmu/mmu_internal.h | 31 +++++++++++++++++++++++++++----
> arch/x86/kvm/mmu/tdp_mmu.c | 1 +
> 5 files changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aabf1648a56a..9e35fe32f500 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -817,6 +817,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
> + /*
> + * This cache is to allocate private page table. E.g. private EPT used
s/private page table/external page table
> + * by the TDX module.
> + */
> + struct kvm_mmu_memory_cache mmu_external_spt_cache;
>
> /*
> * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index dc80e72e4848..0c3bf89cf7db 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -318,4 +318,9 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
> return gpa;
> return translate_nested_gpa(vcpu, gpa, access, exception);
> }
> +
> +static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
> +{
> + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}
> #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f41c498fcdb5..8023cebeefaa 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> + if (kvm_has_mirrored_tdp(vcpu->kvm)) {
> + r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_external_spt_cache,
> + PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> + }
> r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> PT64_ROOT_MAX_LEVEL);
> if (r)
> @@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> + kvm_mmu_free_memory_cache(&vcpu->arch.mmu_external_spt_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 706f0ce8784c..d2837f796f34 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -101,7 +101,22 @@ struct kvm_mmu_page {
> int root_count;
> refcount_t tdp_mmu_root_count;
> };
> - unsigned int unsync_children;
> + union {
> + /* These two members aren't used for TDP MMU */
> + struct {
> + unsigned int unsync_children;
> + /*
> + * Number of writes since the last time traversal
> + * visited this page.
> + */
> + atomic_t write_flooding_count;
> + };
> + /*
> + * Page table page of private PT.
s/private/external
> + * Passed to TDX module, not accessed by KVM.
> + */
> + void *external_spt;
> + };
> union {
> struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
> tdp_ptep_t ptep;
> @@ -124,9 +139,6 @@ struct kvm_mmu_page {
> int clear_spte_count;
> #endif
>
> - /* Number of writes since the last time traversal visited this page. */
> - atomic_t write_flooding_count;
> -
> #ifdef CONFIG_X86_64
> /* Used for freeing the page asynchronously if it is a TDP MMU page. */
> struct rcu_head rcu_head;
> @@ -145,6 +157,17 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> return kvm_mmu_role_as_id(sp->role);
> }
>
> +static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +{
> + /*
> + * external_spt is allocated for TDX module to hold private EPT mappings,
> + * TDX module will initialize the page by itself.
> + * Therefore, KVM does not need to initialize or access external_spt.
> + * KVM only interacts with sp->spt for external EPT operations.
s/external EPT/private EPT
or
s/external EPT/external page table
> + */
> + sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> +}
> +
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> {
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 16b54208e8d7..35249555b585 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>
> static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> + free_page((unsigned long)sp->external_spt);
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> }
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-06-26 5:05 ` Yan Zhao
@ 2024-07-03 19:40 ` Edgecombe, Rick P
2024-07-04 8:09 ` Yan Zhao
0 siblings, 1 reply; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-07-03 19:40 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: linux-kernel@vger.kernel.org, seanjc@google.com, Huang, Kai,
sagis@google.com, isaku.yamahata@gmail.com, Aktas, Erdem,
dmatlack@google.com, kvm@vger.kernel.org, Yamahata, Isaku,
pbonzini@redhat.com
On Wed, 2024-06-26 at 13:05 +0800, Yan Zhao wrote:
> But we still need the reload path to have each vCPU hold a ref count of the
> mirror root.
> What about below fix?
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 026e8edfb0bd..4decd13457ec 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -127,9 +127,28 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
> void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> int bytes);
>
> +static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
> +{
> + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}
> +
> +static inline bool kvm_mmu_root_hpa_is_invalid(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.mmu->root.hpa == INVALID_PAGE;
> +}
> +
> +static inline bool kvm_mmu_mirror_root_hpa_is_invalid(struct kvm_vcpu *vcpu)
> +{
> + if (!kvm_has_mirrored_tdp(vcpu->kvm))
> + return false;
> +
> + return vcpu->arch.mmu->mirror_root_hpa == INVALID_PAGE;
> +}
> +
> static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
> {
> - if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
> + if (!kvm_mmu_root_hpa_is_invalid(vcpu) &&
> + !kvm_mmu_mirror_root_hpa_is_invalid(vcpu))
If we go this way, we should keep the likely(). But I'm not convinced.
> return 0;
>
> return kvm_mmu_load(vcpu);
> @@ -322,11 +341,6 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu
> *vcpu,
> return translate_nested_gpa(vcpu, gpa, access, exception);
> }
>
> -static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
> -{
> - return kvm->arch.vm_type == KVM_X86_TDX_VM;
> -}
> -
> static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
> {
> return kvm->arch.gfn_direct_bits;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1299eb03e63..5e7d92074f70 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3702,9 +3702,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu
> *vcpu)
> int r;
>
> if (tdp_mmu_enabled) {
> - if (kvm_has_mirrored_tdp(vcpu->kvm))
> + if (kvm_mmu_mirror_root_hpa_is_invalid(vcpu))
> kvm_tdp_mmu_alloc_root(vcpu, true);
> - kvm_tdp_mmu_alloc_root(vcpu, false);
> +
> + if (kvm_mmu_root_hpa_is_invalid(vcpu))
> + kvm_tdp_mmu_alloc_root(vcpu, false);
So we can have either:
mirror_root_hpa = INVALID_PAGE
root.hpa = INVALID_PAGE
or:
mirror_root_hpa = root
root.hpa = INVALID_PAGE
We don't ever have:
mirror_root_hpa = INVALID_PAGE
root.hpa = root
Because mirror_root_hpa is special.
> return 0;
> }
>
Having the check both in kvm_mmu_reload() and here is really ugly. Right now we
have kvm_mmu_reload(), which only triggers a load if new roots are needed, then
kvm_mmu_load() goes and allocates/references them. Now mmu_alloc_direct_roots()
has the same only-reload-as-needed logic.
So could we just leave the kvm_mmu_reload() logic as is, and just have special
logic in mmu_alloc_direct_roots() to avoid re-allocating mirror roots? This
is actually what v19 had, but it was thought to be a historical relic:
"Historically we needed it. We don't need it now. We can drop it."
https://lore.kernel.org/kvm/20240325200122.GH2357401@ls.amr.corp.intel.com/
So just have this small fixup instead?
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ae5d5dee86af..2e062178222e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3705,7 +3705,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
int r;
if (tdp_mmu_enabled) {
- if (kvm_has_mirrored_tdp(vcpu->kvm))
+ if (kvm_has_mirrored_tdp(vcpu->kvm) &&
+ !VALID_PAGE(mmu->mirror_root_hpa))
kvm_tdp_mmu_alloc_root(vcpu, true);
kvm_tdp_mmu_alloc_root(vcpu, false);
return 0;
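The root-state invariant argued above can be written down as a toy model. This is a hedged illustration only (plain C, not kernel code; `toy_mmu`, the helper names, and the HPA values are all made up): the mirror root, once allocated, is never freed before the direct root, so a valid root.hpa implies a valid mirror_root_hpa.

```c
#include <assert.h>
#include <stdbool.h>

#define INVALID_PAGE 0ULL

/*
 * Toy model of the invariant discussed above: on a VM with mirrored TDP,
 * the mirror root is allocated before (or together with) the direct root,
 * so root.hpa being valid implies mirror_root_hpa is valid too. The HPA
 * values below are placeholders, not real addresses.
 */
struct toy_mmu {
	unsigned long long mirror_root_hpa;
	unsigned long long root_hpa;
};

static void toy_alloc_roots(struct toy_mmu *mmu, bool has_mirror)
{
	/* Mirror root is only allocated once, mimicking the fixup above. */
	if (has_mirror && mmu->mirror_root_hpa == INVALID_PAGE)
		mmu->mirror_root_hpa = 0x1000;	/* pretend allocation */
	mmu->root_hpa = 0x2000;			/* pretend allocation */
}

/* The forbidden state: mirror invalid while the direct root is valid. */
static bool toy_invariant_holds(const struct toy_mmu *mmu, bool has_mirror)
{
	if (!has_mirror)
		return true;
	return !(mmu->mirror_root_hpa == INVALID_PAGE &&
		 mmu->root_hpa != INVALID_PAGE);
}
```

With this model, checking only root_hpa in a reload path is sufficient, because root_hpa can never be valid while mirror_root_hpa is invalid.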
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-26 2:25 ` Yan Zhao
@ 2024-07-03 20:00 ` Edgecombe, Rick P
2024-07-05 1:16 ` Yan Zhao
0 siblings, 1 reply; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-07-03 20:00 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Wed, 2024-06-26 at 10:25 +0800, Yan Zhao wrote:
> > Maybe better than a comment...? Need to put the whole thing together and
> > test it
> > still.
> Hmm, I think passing KVM_DIRECT_ROOTS only in kvm_tdp_mmu_zap_all() is ok,
> because kvm_mmu_uninit_tdp_mmu() will zap all invalidated roots eventually,
> though KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS matches the function name zap_all
> better.
>
> I'm not convinced that the change in tdp_mmu_root_match() above is needed.
> Once a root is marked invalid, it won't be restored later. Distinguishing
> between invalid direct roots and invalid mirror roots will only result in more
> unused roots lingering unnecessarily.
The logic for direct roots is for normal VMs as well. In any case, we should not
mix generic KVM MMU changes in with mirrored memory additions. So let's keep the
existing direct root behavior the same.
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-07-03 19:40 ` Edgecombe, Rick P
@ 2024-07-04 8:09 ` Yan Zhao
2024-07-09 22:36 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-07-04 8:09 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel@vger.kernel.org, seanjc@google.com, Huang, Kai,
sagis@google.com, isaku.yamahata@gmail.com, Aktas, Erdem,
dmatlack@google.com, kvm@vger.kernel.org, Yamahata, Isaku,
pbonzini@redhat.com
On Thu, Jul 04, 2024 at 03:40:35AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2024-06-26 at 13:05 +0800, Yan Zhao wrote:
> > But we still need the reload path to have each vcpu hold a ref count of the
> > mirror root.
> > What about below fix?
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 026e8edfb0bd..4decd13457ec 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -127,9 +127,28 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
> > void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> > int bytes);
> >
> > +static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
> > +{
> > + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > +}
> > +
> > +static inline bool kvm_mmu_root_hpa_is_invalid(struct kvm_vcpu *vcpu)
> > +{
> > + return vcpu->arch.mmu->root.hpa == INVALID_PAGE;
> > +}
> > +
> > +static inline bool kvm_mmu_mirror_root_hpa_is_invalid(struct kvm_vcpu *vcpu)
> > +{
> > + if (!kvm_has_mirrored_tdp(vcpu->kvm))
> > + return false;
> > +
> > + return vcpu->arch.mmu->mirror_root_hpa == INVALID_PAGE;
> > +}
> > +
> > static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
> > {
> > - if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
> > + if (!kvm_mmu_root_hpa_is_invalid(vcpu) &&
> > + !kvm_mmu_mirror_root_hpa_is_invalid(vcpu))
>
> If we go this way, we should keep the likely(). But I'm not convinced.
>
> > return 0;
> >
> > return kvm_mmu_load(vcpu);
> > @@ -322,11 +341,6 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu
> > *vcpu,
> > return translate_nested_gpa(vcpu, gpa, access, exception);
> > }
> >
> > -static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
> > -{
> > - return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > -}
> > -
> > static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
> > {
> > return kvm->arch.gfn_direct_bits;
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e1299eb03e63..5e7d92074f70 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3702,9 +3702,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu
> > *vcpu)
> > int r;
> >
> > if (tdp_mmu_enabled) {
> > - if (kvm_has_mirrored_tdp(vcpu->kvm))
> > + if (kvm_mmu_mirror_root_hpa_is_invalid(vcpu))
> > kvm_tdp_mmu_alloc_root(vcpu, true);
> > - kvm_tdp_mmu_alloc_root(vcpu, false);
> > +
> > + if (kvm_mmu_root_hpa_is_invalid(vcpu))
> > + kvm_tdp_mmu_alloc_root(vcpu, false);
>
> So we can have either:
> mirror_root_hpa = INVALID_PAGE
> root.hpa = INVALID_PAGE
>
> or:
> mirror_root_hpa = root
> root.hpa = INVALID_PAGE
>
> We don't ever have:
> mirror_root_hpa = INVALID_PAGE
> root.hpa = root
>
> Because mirror_root_hpa is special.
>
Good point.
> > return 0;
> > }
> >
>
> Having the check both in kvm_mmu_reload() and here is really ugly. Right now we
> have kvm_mmu_reload(), which only triggers a load if new roots are needed, then
> kvm_mmu_load() goes and allocates/references them. Now mmu_alloc_direct_roots()
> has the same only-reload-as-needed logic.
>
> So could we just leave the kvm_mmu_reload() logic as is, and just have special
> logic in mmu_alloc_direct_roots() to avoid re-allocating mirror roots? This
> is actually what v19 had, but it was thought to be a historical relic:
> "Historically we needed it. We don't need it now. We can drop it."
> https://lore.kernel.org/kvm/20240325200122.GH2357401@ls.amr.corp.intel.com/
>
> So just have this small fixup instead?
Yes.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ae5d5dee86af..2e062178222e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3705,7 +3705,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> int r;
>
> if (tdp_mmu_enabled) {
> - if (kvm_has_mirrored_tdp(vcpu->kvm))
> + if (kvm_has_mirrored_tdp(vcpu->kvm) &&
> + !VALID_PAGE(mmu->mirror_root_hpa))
> kvm_tdp_mmu_alloc_root(vcpu, true);
> kvm_tdp_mmu_alloc_root(vcpu, false);
> return 0;
>
Perhaps also add a comment in kvm_mmu_reload() to address concerns about why
checking only root.hpa there is enough.
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 03737f3aaeeb..aba98c8cc67d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -129,6 +129,15 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
+ /*
+ * Checking root.hpa is sufficient even when KVM has mirror root.
+ * We can have either:
+ * (1) mirror_root_hpa = INVALID_PAGE, root.hpa = INVALID_PAGE
+ * (2) mirror_root_hpa = root , root.hpa = INVALID_PAGE
+ * (3) mirror_root_hpa = root1, root.hpa = root2
+ * We don't ever have:
+ * mirror_root_hpa = INVALID_PAGE, root.hpa = root
+ */
if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
return 0;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a5f803f1d17e..eee35e958971 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3705,7 +3705,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
int r;
if (tdp_mmu_enabled) {
- if (kvm_has_mirrored_tdp(vcpu->kvm))
+ if (kvm_has_mirrored_tdp(vcpu->kvm) &&
+ !VALID_PAGE(mmu->mirror_root_hpa))
kvm_tdp_mmu_alloc_root(vcpu, true);
kvm_tdp_mmu_alloc_root(vcpu, false);
return 0;
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-06-19 22:36 ` [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU Rick Edgecombe
2024-06-24 8:30 ` Yan Zhao
@ 2024-07-04 8:51 ` Yan Zhao
2024-07-09 22:38 ` Edgecombe, Rick P
1 sibling, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-07-04 8:51 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, erdemaktas,
isaku.yamahata, linux-kernel, sagis, Isaku Yamahata
On Wed, Jun 19, 2024 at 03:36:10PM -0700, Rick Edgecombe wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index b887c225ff24..2903f03a34be 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -10,7 +10,7 @@
> void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
>
> -void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
> +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private);
>
> __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
This function name is very similar to tdp_mmu_get_root() below.
> +static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
> + return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
> +
> + return root_to_sp(vcpu->arch.mmu->root.hpa);
> +}
> +
> +static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
> + enum kvm_tdp_mmu_root_types type)
> +{
> + if (unlikely(type == KVM_MIRROR_ROOTS))
> + return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
> +
> + return root_to_sp(vcpu->arch.mmu->root.hpa);
> +}
No need to put the two functions in tdp_mmu.h as they are not called in mmu.c.
Could we move them to tdp_mmu.c and rename them to something like
tdp_mmu_type_to_root() and tdp_mmu_fault_to_root()?
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-07-03 20:00 ` Edgecombe, Rick P
@ 2024-07-05 1:16 ` Yan Zhao
2024-07-09 22:52 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2024-07-05 1:16 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Thu, Jul 04, 2024 at 04:00:18AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2024-06-26 at 10:25 +0800, Yan Zhao wrote:
> > > Maybe better than a comment...? Need to put the whole thing together and
> > > test it
> > > still.
> > Hmm, I think passing KVM_DIRECT_ROOTS only in kvm_tdp_mmu_zap_all() is ok,
> > because kvm_mmu_uninit_tdp_mmu() will zap all invalidated roots eventually,
> > though KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS matches the function name zap_all
> > better.
> >
> > I'm not convinced that the change in tdp_mmu_root_match() above is needed.
> > Once a root is marked invalid, it won't be restored later. Distinguishing
> > between invalid direct roots and invalid mirror roots will only result in more
> > unused roots lingering unnecessarily.
>
> The logic for direct roots is for normal VMs as well. In any case, we should not
> mix generic KVM MMU changes in with mirrored memory additions. So let's keep the
> existing direct root behavior the same.
To keep the existing direct root behavior the same, I think specifying
KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS in kvm_tdp_mmu_zap_all() is enough.
No need to modify tdp_mmu_root_match() to distinguish between invalid direct
roots and invalid mirror roots. As long as a root is invalid, the guest is no
longer affected by it and KVM will not use it any more. The last remaining
operation on an invalid root is zapping.
Distinguishing between invalid direct roots and invalid mirror roots would
make the code that zaps invalid roots unnecessarily complex, e.g.
kvm_tdp_mmu_zap_invalidated_roots() is called both in kvm_mmu_uninit_tdp_mmu()
and kvm_mmu_zap_all_fast().
- When called from the former, both invalid direct and invalid mirror roots
  need to be zapped;
- when called from the latter, only invalid direct roots need to be zapped.
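The matching rule being proposed can be sketched as a toy predicate. This is a hedged illustration only (the flag names mimic the series' enum kvm_tdp_mmu_root_types, but the values, struct, and helper are made up): once a root is invalid, it matches purely on the INVALID bit, with no direct/mirror distinction.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flags modeled on the series' enum kvm_tdp_mmu_root_types. */
enum toy_root_types {
	TOY_DIRECT_ROOTS  = 1 << 0,
	TOY_MIRROR_ROOTS  = 1 << 1,
	TOY_INVALID_ROOTS = 1 << 2,
};

struct toy_root {
	bool is_mirror;
	bool is_invalid;
};

/*
 * Toy version of the suggested match rule: an invalid root matches on
 * TOY_INVALID_ROOTS alone, since the only remaining operation on it is
 * zapping; valid roots match on their direct/mirror kind.
 */
static bool toy_root_match(const struct toy_root *root, int types)
{
	if (root->is_invalid)
		return types & TOY_INVALID_ROOTS;
	return types & (root->is_mirror ? TOY_MIRROR_ROOTS : TOY_DIRECT_ROOTS);
}
```

Under this rule, a caller zapping all invalidated roots passes just the INVALID bit and does not care how the root was created.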
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-07-04 8:09 ` Yan Zhao
@ 2024-07-09 22:36 ` Edgecombe, Rick P
0 siblings, 0 replies; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-07-09 22:36 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: pbonzini@redhat.com, seanjc@google.com, Huang, Kai,
sagis@google.com, isaku.yamahata@gmail.com, Aktas, Erdem,
kvm@vger.kernel.org, dmatlack@google.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Thu, 2024-07-04 at 16:09 +0800, Yan Zhao wrote:
> Perhaps also add a comment in kvm_mmu_reload() to address concerns about why
> checking only root.hpa there is enough.
Sounds good, and thanks again for catching this.
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 03737f3aaeeb..aba98c8cc67d 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -129,6 +129,15 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t
> gpa, const u8 *new,
>
> static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
> {
> + /*
> + * Checking root.hpa is sufficient even when KVM has mirror root.
> + * We can have either:
> + * (1) mirror_root_hpa = INVALID_PAGE, root.hpa = INVALID_PAGE
> + * (2) mirror_root_hpa = root , root.hpa = INVALID_PAGE
Looks good to me except for the space ^
> + * (3) mirror_root_hpa = root1, root.hpa = root2
> + * We don't ever have:
> + * mirror_root_hpa = INVALID_PAGE, root.hpa = root
> + */
> if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
> return 0;
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a5f803f1d17e..eee35e958971 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3705,7 +3705,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> int r;
>
> if (tdp_mmu_enabled) {
> - if (kvm_has_mirrored_tdp(vcpu->kvm))
> + if (kvm_has_mirrored_tdp(vcpu->kvm) &&
> + !VALID_PAGE(mmu->mirror_root_hpa))
> kvm_tdp_mmu_alloc_root(vcpu, true);
> kvm_tdp_mmu_alloc_root(vcpu, false);
> return 0;
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-07-04 8:51 ` Yan Zhao
@ 2024-07-09 22:38 ` Edgecombe, Rick P
2024-07-11 23:54 ` Edgecombe, Rick P
0 siblings, 1 reply; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-07-09 22:38 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Thu, 2024-07-04 at 16:51 +0800, Yan Zhao wrote:
> On Wed, Jun 19, 2024 at 03:36:10PM -0700, Rick Edgecombe wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index b887c225ff24..2903f03a34be 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -10,7 +10,7 @@
> > void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> >
> > -void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
> > +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private);
> >
> > __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page
> > *root)
> > This function name is very similar to tdp_mmu_get_root() below.
That is true.
>
> > +static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct
> > kvm_vcpu *vcpu,
> > + struct
> > kvm_page_fault *fault)
> > +{
> > + if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
> > + return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
> > +
> > + return root_to_sp(vcpu->arch.mmu->root.hpa);
> > +}
> > +
> > +static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
> > + enum
> > kvm_tdp_mmu_root_types type)
> > +{
> > + if (unlikely(type == KVM_MIRROR_ROOTS))
> > + return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
> > +
> > + return root_to_sp(vcpu->arch.mmu->root.hpa);
> > +}
> No need to put the two functions in tdp_mmu.h as they are not called in mmu.c.
I think it is nice to have the root/enum logic above close together.
>
> Could we move them to tdp_mmu.c and rename them to something like
> tdp_mmu_type_to_root() and tdp_mmu_fault_to_root() ?
tdp_mmu_get_root_for_fault() was proposed by Paolo, and tdp_mmu_get_root() was
discussed without comment. Not to say there is anything wrong with the names
proposed, but I think this is wading into bikeshedding territory at this stage.
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-07-05 1:16 ` Yan Zhao
@ 2024-07-09 22:52 ` Edgecombe, Rick P
0 siblings, 0 replies; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-07-09 22:52 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Gao, Chao, seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Fri, 2024-07-05 at 09:16 +0800, Yan Zhao wrote:
> To keep the existing direct root behavior the same, I think specifying
> KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS in kvm_tdp_mmu_zap_all() is enough.
Right.
>
> No need to modify tdp_mmu_root_match() do distinguish between invalid direct
> roots and invalid mirror roots. As long as a root is invalid, guest is no
> longer
> affected by it and KVM will not use it any more. The last remaining operation
> to the invalid root is only zapping.
>
> Distinguishing between invalid direct roots and invalid mirror roots would
> make the code to zap invalid roots unnecessarily complex, e.g.
I'm not sure that it is more complicated. One requires a big comment to explain,
and the other is self explanatory...
>
> kvm_tdp_mmu_zap_invalidated_roots() is called both in kvm_mmu_uninit_tdp_mmu()
> and kvm_mmu_zap_all_fast().
>
> - When called in the former, both invalid direct and invalid mirror roots are
> required to zap;
> - when called in the latter, only invalid direct roots are required to zap.
It might help to put together a full fixup at this point. We have a couple of
diffs in this thread, and I'm not 100% sure which base patch we are talking
about.
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-07-09 22:38 ` Edgecombe, Rick P
@ 2024-07-11 23:54 ` Edgecombe, Rick P
2024-07-12 1:42 ` Yan Zhao
0 siblings, 1 reply; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-07-11 23:54 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Tue, 2024-07-09 at 15:38 -0700, Rick Edgecombe wrote:
> > Could we move them to tdp_mmu.c and rename them to something like
> > tdp_mmu_type_to_root() and tdp_mmu_fault_to_root() ?
>
> tdp_mmu_get_root_for_fault() was proposed by Paolo, and tdp_mmu_get_root() was
> discussed without comment. Not to say there is anything wrong with the names
> proposed, but I think this is wading into bikeshedding territory at this
> stage.
I kept thinking about this comment. On more consideration, tdp_mmu_get_root()
is a problematically terrible name, since "get" usually means taking a reference
to something, which is something you can do to roots (and what
kvm_tdp_mmu_get_root() is doing). So people might think tdp_mmu_get_root() is
taking a reference.
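For context, the kernel-wide convention the name collides with can be sketched like this. This is a toy refcount model, not the real kvm_tdp_mmu_get_root()/kvm_tdp_mmu_put_root() implementation; the struct and helper names are made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy reference-counted root: "get" acquires a reference, "put" drops one. */
struct toy_root {
	int refcount;
};

/* "get": acquires a reference; caller must pair it with toy_put_root(). */
static bool toy_get_root(struct toy_root *root)
{
	if (root->refcount == 0)
		return false;	/* object is being freed, refuse */
	root->refcount++;
	return true;
}

static void toy_put_root(struct toy_root *root)
{
	root->refcount--;
}

/*
 * A plain lookup: returns the root without touching the refcount. Naming
 * this "get" would wrongly suggest the caller owes a matching "put".
 */
static struct toy_root *toy_lookup_root(struct toy_root *root)
{
	return root;
}
```

The naming concern above is exactly the difference between toy_get_root() and toy_lookup_root(): only the former changes the refcount.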
* Re: [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU
2024-07-11 23:54 ` Edgecombe, Rick P
@ 2024-07-12 1:42 ` Yan Zhao
0 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2024-07-12 1:42 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: seanjc@google.com, Huang, Kai, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, kvm@vger.kernel.org,
Yamahata, Isaku, pbonzini@redhat.com
On Fri, Jul 12, 2024 at 07:54:50AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2024-07-09 at 15:38 -0700, Rick Edgecombe wrote:
> > > Could we move them to tdp_mmu.c and rename them to something like
> > > tdp_mmu_type_to_root() and tdp_mmu_fault_to_root() ?
> >
> > tdp_mmu_get_root_for_fault() was proposed by Paolo, and tdp_mmu_get_root() was
> > discussed without comment. Not to say there is anything wrong with the names
> > proposed, but I think this is wading into bikeshedding territory at this
> > stage.
>
> I kept thinking about this comment. On more consideration, tdp_mmu_get_root()
> is a problematically terrible name, since "get" usually means taking a reference
> to something, which is something you can do to roots (and what
> kvm_tdp_mmu_get_root() is doing). So people might think tdp_mmu_get_root() is
> taking a reference.
Right, I have exactly the same feeling about "get".
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-06-21 19:08 ` Edgecombe, Rick P
2024-06-24 8:29 ` Yan Zhao
@ 2024-07-18 15:28 ` Isaku Yamahata
2024-07-18 15:55 ` Edgecombe, Rick P
1 sibling, 1 reply; 47+ messages in thread
From: Isaku Yamahata @ 2024-07-18 15:28 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Zhao, Yan Y, Gao, Chao, seanjc@google.com, Huang, Kai,
sagis@google.com, isaku.yamahata@gmail.com,
linux-kernel@vger.kernel.org, Aktas, Erdem, dmatlack@google.com,
kvm@vger.kernel.org, Yamahata, Isaku, pbonzini@redhat.com,
isaku.yamahata
On Fri, Jun 21, 2024 at 07:08:22PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
> On Fri, 2024-06-21 at 15:10 +0800, Yan Zhao wrote:
> > On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 630e6b6d4bf2..a1ab67a4f41f 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> > > * for zapping and thus puts the TDP MMU's reference to each root,
> > > i.e.
> > > * ultimately frees all roots.
> > > */
> > > - kvm_tdp_mmu_invalidate_all_roots(kvm);
> > > + kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> > all roots (mirror + direct) are invalidated here.
>
> Right.
> >
> > > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > kvm_tdp_mmu_zap_invalidated_roots() will zap invalidated mirror root with
> > mmu_lock held for read, which should trigger KVM_BUG_ON() in
> > __tdp_mmu_set_spte_atomic(), which assumes "atomic zapping don't operate on
> > mirror roots".
> >
> > But up to now, the KVM_BUG_ON() is not triggered, because
> > kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in the
> > call trace below), and kvm_arch_flush_shadow_all() in
> > kvm_mmu_notifier_release() has zapped all mirror SPTEs before
> > kvm_mmu_uninit_vm() is called in kvm_destroy_vm().
> >
> >
> > kvm_mmu_notifier_release
> > kvm_flush_shadow_all
> > kvm_arch_flush_shadow_all
> > static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> > kvm_mmu_zap_all ==>hold mmu_lock for write
> > kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write
> >
> > kvm_destroy_vm
> > kvm_arch_destroy_vm
> > kvm_mmu_uninit_vm
> > kvm_mmu_uninit_tdp_mmu
> > kvm_tdp_mmu_invalidate_roots ==>invalid all KVM_VALID_ROOTS
> > kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held
> > for read
> >
> >
> > A question: why does kvm_mmu_notifier_release(), as a primary MMU notifier
> > callback, zap the mirrored TDP when all other callbacks are with
> > KVM_FILTER_SHARED?
> >
> > Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
> > kvm_mmu_notifier_release() and move the mirrored TDP related stuff from
> > kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
> > held for write?
>
> Sigh, thanks for flagging this. I agree it seems weird to free private memory
> from an MMU notifier callback. I also found this old thread where Sean NAKed the
> current approach (freeing the HKID in MMU release):
> https://lore.kernel.org/kvm/ZN+1QHGa6ltpQxZn@google.com/#t
>
> One challenge is that flush_shadow_all_private() needs to be done before
> kvm_destroy_vcpus(), where it gets into tdx_vcpu_free(). So kvm_mmu_uninit_vm()
> is too late. Perhaps this is why it was shoved into mmu notifier release (which
> happens long before as you noted). Isaku, do you recall any other reasons?
Although it's a bit late, for the record:
It was to optimize destruction of the Secure-EPT: free the HKID early and
reclaim the Secure-EPT pages with TDH.PHYMEM.PAGE.RECLAIM(). QEMU doesn't close
any KVM file descriptors on exit. (The gmem fd references the KVM VM fd, so VM
destruction happens after all gmem fds are closed. Closing a gmem fd causes
Secure-EPT zapping before releasing the HKID.)
Because we're ignoring such optimization for now, we can simply defer releasing
the HKID, following Sean's call.
> But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus, so we
> could maybe actually just do the tdx_mmu_release_hkid() part there. Then drop
> the flush_shadow_all_private x86 op. See the (not thoroughly checked) diff at
> the bottom of this mail.
Yep, we can release the HKID at VM destruction, at the cost of potentially slow
zapping of the Secure-EPT. The following change basically looks good to me.
(The callback for Secure-EPT can be simplified.)
Thanks,
> But most of what is being discussed is in future patches where it starts to get
> into the TDX module interaction. So I wonder if we should drop this patch 17
> from "part 1" and include it with the next series so it can all be considered
> together.
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-
> ops.h
> index 2adf36b74910..3927731aa947 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
> KVM_X86_OP(vcpu_after_set_cpuid)
> KVM_X86_OP_OPTIONAL(vm_enable_cap)
> KVM_X86_OP(vm_init)
> -KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
> KVM_X86_OP_OPTIONAL(vm_destroy)
> KVM_X86_OP_OPTIONAL(vm_free)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8a72e5873808..8b2b79b39d0f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1647,7 +1647,6 @@ struct kvm_x86_ops {
> unsigned int vm_size;
> int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
> int (*vm_init)(struct kvm *kvm);
> - void (*flush_shadow_all_private)(struct kvm *kvm);
> void (*vm_destroy)(struct kvm *kvm);
> void (*vm_free)(struct kvm *kvm);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1299eb03e63..4deeeac14324 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6446,7 +6446,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> * lead to use-after-free.
> */
> if (tdp_mmu_enabled)
> - kvm_tdp_mmu_zap_invalidated_roots(kvm);
> + kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
> }
>
> static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> @@ -6977,13 +6977,6 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
>
> void kvm_arch_flush_shadow_all(struct kvm *kvm)
> {
> - /*
> - * kvm_mmu_zap_all() zaps both private and shared page tables. Before
> - * tearing down private page tables, TDX requires some TD resources to
> - * be destroyed (i.e. keyID must have been reclaimed, etc). Invoke
> - * kvm_x86_flush_shadow_all_private() for this.
> - */
> - static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> kvm_mmu_zap_all(kvm);
> }
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 68dfcdb46ab7..9e8b012aa8cc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -38,7 +38,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> * ultimately frees all roots.
> */
> kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> - kvm_tdp_mmu_zap_invalidated_roots(kvm);
> + kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
>
> WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
> */
> lockdep_assert_held_write(&kvm->mmu_lock);
> - for_each_tdp_mmu_root_yield_safe(kvm, root)
> + __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
> tdp_mmu_zap_root(kvm, root, false);
> }
>
> @@ -1065,11 +1065,14 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
> * zap" completes.
> */
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
> {
> struct kvm_mmu_page *root;
>
> - read_lock(&kvm->mmu_lock);
> + if (shared)
> + read_lock(&kvm->mmu_lock);
> + else
> + write_lock(&kvm->mmu_lock);
>
> for_each_tdp_mmu_root_yield_safe(kvm, root) {
> if (!root->tdp_mmu_scheduled_root_to_zap)
> @@ -1087,7 +1090,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> * that may be zapped, as such entries are associated with the
> * ASID on both VMX and SVM.
> */
> - tdp_mmu_zap_root(kvm, root, true);
> + tdp_mmu_zap_root(kvm, root, shared);
>
> /*
> * The referenced needs to be put *after* zapping the root, as
> @@ -1097,7 +1100,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> kvm_tdp_mmu_put_root(kvm, root);
> }
>
> - read_unlock(&kvm->mmu_lock);
> + if (shared)
> + read_unlock(&kvm->mmu_lock);
> + else
> + write_unlock(&kvm->mmu_lock);
> }
>
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 56741d31048a..7927fa4a96e0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -68,7 +68,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
> void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
> enum kvm_tdp_mmu_root_types root_types);
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
>
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b6828e35eb17..3f9bfcd3e152 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -98,16 +98,12 @@ static int vt_vm_init(struct kvm *kvm)
> return vmx_vm_init(kvm);
> }
>
> -static void vt_flush_shadow_all_private(struct kvm *kvm)
> -{
> - if (is_td(kvm))
> - tdx_mmu_release_hkid(kvm);
> -}
> -
> static void vt_vm_destroy(struct kvm *kvm)
> {
> - if (is_td(kvm))
> + if (is_td(kvm)) {
> + tdx_mmu_release_hkid(kvm);
> return;
> + }
>
> vmx_vm_destroy(kvm);
> }
> @@ -980,7 +976,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vm_size = sizeof(struct kvm_vmx),
> .vm_enable_cap = vt_vm_enable_cap,
> .vm_init = vt_vm_init,
> - .flush_shadow_all_private = vt_flush_shadow_all_private,
> .vm_destroy = vt_vm_destroy,
> .vm_free = vt_vm_free,
>
>
--
Isaku Yamahata <isaku.yamahata@intel.com>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
2024-07-18 15:28 ` Isaku Yamahata
@ 2024-07-18 15:55 ` Edgecombe, Rick P
0 siblings, 0 replies; 47+ messages in thread
From: Edgecombe, Rick P @ 2024-07-18 15:55 UTC (permalink / raw)
To: Yamahata, Isaku
Cc: Gao, Chao, isaku.yamahata@linux.intel.com, seanjc@google.com,
Huang, Kai, pbonzini@redhat.com, sagis@google.com,
isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
Aktas, Erdem, dmatlack@google.com, Zhao, Yan Y,
kvm@vger.kernel.org
On Thu, 2024-07-18 at 08:28 -0700, Isaku Yamahata wrote:
> Although it's a bit late, it's for the record.
>
> It's to optimize the destruction of Secure-EPT: free the HKID early and
> destruct Secure-EPT by TDH.PHYMEM.PAGE.RECLAIM(). QEMU doesn't close any KVM
> file descriptors on exit. (A gmem fd references the KVM VM fd, so VM
> destruction happens after all gmem fds are closed. Closing a gmem fd causes
> Secure-EPT zapping before releasing the HKID.)
>
> Because we're ignoring such optimization for now, we can simply defer
> releasing the HKID, following Sean's call.
Thanks for the background.
>
>
> > But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus,
> > so we could maybe actually just do the tdx_mmu_release_hkid() part there.
> > Then drop the flush_shadow_all_private x86 op. See the (not thoroughly
> > checked) diff at the bottom of this mail.
>
> Yep, we can release the HKID at VM destruction, with potentially slow zapping
> of Secure-EPT. The following change basically looks good to me.
> (The callback for Secure-EPT can be simplified.)
Thread overview: 47+ messages
2024-06-19 22:35 [PATCH v3 00/17] TDX MMU prep series part 1 Rick Edgecombe
2024-06-19 22:35 ` [PATCH v3 01/17] KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE Rick Edgecombe
2024-06-19 22:35 ` [PATCH v3 02/17] KVM: Add member to struct kvm_gfn_range for target alias Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 03/17] KVM: x86: Add a VM type define for TDX Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 04/17] KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page Rick Edgecombe
2024-07-03 7:03 ` Yan Zhao
2024-06-19 22:36 ` [PATCH v3 05/17] KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_role Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 06/17] KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 07/17] KVM: x86/tdp_mmu: Take struct kvm in iter loops Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 08/17] KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep() Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 09/17] KVM: x86/mmu: Support GFN direct bits Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 10/17] KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root() Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 11/17] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 12/17] KVM: x86/tdp_mmu: Take root in tdp_mmu_for_each_pte() Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 13/17] KVM: x86/tdp_mmu: Support mirror root for TDP MMU Rick Edgecombe
2024-06-24 8:30 ` Yan Zhao
2024-06-25 0:51 ` Edgecombe, Rick P
2024-06-25 5:43 ` Yan Zhao
2024-06-25 20:33 ` Edgecombe, Rick P
2024-06-26 5:05 ` Yan Zhao
2024-07-03 19:40 ` Edgecombe, Rick P
2024-07-04 8:09 ` Yan Zhao
2024-07-09 22:36 ` Edgecombe, Rick P
2024-07-04 8:51 ` Yan Zhao
2024-07-09 22:38 ` Edgecombe, Rick P
2024-07-11 23:54 ` Edgecombe, Rick P
2024-07-12 1:42 ` Yan Zhao
2024-06-19 22:36 ` [PATCH v3 14/17] KVM: x86/tdp_mmu: Propagate attr_filter to MMU notifier callbacks Rick Edgecombe
2024-06-19 22:36 ` [PATCH v3 15/17] KVM: x86/tdp_mmu: Propagate building mirror page tables Rick Edgecombe
2024-06-20 5:15 ` Binbin Wu
2024-06-24 23:52 ` Edgecombe, Rick P
2024-06-19 22:36 ` [PATCH v3 16/17] KVM: x86/tdp_mmu: Propagate tearing down " Rick Edgecombe
2024-06-20 8:44 ` Binbin Wu
2024-06-24 23:55 ` Edgecombe, Rick P
2024-06-19 22:36 ` [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots() Rick Edgecombe
2024-06-21 7:10 ` Yan Zhao
2024-06-21 19:08 ` Edgecombe, Rick P
2024-06-24 8:29 ` Yan Zhao
2024-06-24 23:15 ` Edgecombe, Rick P
2024-06-25 6:14 ` Yan Zhao
2024-06-25 20:56 ` Edgecombe, Rick P
2024-06-26 2:25 ` Yan Zhao
2024-07-03 20:00 ` Edgecombe, Rick P
2024-07-05 1:16 ` Yan Zhao
2024-07-09 22:52 ` Edgecombe, Rick P
2024-07-18 15:28 ` Isaku Yamahata
2024-07-18 15:55 ` Edgecombe, Rick P