* [PATCH v2 00/24] TDX MMU Part 2
From: Yan Zhao @ 2024-11-12 7:33 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86, Yan Zhao
Hi,
Here is v2 of the TDX “MMU part 2” series.
As discussed earlier, the non-nit feedback from v1[0] has been applied.
- Among them, the patch "KVM: TDX: MTRR: implement get_mt_mask() for TDX"
  was dropped. The self-snoop feature was not made a dependency for enabling
  TDX because the check for self-snoop was not included in
  kvm_mmu_may_ignore_guest_pat() in the base code. So, strictly speaking,
  the current code would incorrectly zap the mirrored root if non-coherent
  DMA devices were hot-plugged.
There were also a few minor issues that I noticed and fixed without internal
discussion (noted in each patch's version log).
It’s now ready to hand off to Paolo/kvm-coco-queue.
One remaining item that requires further discussion is "How to handle
the TDX module lock contention (i.e. SEAMCALL retry replacements)".
The basis for future discussions includes:
(1) TDH.MEM.TRACK can contend with TDH.VP.ENTER on the TD epoch lock.
(2) TDH.VP.ENTER contends with TDH.MEM* on S-EPT tree lock when 0-stepping
mitigation is triggered.
- The zero-step mitigation counter is kept per-vCPU; mitigation triggers
  when the TDX module finds that EPT violations were caused by the same
  RIP as in the last TDH.VP.ENTER for 6 consecutive times.
The threshold value 6 is explained as
"There can be at most 2 mapping faults on instruction fetch
(x86 macro-instructions length is at most 15 bytes) when the
instruction crosses page boundary; then there can be at most 2
mapping faults for each memory operand, when the operand crosses
page boundary. For most of x86 macro-instructions, there are up to 2
memory operands and each one of them is small, which brings us to
maximum 2+2*2 = 6 legal mapping faults."
- If the EPT violations received by KVM are caused by
TDG.MEM.PAGE.ACCEPT, they will not trigger 0-stepping mitigation.
Since a TD is required to call TDG.MEM.PAGE.ACCEPT before accessing
private memory when configured with pending_ve_disable=Y, 0-stepping
mitigation is not expected to occur in such a TD.
(3) TDG.MEM.PAGE.ACCEPT can contend with SEAMCALLs TDH.MEM*.
(Actually, TDG.MEM.PAGE.ATTR.RD or TDG.MEM.PAGE.ATTR.WR can also
contend with SEAMCALLs TDH.MEM*. Although we don't need to consider
these two TDCALLs when enabling basic TDX, they are allowed by the
TDX module, and we can't control whether a TD invokes a TDCALL or
not).
The "KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" is
still in place in this series (at the tail), but we should drop it when we
finalize on the real solution.
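For context, the HACK patch essentially retries the contended SEAMCALL a
bounded number of times. A minimal sketch, where the error-code check,
"kvm_tdx->tdr_pa" and the retry bound are placeholders rather than the
agreed-upon design:

  int retries = 0;
  u64 entry, level_state, err;

  do {
          err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa,
                                 &entry, &level_state);
  } while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT) &&
           ++retries < 16);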
This series has 5 commits intended to collect Acks from x86 maintainers.
These commits introduce and export SEAMCALL wrappers to allow KVM to manage
the S-EPT (the EPT that maps private memory and is protected by the TDX
module):
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
pages
x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
contents
This series is based off of a kvm-coco-queue commit and some pre-req
series:
1. commit ee69eb746754 ("KVM: x86/mmu: Prevent aliased memslot GFNs") (in
kvm-coco-queue).
2. v7 of "TDX host: metadata reading tweaks, bug fix and info dump" [1].
3. v1 of "KVM: VMX: Initialize TDX when loading KVM module" [2], with some
new feedback from Sean.
4. v2 of “TDX vCPU/VM creation” [3]
It requires TDX module 1.5.06.00.0744[4], or later, due to the removal of the
workarounds for the lack of the NO_RBP_MOD feature required by the kernel.
NO_RBP_MOD is now enabled (in the VM/vCPU creation patches), and this
particular version of the TDX module has a required NO_RBP_MOD related bug
fix.
A working edk2 commit is 95d8a1c ("UnitTestFrameworkPkg: Use TianoCore
mirror of subhook submodule").
The series has been tested as part of the development branch for the TDX
base series. The testing consisted of TDX kvm-unit-tests, TDX-enhanced KVM
selftests, and booting a Linux TD.
The full KVM branch is here:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-11.3
Matching QEMU:
https://github.com/intel-staging/qemu-tdx/commits/tdx-qemu-upstream-v6.1/
[0] https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/kvm/cover.1731318868.git.kai.huang@intel.com/#t
[2] https://lore.kernel.org/kvm/cover.1730120881.git.kai.huang@intel.com/
[3] https://lore.kernel.org/kvm/20241030190039.77971-1-rick.p.edgecombe@intel.com/
[4] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06
Isaku Yamahata (17):
KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
KVM: TDX: Add accessors VMX VMCS helpers
KVM: TDX: Set gfn_direct_bits to shared bit
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
pages
x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
contents
KVM: TDX: Require TDP MMU and mmio caching for TDX
KVM: x86/mmu: Add setter for shadow_mmio_value
KVM: TDX: Set per-VM shadow_mmio_value to 0
KVM: TDX: Handle TLB tracking for TDX
KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page
table
KVM: TDX: Implement hook to get max mapping level of private pages
KVM: TDX: Add an ioctl to create initial guest memory
KVM: TDX: Finalize VM initialization
KVM: TDX: Handle vCPU dissociation
Rick Edgecombe (3):
KVM: x86/mmu: Implement memslot deletion for TDX
KVM: VMX: Teach EPT violation helper about private mem
KVM: x86/mmu: Export kvm_tdp_map_page()
Sean Christopherson (2):
KVM: VMX: Split out guts of EPT violation to common/exposed function
KVM: TDX: Add load_mmu_pgd method for TDX
Yan Zhao (1):
KVM: x86/mmu: Do not enable page track for TD guest
Yuan Yao (1):
[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand
SEPT
arch/x86/include/asm/tdx.h | 9 +
arch/x86/include/asm/vmx.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 10 +
arch/x86/kvm/mmu.h | 4 +
arch/x86/kvm/mmu/mmu.c | 7 +-
arch/x86/kvm/mmu/page_track.c | 3 +
arch/x86/kvm/mmu/spte.c | 8 +-
arch/x86/kvm/mmu/tdp_mmu.c | 37 +-
arch/x86/kvm/vmx/common.h | 43 ++
arch/x86/kvm/vmx/main.c | 104 ++++-
arch/x86/kvm/vmx/tdx.c | 727 +++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 93 ++++
arch/x86/kvm/vmx/tdx_arch.h | 23 +
arch/x86/kvm/vmx/vmx.c | 25 +-
arch/x86/kvm/vmx/x86_ops.h | 51 +++
arch/x86/virt/vmx/tdx/tdx.c | 176 ++++++++
arch/x86/virt/vmx/tdx/tdx.h | 8 +
virt/kvm/kvm_main.c | 1 +
18 files changed, 1278 insertions(+), 52 deletions(-)
create mode 100644 arch/x86/kvm/vmx/common.h
--
2.43.2
* [PATCH v2 01/24] KVM: x86/mmu: Implement memslot deletion for TDX
From: Yan Zhao @ 2024-11-12 7:34 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Rick Edgecombe <rick.p.edgecombe@intel.com>
Update the attr_filter field to zap both private and shared mappings for TDX
when a memslot is deleted.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- Update commit msg.
TDX MMU part 2 v1:
- Clarify TDX limits on zapping private memory (Sean)
Memslot quirk series:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2e253a488949..68bb78a2306c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7126,6 +7126,7 @@ static void kvm_mmu_zap_memslot(struct kvm *kvm,
.start = slot->base_gfn,
.end = slot->base_gfn + slot->npages,
.may_block = true,
+ .attr_filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED,
};
bool flush;
--
2.43.2
* [PATCH v2 02/24] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
From: Yan Zhao @ 2024-11-12 7:34 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Export a function to walk down the TDP MMU without modifying it and simply
check whether a GPA is mapped.
Future changes will support pre-populating TDX private memory. In order to
implement this, KVM will need to check whether a given GFN is already
pre-populated in the mirrored EPT. [1]
There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
the KVM MMU that almost does what is required. However, to make sense of
the results, MMU internal PTE helpers are needed. Refactor the code to
provide a helper that can be used outside of the KVM MMU code.
Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.
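For illustration only, a minimal sketch of how a later caller (e.g. a TDX
pre-population path, not part of this patch) might use the exported helper:

  bool mapped;

  /* Hypothetical caller; mmu_lock must be held per the lockdep assert. */
  read_lock(&kvm->mmu_lock);
  mapped = kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa);
  read_unlock(&kvm->mmu_lock);

  if (!mapped)
          return -EIO;    /* placeholder error handling */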
kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini.
Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1]
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU part 2 v2:
- Added Paolo's rb
TDX MMU part 2 v1:
- Change exported function to just return whether the GPA is mapped because
"You are executing with the filemap_invalidate_lock() taken, and therefore
cannot race with kvm_gmem_punch_hole()" (Paolo)
https://lore.kernel.org/kvm/CABgObfbpNN842noAe77WYvgi5MzK2SAA_FYw-=fGa+PcT_Z22w@mail.gmail.com/
- Take root hpa instead of enum (Paolo)
TDX MMU Prep v2:
- Rename function with "mirror" and use root enum
TDX MMU Prep:
- New patch
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 3 +--
arch/x86/kvm/mmu/tdp_mmu.c | 37 ++++++++++++++++++++++++++++++++-----
3 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0fa86e47e9f3..398b6b06ed73 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -252,6 +252,9 @@ extern bool tdp_mmu_enabled;
#define tdp_mmu_enabled false
#endif
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 68bb78a2306c..3a338df541c1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4743,8 +4743,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}
-static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
- u8 *level)
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
{
int r;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d7d60116672b..b0e1c4cb3004 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1898,16 +1898,13 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
*
* Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
*/
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
- int *root_level)
+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ struct kvm_mmu_page *root)
{
- struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
struct tdp_iter iter;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;
- *root_level = vcpu->arch.mmu->root_role.level;
-
tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
@@ -1916,6 +1913,36 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
return leaf;
}
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level)
+{
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+ *root_level = vcpu->arch.mmu->root_role.level;
+
+ return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
+}
+
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
+{
+ struct kvm *kvm = vcpu->kvm;
+ bool is_direct = kvm_is_addr_direct(kvm, gpa);
+ hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
+ vcpu->arch.mmu->mirror_root_hpa;
+ u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
+ int leaf;
+
+ lockdep_assert_held(&kvm->mmu_lock);
+ rcu_read_lock();
+ leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
+ rcu_read_unlock();
+ if (leaf < 0)
+ return false;
+
+ spte = sptes[leaf];
+ return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
+
/*
* Returns the last level spte pointer of the shadow page walk for the given
* gpa, and sets *spte to the spte value. This spte may be non-preset. If no
--
2.43.2
* [PATCH v2 03/24] KVM: x86/mmu: Do not enable page track for TD guest
From: Yan Zhao @ 2024-11-12 7:35 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
Fail kvm_page_track_write_tracking_enabled() if the VM type is TDX to make
external page track users fail in kvm_page_track_register_notifier(), since
TDX does not support write protection and hence page tracking.
No need to fail KVM's internal users of page tracking (i.e. for shadow
paging), because TDX always runs with EPT enabled and the TDX module
currently does not emulate and send VMLAUNCH/VMRESUME VM exits to the VMM.
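For illustration, a sketch of what a hypothetical external page-track user
would now observe on a TDX VM (the notifier callbacks here are made up):

  static struct kvm_page_track_notifier_node node = {
          .track_write         = my_track_write,
          .track_remove_region = my_track_remove_region,
  };

  /*
   * Returns -EOPNOTSUPP for a KVM_X86_TDX_VM because
   * kvm_enable_external_write_tracking() rejects the VM type.
   */
  int ret = kvm_page_track_register_notifier(kvm, &node);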
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
---
TDX MMU part 2 v2:
- Move the checking of VM type from kvm_page_track_write_tracking_enabled()
to kvm_enable_external_write_tracking() to make
kvm_page_track_register_notifier() fail. (Paolo)
- Updated patch msg (Yan)
- Added Paolo's Suggested-by tag since the patch is simple enough and the
current implementation was suggested by Paolo (Yan)
v19:
- drop TDX: from the short log
- Added reviewed-by: BinBin
---
arch/x86/kvm/mmu/page_track.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 561c331fd6ec..1b17b12393a8 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -172,6 +172,9 @@ static int kvm_enable_external_write_tracking(struct kvm *kvm)
struct kvm_memory_slot *slot;
int r = 0, i, bkt;
+ if (kvm->arch.vm_type == KVM_X86_TDX_VM)
+ return -EOPNOTSUPP;
+
mutex_lock(&kvm->slots_arch_lock);
/*
--
2.43.2
* [PATCH v2 04/24] KVM: VMX: Split out guts of EPT violation to common/exposed function
From: Yan Zhao @ 2024-11-12 7:35 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Sean Christopherson <sean.j.christopherson@intel.com>
TDX EPT violations differ only in how the information (GPA and exit
qualification) is retrieved. To share the code that handles EPT violations,
split out the guts of the EPT violation handler so that the VMX/TDX exit
handlers can call it after retrieving the GPA and exit qualification.
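For illustration, a hedged sketch of how a TDX EPT violation exit handler
could reuse the split-out helper; tdx_handle_ept_violation(), tdexit_gpa()
and tdexit_exit_qual() are placeholders for however TDX retrieves the exit
info, not functions added by this patch:

  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
  {
          unsigned long exit_qual = tdexit_exit_qual(vcpu);
          gpa_t gpa = tdexit_gpa(vcpu);

          return __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
  }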
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
arch/x86/kvm/vmx/common.h | 34 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 25 +++----------------------
2 files changed, 37 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/kvm/vmx/common.h
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
new file mode 100644
index 000000000000..78ae39b6cdcd
--- /dev/null
+++ b/arch/x86/kvm/vmx/common.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+
+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long exit_qualification)
+{
+ u64 error_code;
+
+ /* Is it a read fault? */
+ error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+ ? PFERR_USER_MASK : 0;
+ /* Is it a write fault? */
+ error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+ ? PFERR_WRITE_MASK : 0;
+ /* Is it a fetch fault? */
+ error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+ ? PFERR_FETCH_MASK : 0;
+ /* ept page table entry is present? */
+ error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
+ ? PFERR_PRESENT_MASK : 0;
+
+ if (error_code & EPT_VIOLATION_GVA_IS_VALID)
+ error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
+ PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+ return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
+#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 976fe6579f62..f7ae2359cea2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -53,6 +53,7 @@
#include <trace/events/ipi.h>
#include "capabilities.h"
+#include "common.h"
#include "cpuid.h"
#include "hyperv.h"
#include "kvm_onhyperv.h"
@@ -5774,11 +5775,8 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
static int handle_ept_violation(struct kvm_vcpu *vcpu)
{
- unsigned long exit_qualification;
+ unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
gpa_t gpa;
- u64 error_code;
-
- exit_qualification = vmx_get_exit_qual(vcpu);
/*
* EPT violation happened while executing iret from NMI,
@@ -5794,23 +5792,6 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
trace_kvm_page_fault(vcpu, gpa, exit_qualification);
- /* Is it a read fault? */
- error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
- ? PFERR_USER_MASK : 0;
- /* Is it a write fault? */
- error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
- ? PFERR_WRITE_MASK : 0;
- /* Is it a fetch fault? */
- error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
- ? PFERR_FETCH_MASK : 0;
- /* ept page table entry is present? */
- error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
- ? PFERR_PRESENT_MASK : 0;
-
- if (error_code & EPT_VIOLATION_GVA_IS_VALID)
- error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
- PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
-
/*
* Check that the GPA doesn't exceed physical memory limits, as that is
* a guest page fault. We have to emulate the instruction here, because
@@ -5822,7 +5803,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
return kvm_emulate_instruction(vcpu, 0);
- return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+ return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
}
static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
--
2.43.2
* [PATCH v2 05/24] KVM: VMX: Teach EPT violation helper about private mem
From: Yan Zhao @ 2024-11-12 7:35 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Rick Edgecombe <rick.p.edgecombe@intel.com>
Teach the EPT violation helper to check the shared mask of a GPA to find out
whether the GPA is for private memory.
When an EPT violation is triggered by a TD accessing a private GPA, KVM will
exit to user space if the corresponding GFN's attribute is not private.
User space will then update the GFN's attribute during its memory conversion
process. After that, the TD will re-access the private GPA and trigger the
EPT violation again. Only when the GFN's attribute matches private will KVM
fault in the private page, map it in the mirrored TDP root, and propagate
changes to the private EPT to resolve the EPT violation.
Relying on the GFN attribute tracking xarray to determine whether a GFN is
private, as done for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
violations.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- Rename kvm_is_private_gpa() to vt_is_tdx_private_gpa() (Paolo)
TDX MMU part 2 v1:
- Split from "KVM: TDX: handle ept violation/misconfig exit"
---
arch/x86/kvm/vmx/common.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 78ae39b6cdcd..7a592467a044 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -6,6 +6,12 @@
#include "mmu.h"
+static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
+{
+ /* For TDX the direct mask is the shared mask. */
+ return !kvm_is_addr_direct(kvm, gpa);
+}
+
static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
unsigned long exit_qualification)
{
@@ -28,6 +34,9 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+ if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
+ error_code |= PFERR_PRIVATE_ACCESS;
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}
--
2.43.2
* [PATCH v2 06/24] KVM: TDX: Add accessors VMX VMCS helpers
From: Yan Zhao @ 2024-11-12 7:35 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS. Introduce helper accessors to hide the SEAMCALL ABI details.
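For example, with the accessors generated by TDX_BUILD_TDVPS_ACCESSORS()
below, a later patch in this series writes the shared EPT pointer as shown,
and natural-width fields use the 64-bit accessor since TDX is 64-bit only
(the GUEST_RIP read is illustrative only):

  td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
  u64 rip = td_vmcs_read64(to_tdx(vcpu), GUEST_RIP);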
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
-Move inline warning msgs to tdh_vp_rd/wr_failed() (Paolo)
TDX MMU part 2 v1:
- Update for the wrapper functions for SEAMCALLs. (Sean)
- Eliminate kvm_mmu_free_private_spt() and open code it.
- Fix bisectability issues in headers (Kai)
- Updates from seamcall overhaul (Kai)
v19:
- deleted unnecessary stub functions,
tdvps_state_non_arch_check() and tdvps_management_check().
---
arch/x86/kvm/vmx/tdx.c | 13 +++++++
arch/x86/kvm/vmx/tdx.h | 88 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 101 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7fb32d3b1aae..ed4473d0c2cd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -32,6 +32,19 @@ static enum cpuhp_state tdx_cpuhp_state;
static const struct tdx_sys_info *tdx_sysinfo;
+void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err)
+{
+ KVM_BUG_ON(1, tdx->vcpu.kvm);
+ pr_err("TDH_VP_RD[%s.0x%x] failed 0x%llx\n", uclass, field, err);
+}
+
+void tdh_vp_wr_failed(struct vcpu_tdx *tdx, char *uclass, char *op, u32 field,
+ u64 val, u64 err)
+{
+ KVM_BUG_ON(1, tdx->vcpu.kvm);
+ pr_err("TDH_VP_WR[%s.0x%x]%s0x%llx failed: 0x%llx\n", uclass, field, op, val, err);
+}
+
#define KVM_SUPPORTED_TD_ATTRS (TDX_TD_ATTR_SEPT_VE_DISABLE)
static u64 tdx_get_supported_attrs(const struct tdx_sys_info_td_conf *td_conf)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 1b78a7ea988e..727bcf25d731 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -49,6 +49,10 @@ struct vcpu_tdx {
enum vcpu_tdx_state state;
};
+void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
+void tdh_vp_wr_failed(struct vcpu_tdx *tdx, char *uclass, char *op, u32 field,
+ u64 val, u64 err);
+
static inline bool is_td(struct kvm *kvm)
{
return kvm->arch.vm_type == KVM_X86_TDX_VM;
@@ -80,6 +84,90 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
}
return data;
}
+
+static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
+{
+#define VMCS_ENC_ACCESS_TYPE_MASK 0x1UL
+#define VMCS_ENC_ACCESS_TYPE_FULL 0x0UL
+#define VMCS_ENC_ACCESS_TYPE_HIGH 0x1UL
+#define VMCS_ENC_ACCESS_TYPE(field) ((field) & VMCS_ENC_ACCESS_TYPE_MASK)
+
+ /* TDX is 64bit only. HIGH field isn't supported. */
+ BUILD_BUG_ON_MSG(__builtin_constant_p(field) &&
+ VMCS_ENC_ACCESS_TYPE(field) == VMCS_ENC_ACCESS_TYPE_HIGH,
+ "Read/Write to TD VMCS *_HIGH fields not supported");
+
+ BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
+
+#define VMCS_ENC_WIDTH_MASK GENMASK(14, 13)
+#define VMCS_ENC_WIDTH_16BIT (0UL << 13)
+#define VMCS_ENC_WIDTH_64BIT (1UL << 13)
+#define VMCS_ENC_WIDTH_32BIT (2UL << 13)
+#define VMCS_ENC_WIDTH_NATURAL (3UL << 13)
+#define VMCS_ENC_WIDTH(field) ((field) & VMCS_ENC_WIDTH_MASK)
+
+ /* TDX is 64bit only. i.e. natural width = 64bit. */
+ BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
+ (VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_64BIT ||
+ VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_NATURAL),
+ "Invalid TD VMCS access for 64-bit field");
+ BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
+ VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_32BIT,
+ "Invalid TD VMCS access for 32-bit field");
+ BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
+ VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_16BIT,
+ "Invalid TD VMCS access for 16-bit field");
+}
+
+#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
+static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
+ u32 field) \
+{ \
+ u64 err, data; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_rd(tdx->tdvpr_pa, TDVPS_##uclass(field), &data); \
+ if (unlikely(err)) { \
+ tdh_vp_rd_failed(tdx, #uclass, field, err); \
+ return 0; \
+ } \
+ return (u##bits)data; \
+} \
+static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx, \
+ u32 field, u##bits val) \
+{ \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), val, \
+ GENMASK_ULL(bits - 1, 0)); \
+ if (unlikely(err)) \
+ tdh_vp_wr_failed(tdx, #uclass, " = ", field, (u64)val, err); \
+} \
+static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx, \
+ u32 field, u64 bit) \
+{ \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), bit, bit); \
+ if (unlikely(err)) \
+ tdh_vp_wr_failed(tdx, #uclass, " |= ", field, bit, err); \
+} \
+static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx, \
+ u32 field, u64 bit) \
+{ \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), 0, bit); \
+ if (unlikely(err)) \
+ tdh_vp_wr_failed(tdx, #uclass, " &= ~", field, bit, err);\
+}
+
+TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
#else
static inline void tdx_bringup(void) {}
static inline void tdx_cleanup(void) {}
--
2.43.2
* [PATCH v2 07/24] KVM: TDX: Add load_mmu_pgd method for TDX
From: Yan Zhao @ 2024-11-12 7:36 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Sean Christopherson <sean.j.christopherson@intel.com>
TDX uses two EPT pointers, one for the private half of the GPA space and
one for the shared half. The private half uses the normal EPT_POINTER vmcs
field, which is managed in a special way by the TDX module. For TDX, KVM is
not allowed to operate on it directly. The shared half uses a new
SHARED_EPT_POINTER field and will be managed by the conventional MMU
management operations that operate directly on the EPT root. This means for
TDX the .load_mmu_pgd() operation will need to know to use the
SHARED_EPT_POINTER field instead of the normal one. Add a new wrapper in
x86 ops for load_mmu_pgd() that directs the write to either the existing
VMX implementation or a TDX one.
tdx_load_mmu_pgd() is much simpler than vmx_load_mmu_pgd() since, for the
TDX mode of operation, EPT will always be used and KVM does not need to be
involved in virtualizing CR3 behavior. So tdx_load_mmu_pgd() can simply
write to SHARED_EPT_POINTER.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU part 2 v2:
-Check shared EPT level matches to direct bits mask in tdx_load_mmu_pgd()
(Chao Gao)
TDX MMU part 2 v1:
- update the commit msg with the version rephrased by Rick.
https://lore.kernel.org/all/78b1024ec3f5868e228baf797c6be98c5397bd49.camel@intel.com/
v19:
- Add WARN_ON_ONCE() to tdx_load_mmu_pgd() and drop unconditional mask
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/vmx/main.c | 13 ++++++++++++-
arch/x86/kvm/vmx/tdx.c | 15 +++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
4 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f7fd4369b821..9298fb9d4bb3 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -256,6 +256,7 @@ enum vmcs_field {
TSC_MULTIPLIER_HIGH = 0x00002033,
TERTIARY_VM_EXEC_CONTROL = 0x00002034,
TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
+ SHARED_EPT_POINTER = 0x0000203C,
PID_POINTER_TABLE = 0x00002042,
PID_POINTER_TABLE_HIGH = 0x00002043,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d28ffddd766f..3c292b4a063a 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -98,6 +98,17 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_vcpu_reset(vcpu, init_event);
}
+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+ int pgd_level)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+ return;
+ }
+
+ vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -229,7 +240,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.write_tsc_offset = vmx_write_tsc_offset,
.write_tsc_multiplier = vmx_write_tsc_multiplier,
- .load_mmu_pgd = vmx_load_mmu_pgd,
+ .load_mmu_pgd = vt_load_mmu_pgd,
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ed4473d0c2cd..785ee9f95504 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -28,6 +28,9 @@
bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);
+#define TDX_SHARED_BIT_PWL_5 gpa_to_gfn(BIT_ULL(51))
+#define TDX_SHARED_BIT_PWL_4 gpa_to_gfn(BIT_ULL(47))
+
static enum cpuhp_state tdx_cpuhp_state;
static const struct tdx_sys_info *tdx_sysinfo;
@@ -495,6 +498,18 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
tdx->state = VCPU_TD_STATE_UNINITIALIZED;
}
+
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
+{
+ u64 shared_bit = (pgd_level == 5) ? TDX_SHARED_BIT_PWL_5 :
+ TDX_SHARED_BIT_PWL_4;
+
+ if (KVM_BUG_ON(shared_bit != kvm_gfn_direct_bits(vcpu->kvm), vcpu->kvm))
+ return;
+
+ td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 4739891858ea..f49135094c94 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -129,6 +129,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
#else
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
@@ -140,6 +142,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
#endif
#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.43.2
* [PATCH v2 08/24] KVM: TDX: Set gfn_direct_bits to shared bit
From: Yan Zhao @ 2024-11-12 7:36 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Make the direct root handle memslot GFNs at an alias with the TDX shared
bit set.
For TDX shared memory, the memslot GFNs need to be mapped at an alias with
the shared bit set. These shared mappings will be mapped on the KVM MMU's
"direct" root. The direct root has it's mappings shifted by applying
"gfn_direct_bits" as a mask. The concept of "GPAW" (guest physical address
width) determines the location of the shared bit. So set gfn_direct_bits
based on this, to map shared memory at the proper GPA.
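As a worked example (illustrative only, for the 4-level EPT / GPAW = 48
case): the shared bit is GPA bit 47, so gfn_direct_bits is
gpa_to_gfn(BIT_ULL(47)), and a shared access to a GFN is handled by the
direct root at the aliased GFN with that bit set:

  gfn_t direct_bits = gpa_to_gfn(BIT_ULL(47));   /* == BIT_ULL(35) */
  gfn_t shared_alias = gfn | direct_bits;        /* GFN seen by the direct root */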
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU part 2 v2:
- Added Paolo's rb
- Use TDX 1.5 naming of config_flags instead of exec_controls (Xiaoyao)
- Use macro TDX_SHARED_BITS_PWL_5 and TDX_SHARED_BITS_PWL_4 for
gfn_direct_bits. (Yan)
TDX MMU part 2 v1:
- Move setting of gfn_direct_bits to separate patch (Yan)
---
arch/x86/kvm/vmx/tdx.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 785ee9f95504..38369cafc175 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1041,6 +1041,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
kvm_tdx->attributes = td_params->attributes;
kvm_tdx->xfam = td_params->xfam;
+ if (td_params->config_flags & TDX_CONFIG_FLAGS_MAX_GPAW)
+ kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_5;
+ else
+ kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_4;
+
kvm_tdx->state = TD_STATE_INITIALIZED;
out:
/* kfree() accepts NULL. */
--
2.43.2
* [PATCH v2 09/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT pages
From: Yan Zhao @ 2024-11-12 7:36 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
TDX architecture introduces the concept of private GPA vs shared GPA,
depending on the GPA.SHARED bit. The TDX module maintains a Secure EPT
(S-EPT or SEPT) tree per TD for private GPA to HPA translation. Wrap the
TDH.MEM.SEPT.ADD SEAMCALL with tdh_mem_sept_add() to provide pages to the
TDX module for building a TD's SEPT tree. (Refer to these pages as SEPT
pages).
Callers need to allocate and provide a normal page to tdh_mem_sept_add(),
which then passes the page to the TDX module via the SEAMCALL
TDH.MEM.SEPT.ADD. The TDX module then installs the page into SEPT tree and
encrypts this SEPT page with the TD's guest keyID. The kernel cannot use
the SEPT page until after reclaiming it via TDH.MEM.SEPT.REMOVE or
TDH.PHYMEM.PAGE.RECLAIM.
Before passing the page to the TDX module, tdh_mem_sept_add() performs a
CLFLUSH on the page mapped with keyID 0 to ensure that any dirty cache
lines don't write back later and clobber TD memory or control structures.
Don't worry about the other MK-TME keyIDs because the kernel doesn't use
them. The TDX docs specify that this flush is not needed unless the TDX
module exposes the CLFLUSH_BEFORE_ALLOC feature bit. Do the CLFLUSH
unconditionally for two reasons: make the solution simpler by having a
single path that can handle both !CLFLUSH_BEFORE_ALLOC and
CLFLUSH_BEFORE_ALLOC cases. Avoid wading into any correctness uncertainty
by going with a conservative solution to start.
Callers should specify "GPA" and "level" for the TDX module to install the
SEPT page at the specified position in the SEPT. Do not include the root
page level in "level" since TDH.MEM.SEPT.ADD can only add non-root pages to
the SEPT. Ensure "level" is between 1 and 3 for a 4-level SEPT or between 1
and 4 for a 5-level SEPT.
Call tdh_mem_sept_add() during the TD's build time or during the TD's
runtime. Check for errors from the function return value and retrieve
extended error info from the function output parameters.
The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs return a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller
may need to handle this error in specific ways (e.g., retry). Therefore,
return the SEAMCALL error code directly to the caller. Don't attempt to
handle it in the core kernel.
TDH.MEM.SEPT.ADD effectively manages two internal resources of the TDX
module: it installs page table pages in the SEPT tree and also updates the
TDX module's page metadata (PAMT). Don't add a wrapper for the matching
SEAMCALL for removing a SEPT page (TDH.MEM.SEPT.REMOVE) because KVM, as the
only in-kernel user, will only tear down the SEPT tree when the TD is being
torn down. When this happens, it can just do other operations that reclaim
the SEPT pages for the host kernel to use, update the PAMT, and let the
SEPT get trashed.
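A hedged sketch of a call site; "kvm_tdx->tdr_pa", "sept_page_hpa" and the
output variable names are placeholders, and the actual KVM usage arrives in
a later patch in this series:

  u64 entry, level_state, err;

  err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gpa, level, sept_page_hpa,
                         &entry, &level_state);
  if (err)
          return -EIO;   /* extended error info is in entry/level_state */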
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- split out TDH.MEM.SEPT.ADD and re-wrote the patch msg. (Yan)
- dropped TDH.MEM.SEPT.REMOVE. (Yan)
- split out from original patch "KVM: TDX: Add C wrapper functions for
SEAMCALLs to the TDX module" and move to x86 core. (Kai)
---
arch/x86/include/asm/tdx.h | 1 +
arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 1 +
3 files changed, 21 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d093dc4350ac..b6f3e5504d4d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -124,6 +124,7 @@ void tdx_guest_keyid_free(unsigned int keyid);
/* SEAMCALL wrappers for creating/destroying/running TDX guests */
u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
+u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx);
u64 tdh_mng_key_config(u64 tdr);
u64 tdh_mng_create(u64 tdr, u64 hkid);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index af121a73de80..1dc9be680475 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1575,6 +1575,25 @@ u64 tdh_mng_addcx(u64 tdr, u64 tdcs)
}
EXPORT_SYMBOL_GPL(tdh_mng_addcx);
+u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ .r8 = hpa,
+ };
+ u64 ret;
+
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ ret = seamcall_ret(TDH_MEM_SEPT_ADD, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_sept_add);
+
u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx)
{
struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index a63037036c91..7624d098515f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -18,6 +18,7 @@
* TDX module SEAMCALL leaf functions
*/
#define TDH_MNG_ADDCX 1
+#define TDH_MEM_SEPT_ADD 3
#define TDH_VP_ADDCX 4
#define TDH_MNG_KEY_CONFIG 8
#define TDH_MNG_CREATE 9
--
2.43.2
* [PATCH v2 10/24] x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
From: Yan Zhao @ 2024-11-12 7:36 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
TDX architecture introduces the concept of private GPA vs shared GPA,
depending on the GPA.SHARED bit. The TDX module maintains a Secure EPT
(S-EPT or SEPT) tree per TD to translate TD's private memory accessed
using a private GPA. Wrap the SEAMCALL TDH.MEM.PAGE.ADD with
tdh_mem_page_add() and TDH.MEM.PAGE.AUG with tdh_mem_page_aug() to add TD
private pages and map them to the TD's private GPAs in the SEPT.
Callers of tdh_mem_page_add() and tdh_mem_page_aug() allocate and provide
normal pages to the wrappers, which then pass those pages to the TDX
module. Before passing the pages to the TDX module, tdh_mem_page_add() and
tdh_mem_page_aug() perform a CLFLUSH on the page mapped with keyID 0 to
ensure that any dirty cache lines don't write back later and clobber TD
memory or control structures. Don't worry about the other MK-TME keyIDs
because the kernel doesn't use them. The TDX docs specify that this flush
is not needed unless the TDX module exposes the CLFLUSH_BEFORE_ALLOC
feature bit. Do the CLFLUSH unconditionally for two reasons: make the
solution simpler by having a single path that can handle both
!CLFLUSH_BEFORE_ALLOC and CLFLUSH_BEFORE_ALLOC cases. Avoid wading into any
correctness uncertainty by going with a conservative solution to start.
Call tdh_mem_page_add() to add a private page to a TD during the TD's build
time (i.e., before TDH.MR.FINALIZE). Specify which GPA the 4K private page
will map to. No need to specify level info since TDH.MEM.PAGE.ADD only adds
pages at 4K level. To provide initial contents to TD, provide an additional
source page residing in memory managed by the host kernel itself (encrypted
with a shared keyID). The TDX module will copy the initial contents from
the source page in shared memory into the private page after mapping the
page in the SEPT to the specified private GPA. The TDX module allows the
source page to be the same page as the private page to be added. In that
case, the TDX module converts and encrypts the source page as a TD private
page.
Call tdh_mem_page_aug() to add a private page to a TD during the TD's
runtime (i.e., after TDH.MR.FINALIZE). TDH.MEM.PAGE.AUG supports adding
huge pages. Specify which GPA the private page will map to, along with
level info embedded in the lower bits of the GPA. The TDX module will
recognize the added page as the TD's private page after the TD's acceptance
with TDCALL TDG.MEM.PAGE.ACCEPT.
tdh_mem_page_add() and tdh_mem_page_aug() may fail. Callers can check
function return value and retrieve extended error info from the function
output parameters.
The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs return a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller
may need to handle this error in specific ways (e.g., retry). Therefore,
return the SEAMCALL error code directly to the caller. Don't attempt to
handle it in the core kernel.
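A hedged sketch contrasting the two wrappers; the caller-side variable names
("kvm_tdx->tdr_pa", "source_hpa", etc.) are placeholders:

  u64 entry, level_state, err;

  /* Build time: add a 4K page at "gpa", copying initial contents from the
   * page at "source_hpa" (which may be the same page as "hpa"). */
  err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_hpa,
                         &entry, &level_state);

  /* Runtime: map a (possibly huge) page; per the description above, the
   * level is carried in the low bits of the GPA argument. */
  err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa | level, hpa,
                         &entry, &level_state);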
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- split out TDH.MEM.PAGE.ADD/AUG and re-wrote the patch msg (Yan).
- removed the code comment in tdh_mem_page_add() about rcx/rdx since the
callers still need to check for accurate interpretation from spec and
need to put the comment in a central place (Yan, Reinette).
- split out from original patch "KVM: TDX: Add C wrapper functions for
SEAMCALLs to the TDX module" and move to x86 core (Kai)
---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/virt/vmx/tdx/tdx.c | 39 +++++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 2 ++
3 files changed, 43 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b6f3e5504d4d..d363aa201283 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -124,8 +124,10 @@ void tdx_guest_keyid_free(unsigned int keyid);
/* SEAMCALL wrappers for creating/destroying/running TDX guests */
u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
+u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx);
u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx);
+u64 tdh_mem_page_aug(u64 tdr, u64 gpa, u64 hpa, u64 *rcx, u64 *rdx);
u64 tdh_mng_key_config(u64 tdr);
u64 tdh_mng_create(u64 tdr, u64 hkid);
u64 tdh_vp_create(u64 tdr, u64 tdvpr);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1dc9be680475..e63e3cfd41fc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1575,6 +1575,26 @@ u64 tdh_mng_addcx(u64 tdr, u64 tdcs)
}
EXPORT_SYMBOL_GPL(tdh_mng_addcx);
+u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa,
+ .rdx = tdr,
+ .r8 = hpa,
+ .r9 = source,
+ };
+ u64 ret;
+
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ ret = seamcall_ret(TDH_MEM_PAGE_ADD, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_add);
+
u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx)
{
struct tdx_module_args args = {
@@ -1606,6 +1626,25 @@ u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx)
}
EXPORT_SYMBOL_GPL(tdh_vp_addcx);
+u64 tdh_mem_page_aug(u64 tdr, u64 gpa, u64 hpa, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa,
+ .rdx = tdr,
+ .r8 = hpa,
+ };
+ u64 ret;
+
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_aug);
+
u64 tdh_mng_key_config(u64 tdr)
{
struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 7624d098515f..d32ed527f67f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -18,8 +18,10 @@
* TDX module SEAMCALL leaf functions
*/
#define TDH_MNG_ADDCX 1
+#define TDH_MEM_PAGE_ADD 2
#define TDH_MEM_SEPT_ADD 3
#define TDH_VP_ADDCX 4
+#define TDH_MEM_PAGE_AUG 6
#define TDH_MNG_KEY_CONFIG 8
#define TDH_MNG_CREATE 9
#define TDH_VP_CREATE 10
--
2.43.2
* [PATCH v2 11/24] x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
From: Yan Zhao @ 2024-11-12 7:36 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
The TDX module defines a TLB tracking protocol to make sure that no logical
processor holds any stale Secure EPT (S-EPT or SEPT) TLB translations for a
given TD private GPA range. After a successful TDH.MEM.RANGE.BLOCK,
TDH.MEM.TRACK, and kicking all vCPUs out of the TD, the TDX module ensures
that the subsequent TDH.VP.ENTER on each vCPU will flush all stale TLB
entries for the GPA ranges specified in TDH.MEM.RANGE.BLOCK. Wrap
TDH.MEM.RANGE.BLOCK with tdh_mem_range_block() and TDH.MEM.TRACK with
tdh_mem_track() to enable the kernel to assist the TDX module in TLB
tracking management.
The caller of tdh_mem_range_block() needs to specify "GPA" and "level" to
request the TDX module to block the subsequent creation of TLB translation
for a GPA range. This GPA range can correspond to a SEPT page or a TD
private page at any level.
Contentions and errors are possible with the SEAMCALL TDH.MEM.RANGE.BLOCK.
Therefore, the caller of tdh_mem_range_block() needs to check the function
return value and retrieve extended error info from the function output
params.
Upon TDH.MEM.RANGE.BLOCK success, no new TLB entries will be created for
the specified private GPA range, though the existing TLB translations may
still persist.
Call tdh_mem_track() after tdh_mem_range_block(). No extra info is required
except the TDR HPA to denote the TD. TDH.MEM.TRACK will advance the TD's
epoch counter to ensure the TDX module flushes TLBs in all vCPUs once the
vCPUs re-enter the TD. TDH.MEM.TRACK will fail to advance the TD's epoch
counter if there are vCPUs still running in non-root mode at the previous
TD epoch counter. Therefore, send IPIs to kick vCPUs out of the TD after
tdh_mem_track() to avoid this failure by forcing all vCPUs to re-enter the
TD.
Contentions are also possible in TDH.MEM.TRACK. For example, TDH.MEM.TRACK
may contend with TDH.VP.ENTER when advancing the TD epoch counter.
tdh_mem_track() does not retry on behalf of the caller. Callers can
choose to avoid contentions or retry on their own.
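Putting the protocol together, a hedged sketch of the zap/flush sequence a
caller would drive; "kvm_tdx->tdr_pa" and the error handling are
placeholders, and the retry policy on contention is left to the caller:

  u64 entry, level_state, err;

  /* 1. Stop new TLB translations being created for the GPA range. */
  err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, level,
                            &entry, &level_state);
  if (err)
          return -EIO;

  /* 2. Advance the TD epoch so stale translations can be flushed. */
  err = tdh_mem_track(kvm_tdx->tdr_pa);

  /* 3. Kick vCPUs out of the TD; their next TDH.VP.ENTER flushes TLB
   *    entries from the previous epoch. */
  kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);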
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- split out TDH.MEM.RANGE.BLOCK and TDH.MEM.TRACK and re-wrote the patch
msg (Yan).
- removed TDH.MEM.RANGE.UNBLOCK since it's unused. (Yan)
- split out from original patch "KVM: TDX: Add C wrapper functions for
SEAMCALLs to the TDX module" and move to x86 core (Yan).
---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/virt/vmx/tdx/tdx.c | 27 +++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 2 ++
3 files changed, 31 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d363aa201283..227cb334176e 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -128,6 +128,7 @@ u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx);
u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
u64 tdh_vp_addcx(u64 tdvpr, u64 tdcx);
u64 tdh_mem_page_aug(u64 tdr, u64 gpa, u64 hpa, u64 *rcx, u64 *rdx);
+u64 tdh_mem_range_block(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx);
u64 tdh_mng_key_config(u64 tdr);
u64 tdh_mng_create(u64 tdr, u64 hkid);
u64 tdh_vp_create(u64 tdr, u64 tdvpr);
@@ -141,6 +142,7 @@ u64 tdh_vp_rd(u64 tdvpr, u64 field, u64 *data);
u64 tdh_vp_wr(u64 tdvpr, u64 field, u64 data, u64 mask);
u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid);
u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8);
+u64 tdh_mem_track(u64 tdr);
u64 tdh_phymem_cache_wb(bool resume);
u64 tdh_phymem_page_wbinvd_tdr(u64 tdr);
#else
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e63e3cfd41fc..f7f83d86ec18 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1645,6 +1645,23 @@ u64 tdh_mem_page_aug(u64 tdr, u64 gpa, u64 hpa, u64 *rcx, u64 *rdx)
}
EXPORT_SYMBOL_GPL(tdh_mem_page_aug);
+u64 tdh_mem_range_block(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+ u64 ret;
+
+ ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_range_block);
+
u64 tdh_mng_key_config(u64 tdr)
{
struct tdx_module_args args = {
@@ -1820,6 +1837,16 @@ u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8)
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
+u64 tdh_mem_track(u64 tdr)
+{
+ struct tdx_module_args args = {
+ .rcx = tdr,
+ };
+
+ return seamcall(TDH_MEM_TRACK, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_mem_track);
+
u64 tdh_phymem_cache_wb(bool resume)
{
struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index d32ed527f67f..e659eee1080a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -22,6 +22,7 @@
#define TDH_MEM_SEPT_ADD 3
#define TDH_VP_ADDCX 4
#define TDH_MEM_PAGE_AUG 6
+#define TDH_MEM_RANGE_BLOCK 7
#define TDH_MNG_KEY_CONFIG 8
#define TDH_MNG_CREATE 9
#define TDH_VP_CREATE 10
@@ -37,6 +38,7 @@
#define TDH_SYS_KEY_CONFIG 31
#define TDH_SYS_INIT 33
#define TDH_SYS_RD 34
+#define TDH_MEM_TRACK 38
#define TDH_SYS_LP_INIT 35
#define TDH_SYS_TDMR_INIT 36
#define TDH_PHYMEM_CACHE_WB 40
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 12/24] x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (10 preceding siblings ...)
2024-11-12 7:36 ` [PATCH v2 11/24] x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking Yan Zhao
@ 2024-11-12 7:36 ` Yan Zhao
2024-11-12 7:37 ` [PATCH v2 13/24] x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial contents Yan Zhao
` (14 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:36 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
The TDX architecture introduces the concept of a private GPA vs. a shared
GPA, depending on the GPA.SHARED bit. The TDX module maintains a single
Secure EPT (S-EPT or SEPT) tree per TD to translate the TD's private memory
accessed using a private GPA. Wrap the SEAMCALL TDH.MEM.PAGE.REMOVE with
tdh_mem_page_remove() and TDH_PHYMEM_PAGE_WBINVD with
tdh_phymem_page_wbinvd_hkid() to unmap a TD private page from the SEPT,
remove it from the TDX module, and flush its cache lines to memory after
removal.
Callers should specify "GPA" and "level" when calling tdh_mem_page_remove()
to indicate to the TDX module which TD private page to unmap and remove.
TDH.MEM.PAGE.REMOVE may fail, and the caller of tdh_mem_page_remove() can
check the function return value and retrieve extended error information
from the function output parameters. Follow the TLB tracking protocol
before calling tdh_mem_page_remove() to remove a TD private page to avoid
SEAMCALL failure.
After removing a TD's private page, the TDX module does not write back and
invalidate cache lines associated with the page and the page's keyID (i.e.,
the TD's guest keyID). Therefore, provide tdh_phymem_page_wbinvd_hkid() to
allow the caller to pass in the TD's guest keyID and invoke
TDH_PHYMEM_PAGE_WBINVD to perform this action.
Before reusing the page, the host kernel needs to map the page with keyID 0
and invoke movdir64b() to convert the TD private page to a normal shared
page.
TDH.MEM.PAGE.REMOVE and TDH_PHYMEM_PAGE_WBINVD may encounter contention
inside the TDX module over TDX internal resources. To avoid staying in SEAM
mode for too long, the TDX module returns a BUSY error code to the kernel
instead of spinning on the locks. The caller may need to handle this error
in specific ways (e.g., retry). The wrappers return the SEAMCALL error code
directly to the caller. Don't attempt to handle it in the core kernel.
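As an illustration only (the real call site appears later in this series in
tdx_sept_remove_private_spte()), a caller might chain the new wrappers
roughly as below; everything other than the exported wrappers is
hypothetical:

/* Hypothetical sketch: remove one 4K private page after TLB tracking. */
static int example_remove_private_page(u64 tdr, u64 gpa, u64 level,
				       u64 hpa, u16 hkid)
{
	u64 entry, level_state, err;

	/* Requires a prior TDH.MEM.RANGE.BLOCK + TDH.MEM.TRACK cycle. */
	err = tdh_mem_page_remove(tdr, gpa, level, &entry, &level_state);
	if (err)
		return -EIO;	/* real callers retry on BUSY */

	/* Write back and invalidate cache lines tagged with the guest keyID. */
	err = tdh_phymem_page_wbinvd_hkid(hpa, hkid);
	if (err)
		return -EIO;

	/*
	 * Before the page is reused, it still has to be mapped with keyID 0
	 * and cleared via MOVDIR64B (KVM's tdx_clear_page() does this).
	 */
	return 0;
}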
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- split out TDH.MEM.PAGE.REMOVE, TDH_PHYMEM_PAGE_WBINVD and re-wrote the
patch msg (Yan).
- split out from original patch "KVM: TDX: Add C wrapper functions for
SEAMCALLs to the TDX module" and move to x86 core (Kai)
---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/virt/vmx/tdx/tdx.c | 27 +++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 1 +
3 files changed, 30 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 227cb334176e..bad47415894b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -143,8 +143,10 @@ u64 tdh_vp_wr(u64 tdvpr, u64 field, u64 data, u64 mask);
u64 tdh_vp_init_apicid(u64 tdvpr, u64 initial_rcx, u32 x2apicid);
u64 tdh_phymem_page_reclaim(u64 page, u64 *rcx, u64 *rdx, u64 *r8);
u64 tdh_mem_track(u64 tdr);
+u64 tdh_mem_page_remove(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx);
u64 tdh_phymem_cache_wb(bool resume);
u64 tdh_phymem_page_wbinvd_tdr(u64 tdr);
+u64 tdh_phymem_page_wbinvd_hkid(u64 hpa, u64 hkid);
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f7f83d86ec18..1b57486f2f06 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1847,6 +1847,23 @@ u64 tdh_mem_track(u64 tdr)
}
EXPORT_SYMBOL_GPL(tdh_mem_track);
+u64 tdh_mem_page_remove(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+ u64 ret;
+
+ ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_remove);
+
u64 tdh_phymem_cache_wb(bool resume)
{
struct tdx_module_args args = {
@@ -1866,3 +1883,13 @@ u64 tdh_phymem_page_wbinvd_tdr(u64 tdr)
return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
+
+u64 tdh_phymem_page_wbinvd_hkid(u64 hpa, u64 hkid)
+{
+ struct tdx_module_args args = {};
+
+ args.rcx = hpa | (hkid << boot_cpu_data.x86_phys_bits);
+
+ return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index e659eee1080a..505203a89238 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -35,6 +35,7 @@
#define TDH_PHYMEM_PAGE_RDMD 24
#define TDH_VP_RD 26
#define TDH_PHYMEM_PAGE_RECLAIM 28
+#define TDH_MEM_PAGE_REMOVE 29
#define TDH_SYS_KEY_CONFIG 31
#define TDH_SYS_INIT 33
#define TDH_SYS_RD 34
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 13/24] x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial contents
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (11 preceding siblings ...)
2024-11-12 7:36 ` [PATCH v2 12/24] x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page Yan Zhao
@ 2024-11-12 7:37 ` Yan Zhao
2024-11-12 7:37 ` [PATCH v2 14/24] KVM: TDX: Require TDP MMU and mmio caching for TDX Yan Zhao
` (13 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:37 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
The TDX module measures the TD during the build process and saves the
measurement in TDCS.MRTD to facilitate TD attestation of the initial
contents of the TD. Wrap the SEAMCALL TDH.MR.EXTEND with tdh_mr_extend()
and TDH.MR.FINALIZE with tdh_mr_finalize() to enable the host kernel to
assist the TDX module in performing the measurement.
The measurement in TDCS.MRTD is a SHA-384 digest of the build process.
SEAMCALLs TDH.MNG.INIT and TDH.MEM.PAGE.ADD initialize and contribute to
the MRTD digest calculation.
The caller of tdh_mr_extend() should break the TD private page into chunks
of size TDX_EXTENDMR_CHUNKSIZE and invoke tdh_mr_extend() to add the page
content into the digest calculation. Failures are possible with
TDH.MR.EXTEND (e.g., due to SEPT walking). The caller of tdh_mr_extend()
can check the function return value and retrieve extended error information
from the function output parameters.
Calling tdh_mr_finalize() completes the measurement. The TDX module then
turns the TD into the runnable state. Further TDH.MEM.PAGE.ADD and
TDH.MR.EXTEND calls will fail.
TDH.MR.FINALIZE may fail due to errors such as the TD having no vCPUs or
contention. Check the function return value when calling tdh_mr_finalize()
to determine the exact reason for failure. Take proper locks on the caller's
side to avoid contention failures, or handle the BUSY error in specific
ways (e.g., retry). Return the SEAMCALL error code directly to the caller.
Do not attempt to handle it in the core kernel.
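For illustration, a hypothetical sketch of how a caller might measure one
already-added 4K page and then finalize the TD. The 256-byte chunk size is
an assumption taken from the TDX module specification (the real constant,
TDX_EXTENDMR_CHUNKSIZE, is defined elsewhere in KVM), and the helper itself
is not part of this patch:

/* Hypothetical sketch: extend MRTD with one guest page, then finalize. */
#define EXAMPLE_EXTENDMR_CHUNKSIZE	256	/* assumed TDH.MR.EXTEND chunk */

static int example_measure_page(u64 tdr, u64 gpa)
{
	u64 entry, level_state, err;
	int i;

	for (i = 0; i < PAGE_SIZE / EXAMPLE_EXTENDMR_CHUNKSIZE; i++) {
		err = tdh_mr_extend(tdr, gpa + i * EXAMPLE_EXTENDMR_CHUNKSIZE,
				    &entry, &level_state);
		if (err)
			return -EIO;
	}
	return 0;
}

/* After all initial pages are added and measured: tdh_mr_finalize(tdr). */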
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- Rewrote the patch log (Yan).
uAPI breakout v2:
- Change to use 'u64' as function parameter to prepare to move
SEAMCALL wrappers to arch/x86. (Kai)
- Split to separate patch
- Move SEAMCALL wrappers from KVM to x86 core;
- Move TDH_xx macros from KVM to x86 core;
- Re-write log
uAPI breakout v1:
- Make argument to C wrapper function struct kvm_tdx * or
struct vcpu_tdx * .(Sean)
- Drop unused helpers (Kai)
- Fix bisectability issues in headers (Kai)
- Updates from seamcall overhaul (Kai)
v19:
- Update the commit message to match the patch by Yuan
- Use seamcall() and seamcall_ret() by paolo
v18:
- removed stub functions for __seamcall{,_ret}()
- Added Reviewed-by Binbin
- Make tdx_seamcall() use struct tdx_module_args instead of taking
each inputs.
v16:
- use struct tdx_module_args instead of struct tdx_module_output
- Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1.
---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/virt/vmx/tdx/tdx.c | 27 +++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 2 ++
3 files changed, 31 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index bad47415894b..fdc81799171e 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -133,6 +133,8 @@ u64 tdh_mng_key_config(u64 tdr);
u64 tdh_mng_create(u64 tdr, u64 hkid);
u64 tdh_vp_create(u64 tdr, u64 tdvpr);
u64 tdh_mng_rd(u64 tdr, u64 field, u64 *data);
+u64 tdh_mr_extend(u64 tdr, u64 gpa, u64 *rcx, u64 *rdx);
+u64 tdh_mr_finalize(u64 tdr);
u64 tdh_vp_flush(u64 tdvpr);
u64 tdh_mng_vpflushdone(u64 tdr);
u64 tdh_mng_key_freeid(u64 tdr);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1b57486f2f06..7e0574facfb0 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1713,6 +1713,33 @@ u64 tdh_mng_rd(u64 tdr, u64 field, u64 *data)
}
EXPORT_SYMBOL_GPL(tdh_mng_rd);
+u64 tdh_mr_extend(u64 tdr, u64 gpa, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa,
+ .rdx = tdr,
+ };
+ u64 ret;
+
+ ret = seamcall_ret(TDH_MR_EXTEND, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mr_extend);
+
+u64 tdh_mr_finalize(u64 tdr)
+{
+ struct tdx_module_args args = {
+ .rcx = tdr,
+ };
+
+ return seamcall(TDH_MR_FINALIZE, &args);
+}
+EXPORT_SYMBOL_GPL(tdh_mr_finalize);
+
u64 tdh_vp_flush(u64 tdvpr)
{
struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 505203a89238..4919d00025c9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -27,6 +27,8 @@
#define TDH_MNG_CREATE 9
#define TDH_VP_CREATE 10
#define TDH_MNG_RD 11
+#define TDH_MR_EXTEND 16
+#define TDH_MR_FINALIZE 17
#define TDH_VP_FLUSH 18
#define TDH_MNG_VPFLUSHDONE 19
#define TDH_MNG_KEY_FREEID 20
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 14/24] KVM: TDX: Require TDP MMU and mmio caching for TDX
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (12 preceding siblings ...)
2024-11-12 7:37 ` [PATCH v2 13/24] x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial contents Yan Zhao
@ 2024-11-12 7:37 ` Yan Zhao
2024-11-12 7:37 ` [PATCH v2 15/24] KVM: x86/mmu: Add setter for shadow_mmio_value Yan Zhao
` (12 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:37 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Disable TDX support when the TDP MMU or MMIO caching isn't supported.
As the TDP MMU has become mainstream, replacing the legacy MMU, legacy MMU
support for TDX isn't implemented.
TDX requires KVM MMIO caching. Without MMIO caching, KVM falls back to MMIO
emulation without installing SPTEs for MMIOs. However, a TDX guest is
protected, and KVM would hit errors when trying to decode instructions to
emulate MMIO for the guest. A TDX guest therefore relies on SPTEs being
installed for MMIOs, with no RWX bits and with the #VE suppress bit unset,
so that a #VE is injected into the guest. The TDX guest then issues a
TDVMCALL in its #VE handler to perform instruction decoding and have the
host do the MMIO emulation.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU part 2 v2:
- Added Paolo's rb.
TDX MMU part 2 v1:
- Addressed Binbin's comment by massaging Isaku's updated comments and
adding more explanations about introducing mmio caching.
- Addressed Sean's comments of v19 according to Isaku's update but
kept the warning for MOVDIR64B.
- Move code change in tdx_hardware_setup() to __tdx_bringup() since the
former has been removed.
---
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/vmx/main.c | 1 +
arch/x86/kvm/vmx/tdx.c | 8 +++-----
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3a338df541c1..e2f75c8145fd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -110,6 +110,7 @@ static bool __ro_after_init tdp_mmu_allowed;
#ifdef CONFIG_X86_64
bool __read_mostly tdp_mmu_enabled = true;
module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
+EXPORT_SYMBOL_GPL(tdp_mmu_enabled);
#endif
static int max_huge_page_level __read_mostly;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 3c292b4a063a..a34c0bebe1c3 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -3,6 +3,7 @@
#include "x86_ops.h"
#include "vmx.h"
+#include "mmu.h"
#include "nested.h"
#include "pmu.h"
#include "posted_intr.h"
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 38369cafc175..8832f76e4a22 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1412,16 +1412,14 @@ static int __init __tdx_bringup(void)
const struct tdx_sys_info_td_conf *td_conf;
int r;
+ if (!tdp_mmu_enabled || !enable_mmio_caching)
+ return -EOPNOTSUPP;
+
if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
pr_warn("MOVDIR64B is reqiured for TDX\n");
return -EOPNOTSUPP;
}
- if (!enable_ept) {
- pr_err("Cannot enable TDX with EPT disabled.\n");
- return -EINVAL;
- }
-
/*
* Enabling TDX requires enabling hardware virtualization first,
* as making SEAMCALLs requires CPU being in post-VMXON state.
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 15/24] KVM: x86/mmu: Add setter for shadow_mmio_value
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (13 preceding siblings ...)
2024-11-12 7:37 ` [PATCH v2 14/24] KVM: TDX: Require TDP MMU and mmio caching for TDX Yan Zhao
@ 2024-11-12 7:37 ` Yan Zhao
2024-11-12 7:37 ` [PATCH v2 16/24] KVM: TDX: Set per-VM shadow_mmio_value to 0 Yan Zhao
` (11 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:37 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Future changes will want to set shadow_mmio_value from TDX code. Add a
setter helper with a name that makes more sense from that context.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[split into new patch]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU part 2 v2:
- Added Paolo's rb
TDX MMU part 2 v1:
- Split into new patch
---
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/spte.c | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 398b6b06ed73..a935e65a133d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -78,6 +78,7 @@ static inline gfn_t kvm_mmu_max_gfn(void)
u8 kvm_mmu_get_max_tdp_level(void);
void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index f1a50a78badb..a831e76f379a 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -422,6 +422,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
+{
+ kvm->arch.shadow_mmio_value = mmio_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_value);
+
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
{
/* shadow_me_value must be a subset of shadow_me_mask */
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 16/24] KVM: TDX: Set per-VM shadow_mmio_value to 0
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (14 preceding siblings ...)
2024-11-12 7:37 ` [PATCH v2 15/24] KVM: x86/mmu: Add setter for shadow_mmio_value Yan Zhao
@ 2024-11-12 7:37 ` Yan Zhao
2024-11-12 7:37 ` [PATCH v2 17/24] KVM: TDX: Handle TLB tracking for TDX Yan Zhao
` (10 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:37 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Set per-VM shadow_mmio_value to 0 for TDX.
With enable_mmio_caching on, KVM installs MMIO SPTEs for TDs. To correctly
configure MMIO SPTEs, TDX requires the per-VM shadow_mmio_value to be set
to 0. This is necessary to override the default value of the suppress #VE
bit in the SPTE, which is 1, and to ensure the RWX bits are 0.
For an MMIO SPTE, the SPTE value changes as follows:
1. Initial value (suppress #VE bit is set)
2. Guest issues MMIO and triggers EPT violation
3. KVM updates SPTE value to MMIO value (suppress #VE bit is cleared)
4. Guest MMIO resumes. It triggers a #VE exception in the guest TD
5. Guest #VE handler issues TDG.VP.VMCALL<MMIO>
6. KVM handles MMIO
7. Guest #VE handler resumes its execution after the MMIO instruction
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU part 2 v2:
- Added Paolo's rb.
TDX MMU part 2 v1:
- Split from the big patch "KVM: TDX: TDP MMU TDX support".
- Remove warning for shadow_mmio_value
---
arch/x86/kvm/mmu/spte.c | 2 --
arch/x86/kvm/vmx/tdx.c | 15 ++++++++++++++-
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index a831e76f379a..817c68ad8bd5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -94,8 +94,6 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;
- WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
-
access &= shadow_mmio_access_mask;
spte |= vcpu->kvm->arch.shadow_mmio_value | access;
spte |= gpa | shadow_nonpresent_or_rsvd_mask;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8832f76e4a22..37696adb574c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,7 +5,7 @@
#include "mmu.h"
#include "x86_ops.h"
#include "tdx.h"
-
+#include "mmu/spte.h"
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -415,6 +415,19 @@ int tdx_vm_init(struct kvm *kvm)
kvm->arch.has_private_mem = true;
+ /*
+ * Because guest TD is protected, VMM can't parse the instruction in TD.
+ * Instead, guest uses MMIO hypercall. For unmodified device driver,
+ * #VE needs to be injected for MMIO and #VE handler in TD converts MMIO
+ * instruction into MMIO hypercall.
+ *
+ * SPTE value for MMIO needs to be setup so that #VE is injected into
+ * TD instead of triggering EPT MISCONFIG.
+ * - RWX=0 so that EPT violation is triggered.
+ * - suppress #VE bit is cleared to inject #VE.
+ */
+ kvm_mmu_set_mmio_spte_value(kvm, 0);
+
/*
* TDX has its own limit of maximum vCPUs it can support for all
* TDX guests in addition to KVM_MAX_VCPUS. TDX module reports
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 17/24] KVM: TDX: Handle TLB tracking for TDX
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (15 preceding siblings ...)
2024-11-12 7:37 ` [PATCH v2 16/24] KVM: TDX: Set per-VM shadow_mmio_value to 0 Yan Zhao
@ 2024-11-12 7:37 ` Yan Zhao
2024-11-12 7:38 ` [PATCH v2 18/24] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Yan Zhao
` (9 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:37 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Handle TLB tracking for TDX by introducing function tdx_track() for private
memory TLB tracking and implementing flush_tlb* hooks to flush TLBs for
shared memory.
Introduce function tdx_track() to do TLB tracking on private memory, which
basically does two things: calling TDH.MEM.TRACK to increase the TD epoch
and kicking off all vCPUs. The private EPT is then flushed when each vCPU
re-enters the TD. This function is temporarily unused in this patch and
will be called on a page-by-page basis on removal of a private guest page
in a later patch.
In earlier revisions, tdx_track() relied on an atomic counter to coordinate
the synchronization between the actions of kicking off vCPUs, incrementing
the TD epoch, and the vCPUs waiting for the incremented TD epoch after
being kicked off.
However, the core MMU only actually needs to call tdx_track() while already
holding the mmu_lock for write, so this synchronization is unnecessary.
vCPUs are kicked off only after the successful execution of TDH.MEM.TRACK,
eliminating the need for vCPUs to wait for TDH.MEM.TRACK completion after
being kicked off. tdx_track() is therefore able to send
KVM_REQ_OUTSIDE_GUEST_MODE requests rather than KVM_REQ_TLB_FLUSH.
Hooks for flush_remote_tlbs and flush_remote_tlbs_range are not necessary
for TDX, as tdx_track() handles TLB tracking of private memory on a
page-by-page basis when private guest pages are removed. There is no need
to invoke tdx_track() again in kvm_flush_remote_tlbs(), even after changes
to the mirrored page table.
For the hooks flush_tlb_current and flush_tlb_all, which are invoked during
kvm_mmu_load() and vCPU load for normal VMs, let the VMM flush all EPTs in
the two hooks for simplicity, since TDX does not depend on the two hooks to
notify the TDX module to flush the private EPT in those cases.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- No need for is_td_finalized() (Rick)
- Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
wrappers (Kai)
- Add TD state handling (Tony)
- Added tdx_flush_tlb_all() (Paolo)
- Re-explanined the reason to do global invalidation in
tdx_flush_tlb_current() (Yan)
TDX MMU part 2 v1:
- Split from the big patch "KVM: TDX: TDP MMU TDX support".
- Modification of synchronization mechanism in tdx_track().
- Dropped hooks flush_remote_tlb and flush_remote_tlbs_range.
- Let VMM to flush all EPTs in hooks flush_tlb_all and flush_tlb_current.
- Dropped KVM_BUG_ON() in vt_flush_tlb_gva(). (Rick)
---
arch/x86/kvm/vmx/main.c | 44 +++++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 81 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++
3 files changed, 125 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a34c0bebe1c3..4902d7bb86f3 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -99,6 +99,42 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_vcpu_reset(vcpu, init_event);
}
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_flush_tlb_all(vcpu);
+ return;
+ }
+
+ vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_flush_tlb_current(vcpu);
+ return;
+ }
+
+ vmx_flush_tlb_current(vcpu);
+}
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_flush_tlb_guest(vcpu);
+}
+
static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int pgd_level)
{
@@ -188,10 +224,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_rflags = vmx_set_rflags,
.get_if_flag = vmx_get_if_flag,
- .flush_tlb_all = vmx_flush_tlb_all,
- .flush_tlb_current = vmx_flush_tlb_current,
- .flush_tlb_gva = vmx_flush_tlb_gva,
- .flush_tlb_guest = vmx_flush_tlb_guest,
+ .flush_tlb_all = vt_flush_tlb_all,
+ .flush_tlb_current = vt_flush_tlb_current,
+ .flush_tlb_gva = vt_flush_tlb_gva,
+ .flush_tlb_guest = vt_flush_tlb_guest,
.vcpu_pre_run = vmx_vcpu_pre_run,
.vcpu_run = vmx_vcpu_run,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 37696adb574c..9eef361c8e57 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@
#include "mmu.h"
#include "x86_ops.h"
#include "tdx.h"
+#include "vmx.h"
#include "mmu/spte.h"
#undef pr_fmt
@@ -523,6 +524,51 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
}
+/*
+ * Ensure shared and private EPTs to be flushed on all vCPUs.
+ * tdh_mem_track() is the only caller that increases TD epoch. An increase in
+ * the TD epoch (e.g., to value "N + 1") is successful only if no vCPUs are
+ * running in guest mode with the value "N - 1".
+ *
+ * A successful execution of tdh_mem_track() ensures that vCPUs can only run in
+ * guest mode with TD epoch value "N" if no TD exit occurs after the TD epoch
+ * being increased to "N + 1".
+ *
+ * Kicking off all vCPUs after that further results in no vCPUs can run in guest
+ * mode with TD epoch value "N", which unblocks the next tdh_mem_track() (e.g.
+ * to increase TD epoch to "N + 2").
+ *
+ * TDX module will flush EPT on the next TD enter and make vCPUs to run in
+ * guest mode with TD epoch value "N + 1".
+ *
+ * kvm_make_all_cpus_request() guarantees all vCPUs are out of guest mode by
+ * waiting empty IPI handler ack_kick().
+ *
+ * No action is required to the vCPUs being kicked off since the kicking off
+ * occurs certainly after TD epoch increment and before the next
+ * tdh_mem_track().
+ */
+static void __always_unused tdx_track(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+
+ /* If TD isn't finalized, it's before any vcpu running. */
+ if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
+ return;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ do {
+ err = tdh_mem_track(kvm_tdx->tdr_pa);
+ } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+
+ if (KVM_BUG_ON(err, kvm))
+ pr_tdx_error(TDH_MEM_TRACK, err);
+
+ kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
@@ -1068,6 +1114,41 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return ret;
}
+void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+ /*
+ * flush_tlb_current() is invoked when the first time for the vcpu to
+ * run or when root of shared EPT is invalidated.
+ * KVM only needs to flush shared EPT because the TDX module handles TLB
+ * invalidation for private EPT in tdh_vp_enter();
+ *
+ * A single context invalidation for shared EPT can be performed here.
+ * However, this single context invalidation requires the private EPTP
+ * rather than the shared EPTP to flush shared EPT, as shared EPT uses
+ * private EPTP as its ASID for TLB invalidation.
+ *
+ * To avoid reading back private EPTP, perform a global invalidation for
+ * shared EPT instead to keep this function simple.
+ */
+ ept_sync_global();
+}
+
+void tdx_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+ /*
+ * TDX has called tdx_track() in tdx_sept_remove_private_spte() to
+ * ensure that private EPT will be flushed on the next TD enter. No need
+ * to call tdx_track() here again even when this callback is a result of
+ * zapping private EPT.
+ *
+ * Due to the lack of the context to determine which EPT has been
+ * affected by zapping, invoke invept() directly here for both shared
+ * EPT and private EPT for simplicity, though it's not necessary for
+ * private EPT.
+ */
+ ept_sync_global();
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f49135094c94..7151ac38bc31 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -130,6 +130,8 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
#else
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
@@ -143,6 +145,8 @@ static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
+static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
#endif
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 18/24] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (16 preceding siblings ...)
2024-11-12 7:37 ` [PATCH v2 17/24] KVM: TDX: Handle TLB tracking for TDX Yan Zhao
@ 2024-11-12 7:38 ` Yan Zhao
2024-11-12 7:38 ` [PATCH v2 19/24] KVM: TDX: Implement hook to get max mapping level of private pages Yan Zhao
` (8 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:38 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Implement hooks in TDX to propagate changes of the mirror page table to the
private EPT, including the addition and removal of page table pages and
guest pages.
TDX invokes the corresponding SEAMCALLs in the hooks.
- Hook link_external_spt
propagates the addition of a page table page into the private EPT.
- Hook set_external_spte
tdx_sept_set_private_spte() in this patch only handles the addition of a
guest private page when the TD is finalized.
Later patches will handle the case of adding guest private pages before
TD finalization.
- Hook free_external_spt
It is invoked when a page table page is removed from the mirror page table,
which currently must occur in the TD tear-down phase, after the hkid is
freed.
- Hook remove_external_spte
It is invoked when a guest private page is removed from the mirror page
table, which can occur while the TD is active, e.g. during shared <->
private conversion and slot move/deletion.
This hook is guaranteed to be triggered before the hkid is freed, because
the gmem fd is released, with all private leaf mappings zapped, before the
hkid is freed at VM destroy.
TDX invokes the following SEAMCALLs sequentially:
1) TDH.MEM.RANGE.BLOCK (removes RWX bits from a private EPT entry),
2) TDH.MEM.TRACK (increases the TD epoch),
3) TDH.MEM.PAGE.REMOVE (removes the private EPT entry and untracks the
guest page).
TDH.MEM.PAGE.REMOVE can't succeed without TDH.MEM.RANGE.BLOCK and
TDH.MEM.TRACK being called successfully. SEAMCALL TDH.MEM.TRACK is called
in function tdx_track() to enforce that TLB tracking will be performed by
the TDX module for the private EPT.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- No need for is_td_finalized() (Rick)
- Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
wrappers (Kai)
- Add TD state handling (Tony)
- Fix "KVM_MAP_MEMORY" comment (Binbin)
- Updated comment of KVM_BUG_ON() in tdx_sept_remove_private_spte
(Kai, Rick)
- Return -EBUSY on busy in tdx_mem_page_aug(). (Kai)
- Retry tdh_mem_page_aug() on TDX_OPERAND_BUSY instead of
TDX_ERROR_SEPT_BUSY. (Yan)
TDX MMU part 2 v1:
- Split from the big patch "KVM: TDX: TDP MMU TDX support".
- Move setting up the 4 callbacks (kvm_x86_ops::link_external_spt etc)
from tdx_hardware_setup() (which doesn't exist anymore) to
vt_hardware_setup() directly. Make tdx_sept_link_external_spt() those
4 callbacks global and add declarations to x86_ops.h so they can be
setup in vt_hardware_setup().
- Updated the KVM_BUG_ON() in tdx_sept_free_private_spt(). (Isaku, Binbin)
- Removed the unused tdx_post_mmu_map_page().
- Removed WARN_ON_ONCE) in tdh_mem_page_aug() according to Isaku's
feedback:
"This WARN_ON_ONCE() is a guard for buggy TDX module. It shouldn't return
(TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) when
SEPT_VE_DISABLED cleared. Maybe we should remove this WARN_ON_ONCE()
because the TDX module is mature."
- Update for the wrapper functions for SEAMCALLs. (Sean)
- Add preparation for KVM_TDX_INIT_MEM_REGION to make
tdx_sept_set_private_spte() callback nop when the guest isn't finalized.
- use unlikely(err) in tdx_reclaim_td_page().
- Updates from seamcall overhaul (Kai)
- Move header definitions from "KVM: TDX: Define TDX architectural
definitions" (Sean)
- Drop ugly unions (Sean)
- Remove tdx_mng_key_config_lock cleanup after dropped in "KVM: TDX:
create/destroy VM structure" (Chao)
- Since HKID is freed on vm_destroy() zapping only happens when HKID is
allocated. Remove relevant code in zapping handlers that assume the
opposite, and add some KVM_BUG_ON() to assert this where it was
missing. (Isaku)
---
arch/x86/kvm/vmx/main.c | 14 ++-
arch/x86/kvm/vmx/tdx.c | 219 +++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx_arch.h | 23 ++++
arch/x86/kvm/vmx/x86_ops.h | 37 ++++++
4 files changed, 291 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4902d7bb86f3..cb41e9a1d3e3 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -36,9 +36,21 @@ static __init int vt_hardware_setup(void)
* is KVM may allocate couple of more bytes than needed for
* each VM.
*/
- if (enable_tdx)
+ if (enable_tdx) {
vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
sizeof(struct kvm_tdx));
+ /*
+ * Note, TDX may fail to initialize in a later time in
+ * vt_init(), in which case it is not necessary to setup
+ * those callbacks. But making them valid here even
+ * when TDX fails to init later is fine because those
+ * callbacks won't be called if the VM isn't TDX guest.
+ */
+ vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
+ vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
+ vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
+ vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+ }
return 0;
}
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9eef361c8e57..29f01cff0e6b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -142,6 +142,14 @@ static DEFINE_MUTEX(tdx_lock);
static atomic_t nr_configured_hkid;
+#define TDX_ERROR_SEPT_BUSY (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
+
+static inline int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+ WARN_ON_ONCE(level == PG_LEVEL_NONE);
+ return level - 1;
+}
+
/* Maximum number of retries to attempt for SEAMCALLs. */
#define TDX_SEAMCALL_RETRIES 10000
@@ -524,6 +532,166 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
}
+static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ put_page(pfn_to_page(pfn));
+}
+
+static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ hpa_t hpa = pfn_to_hpa(pfn);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ u64 entry, level_state;
+ u64 err;
+
+ err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &entry, &level_state);
+ if (unlikely(err & TDX_OPERAND_BUSY)) {
+ tdx_unpin(kvm, pfn);
+ return -EBUSY;
+ }
+
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
+ tdx_unpin(kvm, pfn);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ /* TODO: handle large pages. */
+ if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ return -EINVAL;
+
+ /*
+ * Because guest_memfd doesn't support page migration with
+ * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
+ * migration. Until guest_memfd supports page migration, prevent page
+ * migration.
+ * TODO: Once guest_memfd introduces callback on page migration,
+ * implement it and remove get_page/put_page().
+ */
+ get_page(pfn_to_page(pfn));
+
+ if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
+ return tdx_mem_page_aug(kvm, gfn, level, pfn);
+
+ /*
+ * TODO: KVM_TDX_INIT_MEM_REGION support to populate before finalize
+ * comes here for the initial memory.
+ */
+ return -EOPNOTSUPP;
+}
+
+static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ hpa_t hpa = pfn_to_hpa(pfn);
+ u64 err, entry, level_state;
+
+ /* TODO: handle large pages. */
+ if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ return -EINVAL;
+
+ if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
+ return -EINVAL;
+
+ do {
+ /*
+ * When zapping private page, write lock is held. So no race
+ * condition with other vcpu sept operation. Race only with
+ * TDH.VP.ENTER.
+ */
+ err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
+ &level_state);
+ } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+
+ if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
+ err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
+ /*
+ * This page was mapped with KVM_MAP_MEMORY, but
+ * KVM_TDX_INIT_MEM_REGION is not issued yet.
+ */
+ if (!is_last_spte(entry, level) || !(entry & VMX_EPT_RWX_MASK)) {
+ tdx_unpin(kvm, pfn);
+ return 0;
+ }
+ }
+
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
+ return -EIO;
+ }
+
+ do {
+ /*
+ * TDX_OPERAND_BUSY can happen on locking PAMT entry. Because
+ * this page was removed above, other thread shouldn't be
+ * repeatedly operating on this page. Just retry loop.
+ */
+ err = tdh_phymem_page_wbinvd_hkid(hpa, (u16)kvm_tdx->hkid);
+ } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+ return -EIO;
+ }
+ tdx_clear_page(hpa);
+ tdx_unpin(kvm, pfn);
+ return 0;
+}
+
+int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ hpa_t hpa = __pa(private_spt);
+ u64 err, entry, level_state;
+
+ err = tdh_mem_sept_add(to_kvm_tdx(kvm)->tdr_pa, gpa, tdx_level, hpa, &entry,
+ &level_state);
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+ u64 err, entry, level_state;
+
+ /* For now large page isn't supported yet. */
+ WARN_ON_ONCE(level != PG_LEVEL_4K);
+
+ err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
+ return -EIO;
+ }
+ return 0;
+}
+
/*
* Ensure shared and private EPTs to be flushed on all vCPUs.
* tdh_mem_track() is the only caller that increases TD epoch. An increase in
@@ -548,7 +716,7 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
* occurs certainly after TD epoch increment and before the next
* tdh_mem_track().
*/
-static void __always_unused tdx_track(struct kvm *kvm)
+static void tdx_track(struct kvm *kvm)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
u64 err;
@@ -569,6 +737,55 @@ static void __always_unused tdx_track(struct kvm *kvm)
kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
}
+int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ /*
+ * free_external_spt() is only called after hkid is freed when TD is
+ * tearing down.
+ * KVM doesn't (yet) zap page table pages in mirror page table while
+ * TD is active, though guest pages mapped in mirror page table could be
+ * zapped during TD is active, e.g. for shared <-> private conversion
+ * and slot move/deletion.
+ */
+ if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
+ return -EINVAL;
+
+ /*
+ * The HKID assigned to this TD was already freed and cache was
+ * already flushed. We don't have to flush again.
+ */
+ return tdx_reclaim_page(__pa(private_spt));
+}
+
+int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ int ret;
+
+ /*
+ * HKID is released after all private pages have been removed, and set
+ * before any might be populated. Warn if zapping is attempted when
+ * there can't be anything populated in the private EPT.
+ */
+ if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
+ return -EINVAL;
+
+ ret = tdx_sept_zap_private_spte(kvm, gfn, level);
+ if (ret)
+ return ret;
+
+ /*
+ * TDX requires TLB tracking before dropping private page. Do
+ * it here, although it is also done later.
+ */
+ tdx_track(kvm);
+
+ return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index d80ec118834e..289728f1611f 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -155,6 +155,29 @@ struct td_params {
#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)
+/* Additional Secure EPT entry information */
+#define TDX_SEPT_LEVEL_MASK GENMASK_ULL(2, 0)
+#define TDX_SEPT_STATE_MASK GENMASK_ULL(15, 8)
+#define TDX_SEPT_STATE_SHIFT 8
+
+enum tdx_sept_entry_state {
+ TDX_SEPT_FREE = 0,
+ TDX_SEPT_BLOCKED = 1,
+ TDX_SEPT_PENDING = 2,
+ TDX_SEPT_PENDING_BLOCKED = 3,
+ TDX_SEPT_PRESENT = 4,
+};
+
+static inline u8 tdx_get_sept_level(u64 sept_entry_info)
+{
+ return sept_entry_info & TDX_SEPT_LEVEL_MASK;
+}
+
+static inline u8 tdx_get_sept_state(u64 sept_entry_info)
+{
+ return (sept_entry_info & TDX_SEPT_STATE_MASK) >> TDX_SEPT_STATE_SHIFT;
+}
+
#define MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM BIT_ULL(20)
/*
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7151ac38bc31..3e7e7d0eadbf 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -130,6 +130,15 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt);
+int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt);
+int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn);
+int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn);
+
void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
@@ -145,6 +154,34 @@ static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+static inline int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level,
+ void *private_spt)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level,
+ void *private_spt)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level,
+ kvm_pfn_t pfn)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level,
+ kvm_pfn_t pfn)
+{
+ return -EOPNOTSUPP;
+}
+
static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 19/24] KVM: TDX: Implement hook to get max mapping level of private pages
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (17 preceding siblings ...)
2024-11-12 7:38 ` [PATCH v2 18/24] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Yan Zhao
@ 2024-11-12 7:38 ` Yan Zhao
2024-11-12 7:38 ` [PATCH v2 20/24] KVM: x86/mmu: Export kvm_tdp_map_page() Yan Zhao
` (7 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:38 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Implement the hook private_max_mapping_level for TDX to let the TDP MMU
core get the max mapping level of private pages.
The value is hard-coded to 4K since huge pages are not supported for now.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU part 2 v2:
- Added Paolo's rb.
TDX MMU part 2 v1:
- Split from the big patch "KVM: TDX: TDP MMU TDX support".
- Fix missing tdx_gmem_private_max_mapping_level() implementation for
!CONFIG_INTEL_TDX_HOST
v19:
- Use gmem_max_level callback, delete tdp_max_page_level.
---
arch/x86/kvm/vmx/main.c | 10 ++++++++++
arch/x86/kvm/vmx/tdx.c | 5 +++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
3 files changed, 17 insertions(+)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index cb41e9a1d3e3..244fb80d385a 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -174,6 +174,14 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
return tdx_vcpu_ioctl(vcpu, argp);
}
+static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ if (is_td(kvm))
+ return tdx_gmem_private_max_mapping_level(kvm, pfn);
+
+ return 0;
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLED) | \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -329,6 +337,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.mem_enc_ioctl = vt_mem_enc_ioctl,
.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
+
+ .private_max_mapping_level = vt_gmem_private_max_mapping_level
};
struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 29f01cff0e6b..ead520083397 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1627,6 +1627,11 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
return ret;
}
+int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ return PG_LEVEL_4K;
+}
+
static int tdx_online_cpu(unsigned int cpu)
{
unsigned long flags;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 3e7e7d0eadbf..f61daac5f2f0 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -142,6 +142,7 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
#else
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
@@ -185,6 +186,7 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
+static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
#endif
#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 20/24] KVM: x86/mmu: Export kvm_tdp_map_page()
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (18 preceding siblings ...)
2024-11-12 7:38 ` [PATCH v2 19/24] KVM: TDX: Implement hook to get max mapping level of private pages Yan Zhao
@ 2024-11-12 7:38 ` Yan Zhao
2024-11-12 7:38 ` [PATCH v2 21/24] KVM: TDX: Add an ioctl to create initial guest memory Yan Zhao
` (6 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:38 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Rick Edgecombe <rick.p.edgecombe@intel.com>
In future changes, CoCo-specific code will need to call kvm_tdp_map_page()
from within its gmem_post_populate() callback. Export it so this can be
done from vendor-specific code. Since kvm_mmu_reload() will be needed for
this operation, export its callee kvm_mmu_load() as well.
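A minimal, hypothetical sketch of how a vendor gmem_post_populate() callback
might use the two exports; the error-code bits are an assumption about how a
private, final-level fault would be described and may differ from the
eventual TDX code:

static int example_map_private_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	/* Assumed flags for a private access that must map to a leaf SPTE. */
	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
	u8 level = PG_LEVEL_4K;
	int ret;

	/* Ensure a valid MMU root is loaded; calls kvm_mmu_load() if needed. */
	ret = kvm_mmu_reload(vcpu);
	if (ret)
		return ret;

	return kvm_tdp_map_page(vcpu, gpa, error_code, &level);
}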
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- Updated patch msg to mention kvm_mmu_load() is kvm_mmu_reload()'s callee
(Paolo)
TDX MMU part 2 v1:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2f75c8145fd..7157e87c5e07 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4782,6 +4782,7 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
return -EIO;
}
}
+EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
struct kvm_pre_fault_memory *range)
@@ -5805,6 +5806,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
out:
return r;
}
+EXPORT_SYMBOL_GPL(kvm_mmu_load);
void kvm_mmu_unload(struct kvm_vcpu *vcpu)
{
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 21/24] KVM: TDX: Add an ioctl to create initial guest memory
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (19 preceding siblings ...)
2024-11-12 7:38 ` [PATCH v2 20/24] KVM: x86/mmu: Export kvm_tdp_map_page() Yan Zhao
@ 2024-11-12 7:38 ` Yan Zhao
2024-11-27 18:08 ` Nikolay Borisov
2024-11-12 7:38 ` [PATCH v2 22/24] KVM: TDX: Finalize VM initialization Yan Zhao
` (5 subsequent siblings)
26 siblings, 1 reply; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:38 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add a new ioctl for the user space VMM to initialize guest memory with the
specified memory contents.
Because TDX protects the guest's memory, the creation of the initial guest
memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
directly copying the memory contents into the guest's memory as is done for
the default VM type.
Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
KVM_MEMORY_ENCRYPT_OP. Check if the GFN is already pre-allocated, assign
the guest page in Secure-EPT, copy the initial memory contents into the
guest memory, and encrypt the guest memory. Optionally, extend the memory
measurement of the TDX guest.
The ioctl uses the vCPU file descriptor because of the TDX module's
requirement that the memory is added to the S-EPT (via TDH.MEM.SEPT.ADD)
prior to initialization (TDH.MEM.PAGE.ADD). Accessing the MMU in turn
requires a vCPU file descriptor, just like for KVM_PRE_FAULT_MEMORY. In
fact, the post-populate callback is able to reuse the same logic used by
KVM_PRE_FAULT_MEMORY, so that userspace can do everything with a single
ioctl.
Note that this is the only way to invoke TDH.MEM.SEPT.ADD before the TD
is finalized, as userspace cannot use KVM_PRE_FAULT_MEMORY at that
point. This ensures that there cannot be pages in the S-EPT awaiting
TDH.MEM.PAGE.ADD, which would be treated incorrectly as spurious by
tdp_mmu_map_handle_target_level() (KVM would see the SPTE as PRESENT,
but the corresponding S-EPT entry will be !PRESENT).
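For illustration, a minimal sketch of the userspace side of this flow is shown
below. The file descriptor, buffer, GPA and page count are placeholders, the
struct kvm_tdx_cmd field names are assumed to follow the existing TDX uapi, and
<sys/ioctl.h>/<linux/kvm.h> includes are assumed.
static int tdx_init_mem_region(int vcpu_fd, void *src_buf, __u64 gpa, __u64 nr_pages)
{
	struct kvm_tdx_init_mem_region region = {
		.source_addr = (__u64)(unsigned long)src_buf,	/* page-aligned source buffer */
		.gpa = gpa,					/* page-aligned private GPA */
		.nr_pages = nr_pages,
	};
	struct kvm_tdx_cmd cmd = {
		.id = KVM_TDX_INIT_MEM_REGION,
		.flags = KVM_TDX_MEASURE_MEMORY_REGION,		/* optional: also extend the measurement */
		.data = (__u64)(unsigned long)&region,
	};

	/*
	 * vCPU-scoped: issued on the vCPU fd after KVM_TDX_INIT_VCPU and
	 * before KVM_TDX_FINALIZE_VM. 'region' is written back to reflect
	 * progress, so a partially completed call can be resumed.
	 */
	return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}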
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- Updated commit msg (Paolo)
- Added a guard around kvm_tdp_mmu_gpa_is_mapped() (Paolo).
- Remove checking kvm_mem_is_private() in tdx_gmem_post_populate (Rick)
- No need for is_td_finalized() (Rick)
- Remove decrement of nr_premapped (moved to "Finalize VM initialization"
patch) (Paolo)
- Take slots_lock before checking kvm_tdx->finalized in
tdx_vcpu_init_mem_region(), and use guard() (Paolo)
- Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
wrappers (Kai)
- Add TD state handling (Tony)
TDX MMU part 2 v1:
- Update the code according to latest gmem update.
https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/
- Fixup an alignment bug reported by Binbin.
- Rename KVM_MEMORY_MAPPING => KVM_MAP_MEMORY (Sean)
- Drop issuing TDH.MEM.PAGE.ADD() on KVM_MAP_MEMORY(), defer it to
KVM_TDX_INIT_MEM_REGION. (Sean)
- Added nr_premapped to track the number of premapped pages
- Drop tdx_post_mmu_map_page().
- Drop kvm_slot_can_be_private() check (Paolo)
- Use kvm_tdp_mmu_gpa_is_mapped() (Paolo)
v19:
- Switched to use KVM_MEMORY_MAPPING
- Dropped measurement extension
- updated commit message. private_page_add() => set_private_spte()
---
arch/x86/include/uapi/asm/kvm.h | 9 ++
arch/x86/kvm/vmx/tdx.c | 147 ++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 1 +
3 files changed, 157 insertions(+)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 36fa03376581..a19cd84cec76 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -931,6 +931,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
KVM_TDX_GET_CPUID,
KVM_TDX_CMD_NR_MAX,
@@ -985,4 +986,12 @@ struct kvm_tdx_init_vm {
struct kvm_cpuid2 cpuid;
};
+#define KVM_TDX_MEASURE_MEMORY_REGION _BITULL(0)
+
+struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ead520083397..15cedacd717a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1,4 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/cleanup.h>
#include <linux/cpu.h>
#include <asm/tdx.h>
#include "capabilities.h"
@@ -7,6 +8,7 @@
#include "tdx.h"
#include "vmx.h"
#include "mmu/spte.h"
+#include "common.h"
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -1597,6 +1599,148 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
return 0;
}
+struct tdx_gmem_post_populate_arg {
+ struct kvm_vcpu *vcpu;
+ __u32 flags;
+};
+
+static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
+ void __user *src, int order, void *_arg)
+{
+ u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct tdx_gmem_post_populate_arg *arg = _arg;
+ struct kvm_vcpu *vcpu = arg->vcpu;
+ gpa_t gpa = gfn_to_gpa(gfn);
+ u8 level = PG_LEVEL_4K;
+ struct page *page;
+ int ret, i;
+ u64 err, entry, level_state;
+
+ /*
+ * Get the source page if it has been faulted in. Return failure if the
+ * source page has been swapped out or unmapped in primary memory.
+ */
+ ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
+ if (ret < 0)
+ return ret;
+ if (ret != 1)
+ return -ENOMEM;
+
+ ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+ if (ret < 0)
+ goto out;
+
+ /*
+ * The private mem cannot be zapped after kvm_tdp_map_page()
+ * because all paths are covered by slots_lock and the
+ * filemap invalidate lock. Check that they are indeed enough.
+ */
+ if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
+ scoped_guard(read_lock, &kvm->mmu_lock) {
+ if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
+ ret = -EIO;
+ goto out;
+ }
+ }
+ }
+
+ ret = 0;
+ do {
+ err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
+ pfn_to_hpa(page_to_pfn(page)),
+ &entry, &level_state);
+ } while (err == TDX_ERROR_SEPT_BUSY);
+ if (err) {
+ ret = -EIO;
+ goto out;
+ }
+
+ if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
+ for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+ err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &entry,
+ &level_state);
+ if (err) {
+ ret = -EIO;
+ break;
+ }
+ }
+ }
+
+out:
+ put_page(page);
+ return ret;
+}
+
+static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_tdx_init_mem_region region;
+ struct tdx_gmem_post_populate_arg arg;
+ long gmem_ret;
+ int ret;
+
+ if (tdx->state != VCPU_TD_STATE_INITIALIZED)
+ return -EINVAL;
+
+ guard(mutex)(&kvm->slots_lock);
+
+ /* Once TD is finalized, the initial guest memory is fixed. */
+ if (kvm_tdx->state == TD_STATE_RUNNABLE)
+ return -EINVAL;
+
+ if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION)
+ return -EINVAL;
+
+ if (copy_from_user(®ion, u64_to_user_ptr(cmd->data), sizeof(region)))
+ return -EFAULT;
+
+ if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
+ !region.nr_pages ||
+ region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
+ !vt_is_tdx_private_gpa(kvm, region.gpa) ||
+ !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
+ return -EINVAL;
+
+ kvm_mmu_reload(vcpu);
+ ret = 0;
+ while (region.nr_pages) {
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+
+ arg = (struct tdx_gmem_post_populate_arg) {
+ .vcpu = vcpu,
+ .flags = cmd->flags,
+ };
+ gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
+ u64_to_user_ptr(region.source_addr),
+ 1, tdx_gmem_post_populate, &arg);
+ if (gmem_ret < 0) {
+ ret = gmem_ret;
+ break;
+ }
+
+ if (gmem_ret != 1) {
+ ret = -EIO;
+ break;
+ }
+
+ region.source_addr += PAGE_SIZE;
+ region.gpa += PAGE_SIZE;
+ region.nr_pages--;
+
+ cond_resched();
+ }
+
+ if (copy_to_user(u64_to_user_ptr(cmd->data), ®ion, sizeof(region)))
+ ret = -EFAULT;
+ return ret;
+}
+
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -1616,6 +1760,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
case KVM_TDX_INIT_VCPU:
ret = tdx_vcpu_init(vcpu, &cmd);
break;
+ case KVM_TDX_INIT_MEM_REGION:
+ ret = tdx_vcpu_init_mem_region(vcpu, &cmd);
+ break;
case KVM_TDX_GET_CPUID:
ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
break;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 152afe67a00b..5901d03e372c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2600,6 +2600,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
return NULL;
}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
{
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 22/24] KVM: TDX: Finalize VM initialization
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (20 preceding siblings ...)
2024-11-12 7:38 ` [PATCH v2 21/24] KVM: TDX: Add an ioctl to create initial guest memory Yan Zhao
@ 2024-11-12 7:38 ` Yan Zhao
2024-12-24 14:31 ` Paolo Bonzini
2024-11-12 7:38 ` [PATCH v2 23/24] KVM: TDX: Handle vCPU dissociation Yan Zhao
` (4 subsequent siblings)
26 siblings, 1 reply; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:38 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Introduce a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
The API documentation is provided in a separate patch:
“Documentation/virt/kvm: Document on Trust Domain Extensions (TDX)”.
Enhance TDX’s set_external_spte() hook to record the pre-mapping count
instead of returning without action when the TD is not finalized.
Adjust the pre-mapping count when pages are added or if the mapping is
dropped.
Set pre_fault_allowed to true after the finalization is complete.
Note: TD Measurement Finalization is the process by which the initial state
of the TDX VM is measured for attestation purposes. It uses the SEAMCALL
TDH.MR.FINALIZE, after which:
1. The VMM can no longer add TD private pages with arbitrary content.
2. The TDX VM becomes runnable.
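For illustration, a minimal sketch of the corresponding userspace step is shown
below. The VM fd is a placeholder, the struct kvm_tdx_cmd field names are
assumed to follow the existing TDX uapi, and <sys/ioctl.h>/<linux/kvm.h>
includes are assumed.
static int tdx_finalize_vm(int vm_fd)
{
	struct kvm_tdx_cmd cmd = {
		.id = KVM_TDX_FINALIZE_VM,
	};

	/*
	 * VM-scoped: issued on the VM fd once all initial memory regions have
	 * been added via KVM_TDX_INIT_MEM_REGION. -EAGAIN may be returned on
	 * TDX module lock contention; on other failures cmd.hw_error carries
	 * the raw TDH.MR.FINALIZE status.
	 */
	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}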
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2
- Merge changes from patch "KVM: TDX: Premap initial guest memory" into
this patch (Paolo)
- Consolidate nr_premapped counting into this patch (Paolo)
- Page level check should be (and is) in tdx_sept_set_private_spte() in
patch "KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror
page table" not in tdx_mem_page_record_premap_cnt() (Paolo)
- Protect finalization using kvm->slots_lock (Paolo)
- Set kvm->arch.pre_fault_allowed to true after finalization is done
(Paolo)
- Add a memory barrier to ensure correct ordering of the updates to
kvm_tdx->finalized and kvm->arch.pre_fault_allowed (Adrian)
- pre_fault_allowed must not be true before finalization is done.
Highlight that fact by checking it in tdx_mem_page_record_premap_cnt()
(Adrian)
- No need for is_td_finalized() (Rick)
- Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
wrappers (Kai)
- Add nr_premapped where it's first used (Tao)
TDX MMU part 2 v1:
- Added premapped check.
- Update for the wrapper functions for SEAMCALLs. (Sean)
- Add check if nr_premapped is zero. If not, return error.
- Use KVM_BUG_ON() in tdx_td_finalizer() for consistency.
- Change tdx_td_finalizemr() to take struct kvm_tdx_cmd *cmd and return error
(Adrian)
- Handle TDX_OPERAND_BUSY case (Adrian)
- Updates from seamcall overhaul (Kai)
- Rename error->hw_error
v18:
- Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
v15:
- removed unconditional tdx_track() by tdx_flush_tlb_current() that
does tdx_track().
---
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 78 ++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx/tdx.h | 3 ++
3 files changed, 75 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a19cd84cec76..eee6de05f261 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -932,6 +932,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
KVM_TDX_GET_CPUID,
KVM_TDX_CMD_NR_MAX,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 15cedacd717a..acaa11be1031 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -563,6 +563,31 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
return 0;
}
+/*
+ * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to get guest pages and
+ * tdx_gmem_post_populate() to premap the guest pages into the private EPT.
+ * Mapping guest pages into the private EPT before the TD is finalized must be
+ * done via the SEAMCALL TDH.MEM.PAGE.ADD(), which copies the page contents
+ * from a user-provided source page into the target guest page. That source
+ * page is not available via the common interface kvm_tdp_map_page(), so
+ * kvm_tdp_map_page() currently only premaps guest pages into the KVM mirrored
+ * root.
+ * The counter nr_premapped is incremented here to record this state; it is
+ * decremented after TDH.MEM.PAGE.ADD() is called in tdx_gmem_post_populate().
+ */
+static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
+ return -EINVAL;
+
+ /* nr_premapped will be decreased when tdh_mem_page_add() is called. */
+ atomic64_inc(&kvm_tdx->nr_premapped);
+ return 0;
+}
+
int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
@@ -582,14 +607,15 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
*/
get_page(pfn_to_page(pfn));
+ /*
+ * To match the ordering of kvm_tdx->state and 'pre_fault_allowed' set in
+ * tdx_td_finalizemr().
+ */
+ smp_rmb();
if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
return tdx_mem_page_aug(kvm, gfn, level, pfn);
- /*
- * TODO: KVM_TDX_INIT_MEM_REGION support to populate before finalize
- * comes here for the initial memory.
- */
- return -EOPNOTSUPP;
+ return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
}
static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -621,10 +647,12 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
/*
- * This page was mapped with KVM_MAP_MEMORY, but
- * KVM_TDX_INIT_MEM_REGION is not issued yet.
+ * The page was premapped by KVM_TDX_INIT_MEM_REGION, but
+ * tdh_mem_page_add() hasn't been called for it yet.
*/
if (!is_last_spte(entry, level) || !(entry & VMX_EPT_RWX_MASK)) {
+ WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
+ atomic64_dec(&kvm_tdx->nr_premapped);
tdx_unpin(kvm, pfn);
return 0;
}
@@ -1368,6 +1396,36 @@ void tdx_flush_tlb_all(struct kvm_vcpu *vcpu)
ept_sync_global();
}
+static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ guard(mutex)(&kvm->slots_lock);
+
+ if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
+ return -EINVAL;
+ /*
+ * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
+ * TDH.MEM.PAGE.ADD().
+ */
+ if (atomic64_read(&kvm_tdx->nr_premapped))
+ return -EINVAL;
+
+ cmd->hw_error = tdh_mr_finalize(kvm_tdx->tdr_pa);
+ if ((cmd->hw_error & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY)
+ return -EAGAIN;
+ if (KVM_BUG_ON(cmd->hw_error, kvm)) {
+ pr_tdx_error(TDH_MR_FINALIZE, cmd->hw_error);
+ return -EIO;
+ }
+
+ kvm_tdx->state = TD_STATE_RUNNABLE;
+ /* TD_STATE_RUNNABLE must be set before 'pre_fault_allowed' */
+ smp_wmb();
+ kvm->arch.pre_fault_allowed = true;
+ return 0;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -1392,6 +1450,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_INIT_VM:
r = tdx_td_init(kvm, &tdx_cmd);
break;
+ case KVM_TDX_FINALIZE_VM:
+ r = tdx_td_finalizemr(kvm, &tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -1656,6 +1717,9 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
goto out;
}
+ WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
+ atomic64_dec(&kvm_tdx->nr_premapped);
+
if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &entry,
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 727bcf25d731..aeddf2bb0a94 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -32,6 +32,9 @@ struct kvm_tdx {
u64 tsc_offset;
enum kvm_tdx_state state;
+
+ /* For KVM_TDX_INIT_MEM_REGION. */
+ atomic64_t nr_premapped;
};
/* TDX module vCPU states */
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 23/24] KVM: TDX: Handle vCPU dissociation
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (21 preceding siblings ...)
2024-11-12 7:38 ` [PATCH v2 22/24] KVM: TDX: Finalize VM initialization Yan Zhao
@ 2024-11-12 7:38 ` Yan Zhao
2024-11-12 7:39 ` [PATCH v2 24/24] [HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT Yan Zhao
` (3 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:38 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Isaku Yamahata <isaku.yamahata@intel.com>
Handle vCPU dissociation by invoking SEAMCALL TDH.VP.FLUSH, which flushes the
address translation caches and the cached TD VMCS of a TD vCPU on its
associated pCPU.
In TDX, a vCPU can only be associated with one pCPU at a time; the association
is established by invoking SEAMCALL TDH.VP.ENTER. For a new association to
succeed, the vCPU must first be dissociated from its previously associated pCPU.
To facilitate vCPU dissociation, introduce a per-pCPU list
associated_tdvcpus. Add a vCPU into this list when it's loaded into a new
pCPU (i.e. when a vCPU is loaded for the first time or migrated to a new
pCPU).
vCPU dissociation can happen under the following conditions:
- When the hardware_disable op is called.
This op is called when virtualization is disabled on a given pCPU, e.g.
when a pCPU is hot-unplugged or on machine shutdown/suspend.
In this case, dissociate all vCPUs from the pCPU by iterating its
per-pCPU list associated_tdvcpus.
- On vCPU migration to a new pCPU.
Before adding a vCPU to the associated_tdvcpus list of the new pCPU, it
must be dissociated from its old pCPU, which is done by issuing an IPI
and executing SEAMCALL TDH.VP.FLUSH on the old pCPU.
On a successful dissociation, the vCPU is removed from the
associated_tdvcpus list of its previously associated pCPU.
- When tdx_mmu_release_hkid() is called.
TDX mandates that all vCPUs be dissociated prior to releasing an HKID.
Therefore, all vCPUs must be dissociated before executing the SEAMCALL
TDH.MNG.VPFLUSHDONE and subsequently freeing the HKID.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- No need for is_td_vcpu_created() (Rick)
- Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
wrappers (Kai)
- Rename vt_hardware_disable() and tdx_hardware_disable() to track
upstream changes
- Updated the comment of per-cpu list (Yan)
- Added an assertion
KVM_BUG_ON(cpu != raw_smp_processor_id(), vcpu->kvm) in tdx_vcpu_load().
(Yan)
TDX MMU part 2 v1:
- Changed title to "KVM: TDX: Handle vCPU dissociation" .
- Updated commit log.
- Removed calling tdx_disassociate_vp_on_cpu() in tdx_vcpu_free() since
no new TD enter would be called for vCPU association after
tdx_mmu_release_hkid(), which is now called in vt_vm_destroy(), i.e.
after releasing vcpu fd and kvm_unload_vcpu_mmus(), and before
tdx_vcpu_free().
- TODO: include Isaku's fix
https://eclists.intel.com/sympa/arc/kvm-qemu-review/2024-07/msg00359.html
- Update for the wrapper functions for SEAMCALLs. (Sean)
- Removed unnecessary pr_err() in tdx_flush_vp_on_cpu().
- Use KVM_BUG_ON() in tdx_flush_vp_on_cpu() for consistency.
- Capitalize the first word of the title. (Binbin)
- Minor fixes in changelog. (Binbin, Reinette(internal))
- Fix some comments. (Binbin, Reinette(internal))
- Rename arg_ to _arg (Binbin)
- Updates from seamcall overhaul (Kai)
- Remove lockdep_assert_preemption_disabled() in tdx_hardware_setup()
since now hardware_enable() is not called via SMP func call anymore,
but (per-cpu) CPU hotplug thread
- Use KVM_BUG_ON() for SEAMCALLs in tdx_mmu_release_hkid() (Kai)
- Update based on upstream commit "KVM: x86: Fold kvm_arch_sched_in()
into kvm_arch_vcpu_load()"
- Eliminate TDX_FLUSHVP_NOT_DONE error check because vCPUs were all freed.
So the error won't happen. (Sean)
---
arch/x86/kvm/vmx/main.c | 22 ++++-
arch/x86/kvm/vmx/tdx.c | 159 +++++++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.h | 2 +
arch/x86/kvm/vmx/x86_ops.h | 4 +
4 files changed, 177 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 244fb80d385a..bfed421e6fbb 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -10,6 +10,14 @@
#include "tdx.h"
#include "tdx_arch.h"
+static void vt_disable_virtualization_cpu(void)
+{
+ /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+ if (enable_tdx)
+ tdx_disable_virtualization_cpu();
+ vmx_disable_virtualization_cpu();
+}
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -111,6 +119,16 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_vcpu_reset(vcpu, init_event);
}
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_load(vcpu, cpu);
+ return;
+ }
+
+ vmx_vcpu_load(vcpu, cpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -199,7 +217,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_unsetup = vmx_hardware_unsetup,
.enable_virtualization_cpu = vmx_enable_virtualization_cpu,
- .disable_virtualization_cpu = vmx_disable_virtualization_cpu,
+ .disable_virtualization_cpu = vt_disable_virtualization_cpu,
.emergency_disable_virtualization_cpu = vmx_emergency_disable_virtualization_cpu,
.has_emulated_msr = vmx_has_emulated_msr,
@@ -216,7 +234,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_reset = vt_vcpu_reset,
.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
+ .vcpu_load = vt_vcpu_load,
.vcpu_put = vmx_vcpu_put,
.update_exception_bitmap = vmx_update_exception_bitmap,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index acaa11be1031..dc6c5f40608e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -155,6 +155,21 @@ static inline int pg_level_to_tdx_sept_level(enum pg_level level)
/* Maximum number of retries to attempt for SEAMCALLs. */
#define TDX_SEAMCALL_RETRIES 10000
+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU.
+ * Protected by interrupt mask. Only manipulated by the CPU owning this per-CPU
+ * list.
+ * - When a vCPU is loaded onto a CPU, it is removed from the per-CPU list of
+ * the old CPU during the IPI callback running on the old CPU, and then added
+ * to the per-CPU list of the new CPU.
+ * - When a TD is tearing down, all vCPUs are disassociated from their current
+ * running CPUs and removed from the per-CPU list during the IPI callback
+ * running on those CPUs.
+ * - When a CPU is brought down, traverse the per-CPU list to disassociate all
+ * associated TD vCPUs and remove them from the per-CPU list.
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
{
return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
@@ -172,6 +187,22 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
return kvm_tdx->hkid > 0;
}
+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+ lockdep_assert_irqs_disabled();
+
+ list_del(&to_tdx(vcpu)->cpu_list);
+
+ /*
+ * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
+ * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
+ * to its list before it's deleted from this CPU's list.
+ */
+ smp_wmb();
+
+ vcpu->cpu = -1;
+}
+
static void tdx_clear_page(unsigned long page_pa)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -252,6 +283,83 @@ static void tdx_reclaim_control_page(unsigned long ctrl_page_pa)
free_page((unsigned long)__va(ctrl_page_pa));
}
+struct tdx_flush_vp_arg {
+ struct kvm_vcpu *vcpu;
+ u64 err;
+};
+
+static void tdx_flush_vp(void *_arg)
+{
+ struct tdx_flush_vp_arg *arg = _arg;
+ struct kvm_vcpu *vcpu = arg->vcpu;
+ u64 err;
+
+ arg->err = 0;
+ lockdep_assert_irqs_disabled();
+
+ /* Task migration can race with CPU offlining. */
+ if (unlikely(vcpu->cpu != raw_smp_processor_id()))
+ return;
+
+ /*
+ * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
+ * list tracking still needs to be updated so that it's correct if/when
+ * the vCPU does get initialized.
+ */
+ if (to_tdx(vcpu)->state != VCPU_TD_STATE_UNINITIALIZED) {
+ /*
+ * No need to retry. TDX Resources needed for TDH.VP.FLUSH are:
+ * TDVPR as exclusive, TDR as shared, and TDCS as shared. This
+ * VP flush function is called when destroying a vCPU/TD or during vCPU
+ * migration. No other thread uses TDVPR in those cases.
+ */
+ err = tdh_vp_flush(to_tdx(vcpu)->tdvpr_pa);
+ if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+ /*
+ * This function is called in IPI context. Do not use
+ * printk to avoid console semaphore.
+ * The caller prints out the error message, instead.
+ */
+ if (err)
+ arg->err = err;
+ }
+ }
+
+ tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+ struct tdx_flush_vp_arg arg = {
+ .vcpu = vcpu,
+ };
+ int cpu = vcpu->cpu;
+
+ if (unlikely(cpu == -1))
+ return;
+
+ smp_call_function_single(cpu, tdx_flush_vp, &arg, 1);
+ if (KVM_BUG_ON(arg.err, vcpu->kvm))
+ pr_tdx_error(TDH_VP_FLUSH, arg.err);
+}
+
+void tdx_disable_virtualization_cpu(void)
+{
+ int cpu = raw_smp_processor_id();
+ struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+ struct tdx_flush_vp_arg arg;
+ struct vcpu_tdx *tdx, *tmp;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+ list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list) {
+ arg.vcpu = &tdx->vcpu;
+ tdx_flush_vp(&arg);
+ }
+ local_irq_restore(flags);
+}
+
static void smp_func_do_phymem_cache_wb(void *unused)
{
u64 err = 0;
@@ -288,22 +396,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
bool packages_allocated, targets_allocated;
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
cpumask_var_t packages, targets;
- u64 err;
+ struct kvm_vcpu *vcpu;
+ unsigned long j;
int i;
+ u64 err;
if (!is_hkid_assigned(kvm_tdx))
return;
- /* KeyID has been allocated but guest is not yet configured */
- if (!kvm_tdx->tdr_pa) {
- tdx_hkid_free(kvm_tdx);
- return;
- }
-
packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
cpus_read_lock();
+ kvm_for_each_vcpu(j, vcpu, kvm)
+ tdx_flush_vp_on_cpu(vcpu);
+
/*
* TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global lock
* and can fail with TDX_OPERAND_BUSY when it fails to get the lock.
@@ -317,6 +424,16 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
* After the above flushing vps, there should be no more vCPU
* associations, as all vCPU fds have been released at this stage.
*/
+ err = tdh_mng_vpflushdone(kvm_tdx->tdr_pa);
+ if (err == TDX_FLUSHVP_NOT_DONE)
+ goto out;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MNG_VPFLUSHDONE, err);
+ pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
+ kvm_tdx->hkid);
+ goto out;
+ }
+
for_each_online_cpu(i) {
if (packages_allocated &&
cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -342,6 +459,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
tdx_hkid_free(kvm_tdx);
}
+out:
mutex_unlock(&tdx_lock);
cpus_read_unlock();
free_cpumask_var(targets);
@@ -489,6 +607,27 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return 0;
}
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (vcpu->cpu == cpu)
+ return;
+
+ tdx_flush_vp_on_cpu(vcpu);
+
+ KVM_BUG_ON(cpu != raw_smp_processor_id(), vcpu->kvm);
+ local_irq_disable();
+ /*
+ * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+ * vcpu->cpu is read before tdx->cpu_list.
+ */
+ smp_rmb();
+
+ list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+ local_irq_enable();
+}
+
void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -1937,7 +2076,7 @@ static int __init __do_tdx_bringup(void)
static int __init __tdx_bringup(void)
{
const struct tdx_sys_info_td_conf *td_conf;
- int r;
+ int r, i;
if (!tdp_mmu_enabled || !enable_mmio_caching)
return -EOPNOTSUPP;
@@ -1947,6 +2086,10 @@ static int __init __tdx_bringup(void)
return -EOPNOTSUPP;
}
+ /* tdx_disable_virtualization_cpu() uses associated_tdvcpus. */
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i));
+
/*
* Enabling TDX requires enabling hardware virtualization first,
* as making SEAMCALLs requires CPU being in post-VMXON state.
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index aeddf2bb0a94..899654519df6 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -49,6 +49,8 @@ struct vcpu_tdx {
unsigned long tdvpr_pa;
unsigned long *tdcx_pa;
+ struct list_head cpu_list;
+
enum vcpu_tdx_state state;
};
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f61daac5f2f0..06583b1afa4f 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -119,6 +119,7 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
void vmx_setup_mce(struct kvm_vcpu *vcpu);
#ifdef CONFIG_INTEL_TDX_HOST
+void tdx_disable_virtualization_cpu(void);
int tdx_vm_init(struct kvm *kvm);
void tdx_mmu_release_hkid(struct kvm *kvm);
void tdx_vm_free(struct kvm *kvm);
@@ -127,6 +128,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -144,6 +146,7 @@ void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
#else
+static inline void tdx_disable_virtualization_cpu(void) {}
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
static inline void tdx_vm_free(struct kvm *kvm) {}
@@ -152,6 +155,7 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 24/24] [HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (22 preceding siblings ...)
2024-11-12 7:38 ` [PATCH v2 23/24] KVM: TDX: Handle vCPU dissociation Yan Zhao
@ 2024-11-12 7:39 ` Yan Zhao
2024-11-21 11:51 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Yan Zhao
` (2 subsequent siblings)
26 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-12 7:39 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
From: Yuan Yao <yuan.yao@intel.com>
Temporarily retry in the SEAMCALL wrappers when the TDX module returns
TDX_OPERAND_BUSY with operand SEPT.
The TDX module has many internal locks to protect its resources. To avoid
staying in SEAM mode for too long, SEAMCALLs will return a TDX_OPERAND_BUSY
error code to the kernel instead of spinning on the locks.
Usually, callers of the SEAMCALL wrappers can avoid contentions by
implementing proper locks on their side. For example, KVM can efficiently
avoid the TDX module's lock contentions for resources like TDR, TDCS, KOT,
and TDVPR by taking locks within KVM or making a resource per-thread.
However, for performance reasons, callers like KVM may not want to use
exclusive locks to avoid internal contentions on the SEPT tree within the
TDX module. For instance, KVM allows TDH.VP.ENTER to run concurrently with
TDH.MEM.SEPT.ADD, TDH.MEM.PAGE.AUG, and TDH.MEM.PAGE.REMOVE.
Resources    SHARED users            EXCLUSIVE users
------------------------------------------------------------------------
SEPT tree    TDH.MEM.SEPT.ADD        TDH.VP.ENTER
             TDH.MEM.PAGE.AUG        TDH.MEM.SEPT.REMOVE
             TDH.MEM.PAGE.REMOVE     TDH.MEM.RANGE.BLOCK
Inside the TDX module, although TDH.VP.ENTER only acquires an exclusive
lock on the SEPT tree when zero-step mitigation is triggered, it is still
possible to encounter TDX_OPERAND_BUSY with operand SEPT in KVM. Retry in
the SEAMCALL wrappers temporarily until KVM either retries on the caller
side or finds a way to avoid the contentions.
Note:
The wrappers retry at most 16 times on TDX_OPERAND_BUSY with operand
SEPT; needing more than 16 retries is rare.
SEAMCALLs TDH.MEM.* can also contend with TDCALL TDG.MEM.PAGE.ACCEPT,
returning TDX_OPERAND_BUSY without operand SEPT. Do not retry in the
SEAMCALL wrappers for such rare errors; let the callers handle them.
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
TDX MMU part 2 v2:
- Updates the patch log. (Yan)
TDX MMU part 2 v1:
- Updates from seamcall overhaul (Kai)
v19:
- fix typo TDG.VP.ENTER => TDH.VP.ENTER,
TDX_OPRRAN_BUSY => TDX_OPERAND_BUSY
- drop the description on TDH.VP.ENTER as this patch doesn't touch
TDH.VP.ENTER
---
arch/x86/virt/vmx/tdx/tdx.c | 47 +++++++++++++++++++++++++++++++++----
1 file changed, 42 insertions(+), 5 deletions(-)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 7e0574facfb0..04cb2f1d6deb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1563,6 +1563,43 @@ void tdx_guest_keyid_free(unsigned int keyid)
}
EXPORT_SYMBOL_GPL(tdx_guest_keyid_free);
+/*
+ * The TDX module acquires internal locks to protect its resources. It doesn't
+ * spin on those locks because of its restrictions on allowed execution time;
+ * instead, it returns TDX_OPERAND_BUSY with an operand id.
+ *
+ * Multiple vCPUs can operate on the SEPT. In addition, with zero-step attack
+ * mitigation, TDH.VP.ENTER may occasionally acquire the SEPT lock and release
+ * it when a zero-step attack is suspected, causing TDH.MEM.* operations to
+ * return TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT. Note: TDH.MEM.TRACK is an
+ * exception.
+ *
+ * Because the TDP MMU uses a read lock for scalability, spinning around the
+ * SEAMCALL would defeat that effort. Retry several times on the assumption
+ * that SEPT lock contention is rare, but don't loop forever; let the TDP MMU retry.
+ */
+#define TDX_OPERAND_BUSY 0x8000020000000000ULL
+#define TDX_OPERAND_ID_SEPT 0x92
+
+#define TDX_ERROR_SEPT_BUSY (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
+
+static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
+{
+#define SEAMCALL_RETRY_MAX 16
+ struct tdx_module_args args_in;
+ int retry = SEAMCALL_RETRY_MAX;
+ u64 ret;
+
+ do {
+ args_in = *in;
+ ret = seamcall_ret(op, in);
+ } while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
+
+ *in = args_in;
+
+ return ret;
+}
+
u64 tdh_mng_addcx(u64 tdr, u64 tdcs)
{
struct tdx_module_args args = {
@@ -1586,7 +1623,7 @@ u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx)
u64 ret;
clflush_cache_range(__va(hpa), PAGE_SIZE);
- ret = seamcall_ret(TDH_MEM_PAGE_ADD, &args);
+ ret = tdx_seamcall_sept(TDH_MEM_PAGE_ADD, &args);
*rcx = args.rcx;
*rdx = args.rdx;
@@ -1605,7 +1642,7 @@ u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx)
u64 ret;
clflush_cache_range(__va(hpa), PAGE_SIZE);
- ret = seamcall_ret(TDH_MEM_SEPT_ADD, &args);
+ ret = tdx_seamcall_sept(TDH_MEM_SEPT_ADD, &args);
*rcx = args.rcx;
*rdx = args.rdx;
@@ -1636,7 +1673,7 @@ u64 tdh_mem_page_aug(u64 tdr, u64 gpa, u64 hpa, u64 *rcx, u64 *rdx)
u64 ret;
clflush_cache_range(__va(hpa), PAGE_SIZE);
- ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
+ ret = tdx_seamcall_sept(TDH_MEM_PAGE_AUG, &args);
*rcx = args.rcx;
*rdx = args.rdx;
@@ -1653,7 +1690,7 @@ u64 tdh_mem_range_block(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx)
};
u64 ret;
- ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &args);
+ ret = tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &args);
*rcx = args.rcx;
*rdx = args.rdx;
@@ -1882,7 +1919,7 @@ u64 tdh_mem_page_remove(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx)
};
u64 ret;
- ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &args);
+ ret = tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &args);
*rcx = args.rcx;
*rdx = args.rdx;
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (23 preceding siblings ...)
2024-11-12 7:39 ` [PATCH v2 24/24] [HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT Yan Zhao
@ 2024-11-21 11:51 ` Yan Zhao
2024-11-21 11:56 ` [RFC PATCH 1/2] KVM: TDX: Retry in TDX when installing TD private/sept pages Yan Zhao
` (4 more replies)
2024-12-10 18:21 ` [PATCH v2 00/24] TDX MMU Part 2 Paolo Bonzini
2024-12-24 14:33 ` Paolo Bonzini
26 siblings, 5 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-21 11:51 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: dave.hansen, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
isaku.yamahata, isaku.yamahata, nik.borisov, linux-kernel, x86,
Yan Zhao
This SEPT SEAMCALL retry proposal aims to remove patch
"[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
[1] at the tail of v2 series "TDX MMU Part 2".
==brief history==
In the v1 series 'TDX MMU Part 2', there were several discussions regarding
the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the
SEAMCALL wrapper tdx_seamcall_sept().
The lock status of each SEAMCALL relevant to KVM was analyzed in [2].
The conclusion was that 16 retries were necessary because
- tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
When the TDX module detects that EPT violations are caused by the same
RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter()
will take SEPT tree lock and contend with tdh_mem*().
- tdg_mem_page_accept() can contend with other tdh_mem*().
Sean provided several good suggestions[3], including:
- Implement retries within TDX code when the TDP MMU returns
RET_PF_RETRY_FROZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering
0-step mitigation.
- It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*()
inside TDX module.
- Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and
prevent tdh_vp_enter() during page uninstallation.
Yan later found that retrying only on RET_PF_RETRY_FROZEN within TDX code is
insufficient to prevent 0-step mitigation [4].
Rick and Yan then consulted the TDX module team, with the following findings:
- The zero-step mitigation threshold is counted per vCPU.
Its value is 6 because
"There can be at most 2 mapping faults on instruction fetch
(x86 macro-instructions length is at most 15 bytes) when the
instruction crosses page boundary; then there can be at most 2
mapping faults for each memory operand, when the operand crosses
page boundary. For most of x86 macro-instructions, there are up to 2
memory operands and each one of them is small, which brings us to
maximum 2+2*2 = 6 legal mapping faults."
- Besides tdg_mem_page_accept(), tdg_mem_page_attr_rd/wr() can also
contend with SEAMCALLs tdh_mem*().
So, we decided to make a proposal to tolerate 0-step mitigation.
==proposal details==
The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
together which are required for page installation and uninstallation.
These SEAMCALLs can be divided into three groups:
Group 1: tdh_mem_page_add().
The SEAMCALL is invoked only during TD build time and therefore
KVM has ensured that no contention will occur.
Proposal: (as in patch 1)
Just return error when TDX_OPERAND_BUSY is found.
Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
These two SEAMCALLs are invoked for page installation.
They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
(due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
tdg_mem_page_attr_rd/wr().
Proposal: (as in patch 1)
- Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
- Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
long as there are no pending signals/interrupts.
The retry inside TDX aims to reduce the count of tdh_vp_enter()
before resolving EPT violations in the local vCPU, thereby
minimizing contentions with other vCPUs. However, it can't
completely eliminate 0-step mitigation as it exits when there're
pending signals/interrupts and does not (and cannot) remove the
tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
Resources     SHARED users                      EXCLUSIVE users
------------------------------------------------------------
SEPT tree     tdh_mem_sept_add                  tdh_vp_enter (0-step mitigation)
              tdh_mem_page_aug
------------------------------------------------------------
SEPT entry    tdh_mem_sept_add (Host lock)
              tdh_mem_page_aug (Host lock)
              tdg_mem_page_accept (Guest lock)
              tdg_mem_page_attr_rd (Guest lock)
              tdg_mem_page_attr_wr (Guest lock)
Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
These three SEAMCALLs are invoked for page uninstallation, with
KVM mmu_lock held for writing.
Resources     SHARED users                      EXCLUSIVE users
------------------------------------------------------------
TDCS epoch    tdh_vp_enter                      tdh_mem_track
------------------------------------------------------------
SEPT tree     tdh_mem_page_remove               tdh_vp_enter (0-step mitigation)
              tdh_mem_range_block
------------------------------------------------------------
SEPT entry    tdh_mem_range_block (Host lock)
              tdh_mem_page_remove (Host lock)
              tdg_mem_page_accept (Guest lock)
              tdg_mem_page_attr_rd (Guest lock)
              tdg_mem_page_attr_wr (Guest lock)
Proposal: (as in patch 2)
- Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
once.
- During the retry, kick off all vCPUs and prevent any vCPU from
entering to avoid potential contentions.
This is because tdh_vp_enter() and TDCALLs are not protected by
KVM mmu_lock, and remove_external_spte() in KVM must not fail.
SEAMCALL               Lock Type         Resource
-----------------------------Group 1--------------------------------
tdh_mem_page_add       EXCLUSIVE         TDR
                       NO_LOCK           TDCS
                       NO_LOCK           SEPT tree
                       EXCLUSIVE         page to add
----------------------------Group 2--------------------------------
tdh_mem_sept_add       SHARED            TDR
                       SHARED            TDCS
                       SHARED            SEPT tree
                       HOST,EXCLUSIVE    SEPT entry to modify
                       EXCLUSIVE         page to add
tdh_mem_page_aug       SHARED            TDR
                       SHARED            TDCS
                       SHARED            SEPT tree
                       HOST,EXCLUSIVE    SEPT entry to modify
                       EXCLUSIVE         page to aug
----------------------------Group 3--------------------------------
tdh_mem_range_block    SHARED            TDR
                       SHARED            TDCS
                       EXCLUSIVE         SEPT tree
                       HOST,EXCLUSIVE    SEPT entry to modify
tdh_mem_track          SHARED            TDR
                       SHARED            TDCS
                       EXCLUSIVE         TDCS epoch
tdh_mem_page_remove    SHARED            TDR
                       SHARED            TDCS
                       SHARED            SEPT tree
                       HOST,EXCLUSIVE    SEPT entry to modify
                       EXCLUSIVE         page to remove
[1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
[2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com
[3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com
[4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@yzhao56-desk.sh.intel.com
Yan Zhao (2):
KVM: TDX: Retry in TDX when installing TD private/sept pages
KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/vmx/tdx.c | 102 +++++++++++++++++++++++++-------
3 files changed, 85 insertions(+), 21 deletions(-)
--
2.43.2
^ permalink raw reply [flat|nested] 49+ messages in thread
* [RFC PATCH 1/2] KVM: TDX: Retry in TDX when installing TD private/sept pages
2024-11-21 11:51 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Yan Zhao
@ 2024-11-21 11:56 ` Yan Zhao
2024-11-21 11:57 ` [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
` (3 subsequent siblings)
4 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-21 11:56 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: dave.hansen, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
isaku.yamahata, isaku.yamahata, nik.borisov, linux-kernel, x86,
Yan Zhao
For tdh_mem_page_add(), just return an error when TDX_OPERAND_BUSY is found.
For tdh_mem_sept_add() and tdh_mem_page_aug():
- Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
- Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
long as there are no pending signals/interrupts.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/vmx/tdx.c | 53 +++++++++++++++++++++++++++++++++++-------
2 files changed, 45 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd2eaa1dbebb..f16c0e7248eb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6179,7 +6179,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
vcpu->stat.pf_spurious++;
if (r != RET_PF_EMULATE)
- return 1;
+ return r;
emulate:
return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 23e6f25dd837..60d9e9d050ad 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1705,8 +1705,9 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
err = tdh_mem_sept_add(to_kvm_tdx(kvm)->tdr_pa, gpa, tdx_level, hpa, &entry,
&level_state);
- if (unlikely(err == TDX_ERROR_SEPT_BUSY))
- return -EAGAIN;
+ if (unlikely(err & TDX_OPERAND_BUSY))
+ return -EBUSY;
+
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state);
return -EIO;
@@ -1855,6 +1856,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
gpa_t gpa = tdexit_gpa(vcpu);
unsigned long exit_qual;
+ bool local_retry = false;
+ int ret;
if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
@@ -1873,6 +1876,23 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
* due to aliasing a single HPA to multiple GPAs.
*/
exit_qual = EPT_VIOLATION_ACC_WRITE;
+
+ /*
+ * Mapping of private memory may meet RET_PF_RETRY due to
+ * SEAMCALL contentions, e.g.
+ * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU vs
+ * TDH.VP.ENTER with 0-step mitigation on a remote vCPU.
+ * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU vs
+ * TDG.MEM.PAGE.ACCEPT on a remote vCPU.
+ *
+ * Retry internally in TDX to prevent exacerbating the
+ * activation of 0-step mitigation on local vCPU.
+ * However, despite these retries, the 0-step mitigation on the
+ * local vCPU may still be triggered due to:
+ * - Exiting on signals, interrupts.
+ * - KVM_EXIT_MEMORY_FAULT.
+ */
+ local_retry = true;
} else {
exit_qual = tdexit_exit_qual(vcpu);
/*
@@ -1885,7 +1905,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
}
trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
- return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+
+ while (1) {
+ ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+
+ if (ret != RET_PF_RETRY || !local_retry)
+ break;
+
+ /*
+ * Break and keep the orig return value.
+ * Signal & irq handling will be done later in vcpu_run()
+ */
+ if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
+ kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
+ break;
+
+ cond_resched();
+ }
+ return ret;
}
int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
@@ -3028,13 +3065,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
}
ret = 0;
- do {
- err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
- pfn_to_hpa(page_to_pfn(page)),
- &entry, &level_state);
- } while (err == TDX_ERROR_SEPT_BUSY);
+ err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
+ pfn_to_hpa(page_to_pfn(page)),
+ &entry, &level_state);
if (err) {
- ret = -EIO;
+ ret = unlikely(err & TDX_OPERAND_BUSY) ? -EBUSY : -EIO;
goto out;
}
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-11-21 11:51 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Yan Zhao
2024-11-21 11:56 ` [RFC PATCH 1/2] KVM: TDX: Retry in TDX when installing TD private/sept pages Yan Zhao
@ 2024-11-21 11:57 ` Yan Zhao
2024-11-26 0:47 ` Edgecombe, Rick P
2024-12-17 23:29 ` Sean Christopherson
2024-11-26 0:46 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Edgecombe, Rick P
` (2 subsequent siblings)
4 siblings, 2 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-21 11:57 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: dave.hansen, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
isaku.yamahata, isaku.yamahata, nik.borisov, linux-kernel, x86,
Yan Zhao
For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
- Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
- During the retry, kick off all vCPUs and prevent any vCPU from entering
to avoid potential contentions.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++--------
2 files changed, 40 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 521c7cf725bc..bb7592110337 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -123,6 +123,8 @@
#define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
+#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
+ KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 60d9e9d050ad..ed6b41bbcec6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
__mb();
}
+static void tdx_no_vcpus_enter_start(struct kvm *kvm)
+{
+ kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
+}
+
+static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
+{
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
+}
+
/* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
static int __tdx_reclaim_page(hpa_t pa)
{
@@ -1648,15 +1662,20 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
return -EINVAL;
- do {
- /*
- * When zapping private page, write lock is held. So no race
- * condition with other vcpu sept operation. Race only with
- * TDH.VP.ENTER.
- */
+ /*
+ * When zapping private page, write lock is held. So no race
+ * condition with other vcpu sept operation. Race only with
+ * TDH.VP.ENTER.
+ */
+ err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
+ &level_state);
+ if ((err & TDX_OPERAND_BUSY)) {
+ /* After no vCPUs enter, the second retry is expected to succeed */
+ tdx_no_vcpus_enter_start(kvm);
err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
&level_state);
- } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+ tdx_no_vcpus_enter_stop(kvm);
+ }
if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
@@ -1728,8 +1747,12 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
WARN_ON_ONCE(level != PG_LEVEL_4K);
err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
- if (unlikely(err == TDX_ERROR_SEPT_BUSY))
- return -EAGAIN;
+ if (unlikely(err & TDX_OPERAND_BUSY)) {
+ /* After no vCPUs enter, the second retry is expected to succeed */
+ tdx_no_vcpus_enter_start(kvm);
+ err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
+ tdx_no_vcpus_enter_stop(kvm);
+ }
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
return -EIO;
@@ -1772,9 +1795,13 @@ static void tdx_track(struct kvm *kvm)
lockdep_assert_held_write(&kvm->mmu_lock);
- do {
+ err = tdh_mem_track(kvm_tdx->tdr_pa);
+ if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY) {
+ /* After no vCPUs enter, the second retry is expected to succeed */
+ tdx_no_vcpus_enter_start(kvm);
err = tdh_mem_track(kvm_tdx->tdr_pa);
- } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+ tdx_no_vcpus_enter_stop(kvm);
+ }
if (KVM_BUG_ON(err, kvm))
pr_tdx_error(TDH_MEM_TRACK, err);
--
2.43.2
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
2024-11-21 11:51 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Yan Zhao
2024-11-21 11:56 ` [RFC PATCH 1/2] KVM: TDX: Retry in TDX when installing TD private/sept pages Yan Zhao
2024-11-21 11:57 ` [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
@ 2024-11-26 0:46 ` Edgecombe, Rick P
2024-11-26 6:24 ` Yan Zhao
2024-12-13 1:01 ` Yan Zhao
2024-12-17 17:00 ` Edgecombe, Rick P
4 siblings, 1 reply; 49+ messages in thread
From: Edgecombe, Rick P @ 2024-11-26 0:46 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
Zhao, Yan Y
Cc: isaku.yamahata@gmail.com, x86@kernel.org,
dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, Hunter, Adrian, Lindgren, Tony,
dmatlack@google.com, linux-kernel@vger.kernel.org,
Yamahata, Isaku, Huang, Kai, nik.borisov@suse.com
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> ==proposal details==
>
> The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> together which are required for page installation and uninstallation.
>
> These SEAMCALLs can be divided into three groups:
> Group 1: tdh_mem_page_add().
> The SEAMCALL is invoked only during TD build time and therefore
> KVM has ensured that no contention will occur.
>
> Proposal: (as in patch 1)
> Just return error when TDX_OPERAND_BUSY is found.
>
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
> These two SEAMCALLs are invoked for page installation.
> They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> tdg_mem_page_attr_rd/wr().
We did verify with TDX module folks that the TDX module could be changed to not
take the sept host priority lock for zero entries (that happen during the guest
operations). In that case, I think we shouldn't expect contention for
tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still need it for
tdh_vp_enter() though.
>
> Proposal: (as in patch 1)
> - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
> to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
>
> - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
> long as there are no pending signals/interrupts.
>
> The retry inside TDX aims to reduce the count of tdh_vp_enter()
> before resolving EPT violations in the local vCPU, thereby
> minimizing contentions with other vCPUs. However, it can't
> completely eliminate 0-step mitigation as it exits when there're
> pending signals/interrupts and does not (and cannot) remove the
> tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
>
> Resources SHARED users EXCLUSIVE users
> ------------------------------------------------------------
> SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation)
> tdh_mem_page_aug
> ------------------------------------------------------------
> SEPT entry tdh_mem_sept_add (Host lock)
> tdh_mem_page_aug (Host lock)
> tdg_mem_page_accept (Guest lock)
> tdg_mem_page_attr_rd (Guest lock)
> tdg_mem_page_attr_wr (Guest lock)
>
> Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
> These three SEAMCALLs are invoked for page uninstallation, with
> KVM mmu_lock held for writing.
>
> Resources SHARED users EXCLUSIVE users
> ------------------------------------------------------------
> TDCS epoch tdh_vp_enter tdh_mem_track
> ------------------------------------------------------------
> SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation)
> tdh_mem_range_block
> ------------------------------------------------------------
> SEPT entry tdh_mem_range_block (Host lock)
> tdh_mem_page_remove (Host lock)
> tdg_mem_page_accept (Guest lock)
> tdg_mem_page_attr_rd (Guest lock)
> tdg_mem_page_attr_wr (Guest lock)
>
> Proposal: (as in patch 2)
> - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
> once.
> - During the retry, kick off all vCPUs and prevent any vCPU from
> entering to avoid potential contentions.
>
> This is because tdh_vp_enter() and TDCALLs are not protected by
> KVM mmu_lock, and remove_external_spte() in KVM must not fail.
The solution for group 3 actually doesn't look too bad at all to me, at least
code- and complexity-wise. It's pretty compact, doesn't add any locks, and is
limited to the tdx.c code. Although, I didn't evaluate the implementation
correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet.
Were you able to test the fallback path at all? Can we think of any real
situations where it could be burdensome?
One other thing to note, I think, is that the group 3 SEAMCALLs are part of
no-fail operations; the core KVM calling code has no notion of a failure there.
So in this scheme of not avoiding contention, we have to succeed before
returning, whereas group 1 and 2 can fail and so don't need the special
fallback scheme.
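As a rough illustration of that contrast (tdx_seamcall_to_errno() is a made-up
helper for this sketch, not code from the series): the install paths can simply
propagate a busy error and let the fault be retried, which the removal paths
cannot do.

static int tdx_seamcall_to_errno(struct kvm *kvm, u64 err)
{
	if (err & TDX_OPERAND_BUSY)
		return -EBUSY;	/* groups 1/2: surfaces as RET_PF_RETRY */

	if (KVM_BUG_ON(err, kvm))
		return -EIO;	/* unexpected error: terminate the VM */

	return 0;
}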
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-11-21 11:57 ` [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
@ 2024-11-26 0:47 ` Edgecombe, Rick P
2024-11-26 6:39 ` Yan Zhao
2024-12-17 23:29 ` Sean Christopherson
1 sibling, 1 reply; 49+ messages in thread
From: Edgecombe, Rick P @ 2024-11-26 0:47 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
Zhao, Yan Y
Cc: isaku.yamahata@gmail.com, x86@kernel.org,
dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, Hunter, Adrian, Lindgren, Tony,
dmatlack@google.com, linux-kernel@vger.kernel.org,
Yamahata, Isaku, Huang, Kai, nik.borisov@suse.com
On Thu, 2024-11-21 at 19:57 +0800, Yan Zhao wrote:
>
> +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> +{
I wonder if an mmu write lock assert here would be excessive.
> + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> +}
> +
> +static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
> +{
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
> +}
> +
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
2024-11-26 0:46 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Edgecombe, Rick P
@ 2024-11-26 6:24 ` Yan Zhao
0 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-26 6:24 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
isaku.yamahata@gmail.com, x86@kernel.org,
dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, Hunter, Adrian, Lindgren, Tony,
dmatlack@google.com, linux-kernel@vger.kernel.org,
Yamahata, Isaku, Huang, Kai, nik.borisov@suse.com
On Tue, Nov 26, 2024 at 08:46:51AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> > ==proposal details==
> >
> > The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> > together which are required for page installation and uninstallation.
> >
> > These SEAMCALLs can be divided into three groups:
> > Group 1: tdh_mem_page_add().
> > The SEAMCALL is invoked only during TD build time and therefore
> > KVM has ensured that no contention will occur.
> >
> > Proposal: (as in patch 1)
> > Just return error when TDX_OPERAND_BUSY is found.
> >
> > Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
> > These two SEAMCALLs are invoked for page installation.
> > They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> > (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> > tdg_mem_page_attr_rd/wr().
>
> We did verify with TDX module folks that the TDX module could be changed to not
> take the sept host priority lock for zero entries (that happen during the guest
> operations). In that case, I think we shouldn't expect contention for
> tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still need it for
> tdh_vp_enter() though.
Ah, you are right.
I previously, and incorrectly, thought the TDX module would avoid locking free
entries only for tdg_mem_page_accept().
>
>
> >
> > Proposal: (as in patch 1)
> > - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
> > to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
> >
> > - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
> > long as there are no pending signals/interrupts.
> >
> > The retry inside TDX aims to reduce the count of tdh_vp_enter()
> > before resolving EPT violations in the local vCPU, thereby
> > minimizing contentions with other vCPUs. However, it can't
> > completely eliminate 0-step mitigation as it exits when there're
> > pending signals/interrupts and does not (and cannot) remove the
> > tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
> >
> > Resources SHARED users EXCLUSIVE users
> > ------------------------------------------------------------
> > SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation)
> > tdh_mem_page_aug
> > ------------------------------------------------------------
> > SEPT entry tdh_mem_sept_add (Host lock)
> > tdh_mem_page_aug (Host lock)
> > tdg_mem_page_accept (Guest lock)
> > tdg_mem_page_attr_rd (Guest lock)
> > tdg_mem_page_attr_wr (Guest lock)
> >
> > Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
> > These three SEAMCALLs are invoked for page uninstallation, with
> > KVM mmu_lock held for writing.
> >
> > Resources SHARED users EXCLUSIVE users
> > ------------------------------------------------------------
> > TDCS epoch tdh_vp_enter tdh_mem_track
> > ------------------------------------------------------------
> > SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation)
> > tdh_mem_range_block
> > ------------------------------------------------------------
> > SEPT entry tdh_mem_range_block (Host lock)
> > tdh_mem_page_remove (Host lock)
> > tdg_mem_page_accept (Guest lock)
> > tdg_mem_page_attr_rd (Guest lock)
> > tdg_mem_page_attr_wr (Guest lock)
> >
> > Proposal: (as in patch 2)
> > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
> > once.
> > - During the retry, kick off all vCPUs and prevent any vCPU from
> > entering to avoid potential contentions.
> >
> > This is because tdh_vp_enter() and TDCALLs are not protected by
> > KVM mmu_lock, and remove_external_spte() in KVM must not fail.
>
> The solution for group 3 actually doesn't look too bad at all to me, at least
> code- and complexity-wise. It's pretty compact, doesn't add any locks, and is
> limited to the tdx.c code. Although, I didn't evaluate the implementation
> correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet.
>
> Were you able to test the fallback path at all? Can we think of any real
> situations where it could be burdensome?
Yes, I did some negative tests that force block/track/remove to fail, to check
whether the tdx_no_vcpus_enter*() helpers work.
Even without those negative tests, it's not rare for tdh_mem_track() to return
busy due to its contention with tdh_vp_enter().
Given that it's normally not common for tdh_mem_range_block() or
tdh_mem_page_remove() to return busy (if we reduce the frequency of zero-step
mitigation), and that tdh_mem_track() will kick off all vCPUs later anyway, I
think it's acceptable to do the tdx_no_vcpus_enter*() handling in the page
removal path.
> One other thing to note, I think, is that the group 3 SEAMCALLs are part of
> no-fail operations; the core KVM calling code has no notion of a failure there.
> So in this scheme of not avoiding contention, we have to succeed before
> returning, whereas group 1 and 2 can fail and so don't need the special
> fallback scheme.
Yes, exactly.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-11-26 0:47 ` Edgecombe, Rick P
@ 2024-11-26 6:39 ` Yan Zhao
0 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-26 6:39 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
isaku.yamahata@gmail.com, x86@kernel.org,
dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, Hunter, Adrian, Lindgren, Tony,
dmatlack@google.com, linux-kernel@vger.kernel.org,
Yamahata, Isaku, Huang, Kai, nik.borisov@suse.com
On Tue, Nov 26, 2024 at 08:47:42AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-11-21 at 19:57 +0800, Yan Zhao wrote:
> >
> > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > +{
>
> I wonder if an mmu write lock assert here would be excessive.
Hmm, I didn't add it because all current callers already assert on that.
But asserting here would be safer, given that the function itself doesn't take
a lock or maintain a counter.
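A minimal sketch of what that could look like (illustrative only; everything
except the lockdep assertion matches the quoted RFC code):

static void tdx_no_vcpus_enter_start(struct kvm *kvm)
{
	/* All callers are expected to hold mmu_lock for write. */
	lockdep_assert_held_write(&kvm->mmu_lock);

	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
}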
> > + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> > +}
> > +
> > +static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
> > +{
> > + struct kvm_vcpu *vcpu;
> > + unsigned long i;
> > +
> > + kvm_for_each_vcpu(i, vcpu, kvm)
> > + kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
> > +}
> > +
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 21/24] KVM: TDX: Add an ioctl to create initial guest memory
2024-11-12 7:38 ` [PATCH v2 21/24] KVM: TDX: Add an ioctl to create initial guest memory Yan Zhao
@ 2024-11-27 18:08 ` Nikolay Borisov
2024-11-28 2:20 ` Yan Zhao
0 siblings, 1 reply; 49+ messages in thread
From: Nikolay Borisov @ 2024-11-27 18:08 UTC (permalink / raw)
To: Yan Zhao, pbonzini, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, linux-kernel, x86
On 12.11.24 at 9:38, Yan Zhao wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Add a new ioctl for the user space VMM to initialize guest memory with the
> specified memory contents.
>
> Because TDX protects the guest's memory, the creation of the initial guest
> memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
> directly copying the memory contents into the guest's memory in the case of
> the default VM type.
>
> Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
> KVM_MEMORY_ENCRYPT_OP. Check if the GFN is already pre-allocated, assign
> the guest page in Secure-EPT, copy the initial memory contents into the
> guest memory, and encrypt the guest memory. Optionally, extend the memory
> measurement of the TDX guest.
>
> The ioctl uses the vCPU file descriptor because of the TDX module's
> requirement that the memory is added to the S-EPT (via TDH.MEM.SEPT.ADD)
> prior to initialization (TDH.MEM.PAGE.ADD). Accessing the MMU in turn
> requires a vCPU file descriptor, just like for KVM_PRE_FAULT_MEMORY. In
> fact, the post-populate callback is able to reuse the same logic used by
> KVM_PRE_FAULT_MEMORY, so that userspace can do everything with a single
> ioctl.
>
> Note that this is the only way to invoke TDH.MEM.SEPT.ADD before the TD
> is finalized, as userspace cannot use KVM_PRE_FAULT_MEMORY at that
> point. This ensures that there cannot be pages in the S-EPT awaiting
> TDH.MEM.PAGE.ADD, which would be treated incorrectly as spurious by
> tdp_mmu_map_handle_target_level() (KVM would see the SPTE as PRESENT,
> but the corresponding S-EPT entry will be !PRESENT).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> TDX MMU part 2 v2:
> - Updated commit msg (Paolo)
> - Added a guard around kvm_tdp_mmu_gpa_is_mapped() (Paolo).
> - Remove checking kvm_mem_is_private() in tdx_gmem_post_populate (Rick)
> - No need for is_td_finalized() (Rick)
> - Remove decrement of nr_premapped (moved to "Finalize VM initialization"
> patch) (Paolo)
> - Take slots_lock before checking kvm_tdx->finalized in
> tdx_vcpu_init_mem_region(), and use guard() (Paolo)
> - Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
> wrappers (Kai)
> - Add TD state handling (Tony)
>
> TDX MMU part 2 v1:
> - Update the code according to latest gmem update.
> https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/
> - Fixup an alignment bug reported by Binbin.
> - Rename KVM_MEMORY_MAPPING => KVM_MAP_MEMORY (Sean)
> - Drop issuing TDH.MEM.PAGE.ADD() on KVM_MAP_MEMORY(), defer it to
> KVM_TDX_INIT_MEM_REGION. (Sean)
> - Added nr_premapped to track the number of premapped pages
> - Drop tdx_post_mmu_map_page().
> - Drop kvm_slot_can_be_private() check (Paolo)
> - Use kvm_tdp_mmu_gpa_is_mapped() (Paolo)
>
> v19:
> - Switched to use KVM_MEMORY_MAPPING
> - Dropped measurement extension
> - updated commit message. private_page_add() => set_private_spte()
> ---
> arch/x86/include/uapi/asm/kvm.h | 9 ++
> arch/x86/kvm/vmx/tdx.c | 147 ++++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 1 +
> 3 files changed, 157 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 36fa03376581..a19cd84cec76 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -931,6 +931,7 @@ enum kvm_tdx_cmd_id {
> KVM_TDX_CAPABILITIES = 0,
> KVM_TDX_INIT_VM,
> KVM_TDX_INIT_VCPU,
> + KVM_TDX_INIT_MEM_REGION,
> KVM_TDX_GET_CPUID,
>
> KVM_TDX_CMD_NR_MAX,
> @@ -985,4 +986,12 @@ struct kvm_tdx_init_vm {
> struct kvm_cpuid2 cpuid;
> };
>
> +#define KVM_TDX_MEASURE_MEMORY_REGION _BITULL(0)
> +
> +struct kvm_tdx_init_mem_region {
> + __u64 source_addr;
> + __u64 gpa;
> + __u64 nr_pages;
> +};
> +
> #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ead520083397..15cedacd717a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1,4 +1,5 @@
> // SPDX-License-Identifier: GPL-2.0
> +#include <linux/cleanup.h>
> #include <linux/cpu.h>
> #include <asm/tdx.h>
> #include "capabilities.h"
> @@ -7,6 +8,7 @@
> #include "tdx.h"
> #include "vmx.h"
> #include "mmu/spte.h"
> +#include "common.h"
>
> #undef pr_fmt
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> @@ -1597,6 +1599,148 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> return 0;
> }
>
> +struct tdx_gmem_post_populate_arg {
> + struct kvm_vcpu *vcpu;
> + __u32 flags;
> +};
> +
> +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> + void __user *src, int order, void *_arg)
> +{
> + u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + struct tdx_gmem_post_populate_arg *arg = _arg;
> + struct kvm_vcpu *vcpu = arg->vcpu;
> + gpa_t gpa = gfn_to_gpa(gfn);
> + u8 level = PG_LEVEL_4K;
> + struct page *page;
> + int ret, i;
> + u64 err, entry, level_state;
> +
> + /*
> + * Get the source page if it has been faulted in. Return failure if the
> + * source page has been swapped out or unmapped in primary memory.
> + */
> + ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
> + if (ret < 0)
> + return ret;
> + if (ret != 1)
> + return -ENOMEM;
> +
> + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> + if (ret < 0)
> + goto out;
> +
> + /*
> + * The private mem cannot be zapped after kvm_tdp_map_page()
> + * because all paths are covered by slots_lock and the
> + * filemap invalidate lock. Check that they are indeed enough.
> + */
> + if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> + scoped_guard(read_lock, &kvm->mmu_lock) {
> + if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
> + ret = -EIO;
> + goto out;
> + }
> + }
> + }
> +
> + ret = 0;
> + do {
> + err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
> + pfn_to_hpa(page_to_pfn(page)),
> + &entry, &level_state);
> + } while (err == TDX_ERROR_SEPT_BUSY);
> + if (err) {
> + ret = -EIO;
> + goto out;
> + }
> +
> + if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> + err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &entry,
> + &level_state);
> + if (err) {
> + ret = -EIO;
> + break;
> + }
> + }
> + }
> +
> +out:
> + put_page(page);
> + return ret;
> +}
> +
> +static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + struct kvm_tdx_init_mem_region region;
> + struct tdx_gmem_post_populate_arg arg;
> + long gmem_ret;
> + int ret;
> +
> + if (tdx->state != VCPU_TD_STATE_INITIALIZED)
> + return -EINVAL;
> +
> + guard(mutex)(&kvm->slots_lock);
It seems the scope of this lock can be reduced. It's really needed for
the kvm_gmem_populate call only, no?
<snip>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 21/24] KVM: TDX: Add an ioctl to create initial guest memory
2024-11-27 18:08 ` Nikolay Borisov
@ 2024-11-28 2:20 ` Yan Zhao
0 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-11-28 2:20 UTC (permalink / raw)
To: Nikolay Borisov
Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, linux-kernel,
x86
On Wed, Nov 27, 2024 at 08:08:26PM +0200, Nikolay Borisov wrote:
>
> On 12.11.24 at 9:38, Yan Zhao wrote:
> > +static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> > +{
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > + struct kvm *kvm = vcpu->kvm;
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + struct kvm_tdx_init_mem_region region;
> > + struct tdx_gmem_post_populate_arg arg;
> > + long gmem_ret;
> > + int ret;
> > +
> > + if (tdx->state != VCPU_TD_STATE_INITIALIZED)
> > + return -EINVAL;
> > +
> > + guard(mutex)(&kvm->slots_lock);
>
> It seems the scope of this lock can be reduced. It's really needed for the
> kvm_gmem_populate call only, no?
Strictly speaking, yes.
But this KVM_TDX_INIT_MEM_REGION ioctl is only expected to be executed after
QEMU machine creation is done and before any vCPU starts running. So no slot
changes are expected during the ioctl.
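For reference, the narrower scope being suggested might look roughly like the
sketch below; the kvm_gmem_populate() arguments just mirror the pattern used by
the posted function and are illustrative, not verbatim from the patch.

	/* Hold slots_lock only around the actual population step. */
	scoped_guard(mutex, &kvm->slots_lock)
		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
					     u64_to_user_ptr(region.source_addr),
					     1, tdx_gmem_post_populate, &arg);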
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/24] TDX MMU Part 2
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (24 preceding siblings ...)
2024-11-21 11:51 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Yan Zhao
@ 2024-12-10 18:21 ` Paolo Bonzini
2024-12-24 14:33 ` Paolo Bonzini
26 siblings, 0 replies; 49+ messages in thread
From: Paolo Bonzini @ 2024-12-10 18:21 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
Applied to kvm-coco-queue, thanks. Tomorrow I will go through the
changes and review.
Paolo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
2024-11-21 11:51 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Yan Zhao
` (2 preceding siblings ...)
2024-11-26 0:46 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Edgecombe, Rick P
@ 2024-12-13 1:01 ` Yan Zhao
2024-12-17 17:00 ` Edgecombe, Rick P
4 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-12-13 1:01 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: dave.hansen, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
isaku.yamahata, isaku.yamahata, nik.borisov, linux-kernel, x86
On Thu, Nov 21, 2024 at 07:51:39PM +0800, Yan Zhao wrote:
> This SEPT SEAMCALL retry proposal aims to remove patch
> "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> [1] at the tail of v2 series "TDX MMU Part 2".
>
> ==brief history==
>
> In the v1 series 'TDX MMU Part 2', there were several discussions regarding
> the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the
> SEAMCALL wrapper tdx_seamcall_sept().
>
> The lock status of each SEAMCALL relevant to KVM was analyzed in [2].
>
> The conclusion was that 16 retries was necessary because
> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
>
> When the TDX module detects that EPT violations are caused by the same
> RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter()
> will take SEPT tree lock and contend with tdh_mem*().
>
> - tdg_mem_page_accept() can contend with other tdh_mem*().
>
>
> Sean provided several good suggestions[3], including:
> - Implement retries within TDX code when the TDP MMU returns
> RET_PF_RETRY_FOZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering
> 0-step mitigation.
> - It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*()
> inside TDX module.
> - Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and
> prevent tdh_vp_enter() during page uninstallation.
>
> Yan later found out that retrying only on RET_PF_RETRY_FROZEN within TDX code
> is insufficient to prevent 0-step mitigation [4].
>
> Rick and Yan then consulted TDX module team with findings that:
> - The threshold of zero-step mitigation is counted per vCPU.
> It's of value 6 because
>
> "There can be at most 2 mapping faults on instruction fetch
> (x86 macro-instructions length is at most 15 bytes) when the
> instruction crosses page boundary; then there can be at most 2
> mapping faults for each memory operand, when the operand crosses
> page boundary. For most of x86 macro-instructions, there are up to 2
> memory operands and each one of them is small, which brings us to
> maximum 2+2*2 = 6 legal mapping faults."
>
> - Besides tdg_mem_page_accept(), tdg_mem_page_attr_rd/wr() can also
> contend with SEAMCALLs tdh_mem*().
>
> So, we decided to make a proposal to tolerate 0-step mitigation.
>
> ==proposal details==
>
> The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> together which are required for page installation and uninstallation.
>
> These SEAMCALLs can be divided into three groups:
> Group 1: tdh_mem_page_add().
> The SEAMCALL is invoked only during TD build time and therefore
> KVM has ensured that no contention will occur.
>
> Proposal: (as in patch 1)
> Just return error when TDX_OPERAND_BUSY is found.
>
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
> These two SEAMCALLs are invoked for page installation.
> They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> tdg_mem_page_attr_rd/wr().
>
> Proposal: (as in patch 1)
> - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
> to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
>
> - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
> long as there are no pending signals/interrupts.
Alternatively, we can have tdx_handle_ept_violation() not retry internally in
TDX.
Instead, keep the 16-retry loop in tdx_seamcall_sept() for tdh_mem_sept_add()
and tdh_mem_page_aug() only, i.e. only for the SEAMCALLs in Group 2.
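For illustration, a bounded-retry wrapper along those lines could look roughly
like this; the helper name, the seamcall_ret() plumbing and the retry count are
assumptions for the sketch, not the posted code.

#define TDX_SEPT_RETRY_MAX	16

static u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
{
	int retries = TDX_SEPT_RETRY_MAX;
	u64 err;

	/* Only the page-installation SEAMCALLs (Group 2) would use this. */
	do {
		err = seamcall_ret(op, in);
	} while (err == TDX_ERROR_SEPT_BUSY && --retries > 0);

	return err;
}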
>
> The retry inside TDX aims to reduce the count of tdh_vp_enter()
> before resolving EPT violations in the local vCPU, thereby
> minimizing contentions with other vCPUs. However, it can't
> completely eliminate 0-step mitigation as it exits when there're
> pending signals/interrupts and does not (and cannot) remove the
> tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
>
> Resources SHARED users EXCLUSIVE users
> ------------------------------------------------------------
> SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation)
> tdh_mem_page_aug
> ------------------------------------------------------------
> SEPT entry tdh_mem_sept_add (Host lock)
> tdh_mem_page_aug (Host lock)
> tdg_mem_page_accept (Guest lock)
> tdg_mem_page_attr_rd (Guest lock)
> tdg_mem_page_attr_wr (Guest lock)
>
> Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
> These three SEAMCALLs are invoked for page uninstallation, with
> KVM mmu_lock held for writing.
>
> Resources SHARED users EXCLUSIVE users
> ------------------------------------------------------------
> TDCS epoch tdh_vp_enter tdh_mem_track
> ------------------------------------------------------------
> SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation)
> tdh_mem_range_block
> ------------------------------------------------------------
> SEPT entry tdh_mem_range_block (Host lock)
> tdh_mem_page_remove (Host lock)
> tdg_mem_page_accept (Guest lock)
> tdg_mem_page_attr_rd (Guest lock)
> tdg_mem_page_attr_wr (Guest lock)
>
> Proposal: (as in patch 2)
> - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
> once.
> - During the retry, kick off all vCPUs and prevent any vCPU from
> entering to avoid potential contentions.
>
> This is because tdh_vp_enter() and TDCALLs are not protected by
> KVM mmu_lock, and remove_external_spte() in KVM must not fail.
>
>
>
> SEAMCALL Lock Type Resource
> -----------------------------Group 1--------------------------------
> tdh_mem_page_add EXCLUSIVE TDR
> NO_LOCK TDCS
> NO_LOCK SEPT tree
> EXCLUSIVE page to add
>
> ----------------------------Group 2--------------------------------
> tdh_mem_sept_add SHARED TDR
> SHARED TDCS
> SHARED SEPT tree
> HOST,EXCLUSIVE SEPT entry to modify
> EXCLUSIVE page to add
>
>
> tdh_mem_page_aug SHARED TDR
> SHARED TDCS
> SHARED SEPT tree
> HOST,EXCLUSIVE SEPT entry to modify
> EXCLUSIVE page to aug
>
> ----------------------------Group 3--------------------------------
> tdh_mem_range_block SHARED TDR
> SHARED TDCS
> EXCLUSIVE SEPT tree
> HOST,EXCLUSIVE SEPT entry to modify
>
> tdh_mem_track SHARED TDR
> SHARED TDCS
> EXCLUSIVE TDCS epoch
>
> tdh_mem_page_remove SHARED TDR
> SHARED TDCS
> SHARED SEPT tree
> HOST,EXCLUSIVE SEPT entry to modify
> EXCLUSIVE page to remove
>
>
> [1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
> [2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com
> [3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com
> [4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@yzhao56-desk.sh.intel.com
>
> Yan Zhao (2):
> KVM: TDX: Retry in TDX when installing TD private/sept pages
> KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
>
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/vmx/tdx.c | 102 +++++++++++++++++++++++++-------
> 3 files changed, 85 insertions(+), 21 deletions(-)
>
> --
> 2.43.2
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
2024-11-21 11:51 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Yan Zhao
` (3 preceding siblings ...)
2024-12-13 1:01 ` Yan Zhao
@ 2024-12-17 17:00 ` Edgecombe, Rick P
2024-12-17 23:18 ` Sean Christopherson
4 siblings, 1 reply; 49+ messages in thread
From: Edgecombe, Rick P @ 2024-12-17 17:00 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
Zhao, Yan Y
Cc: isaku.yamahata@gmail.com, x86@kernel.org,
dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, Hunter, Adrian, Lindgren, Tony,
dmatlack@google.com, linux-kernel@vger.kernel.org,
Yamahata, Isaku, Huang, Kai, nik.borisov@suse.com
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> This SEPT SEAMCALL retry proposal aims to remove patch
> "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> [1] at the tail of v2 series "TDX MMU Part 2".
We discussed this on the PUCK call. A couple of alternatives were considered:
- Avoiding 0-step. To handle signals kicking us out to userspace, we could try
to add code to generate synthetic EPT violations if KVM thinks the 0-step
mitigation might be active (i.e. the fault was not resolved). The consensus was
that this would be a continuing battle and possibly impossible, due to normal
guest behavior triggering the mitigation.
- Pre-faulting all S-EPT, such that contention with AUG won't happen. The
discussion was that this would only be a temporary solution as the MMU
operations get more complicated (huge pages, etc.). There are also
private/shared conversions and memory hotplug to consider already.
So we will proceed with this kick+lock+retry solution. The reasoning is to
optimize for the normal non-contention path, without having an overly
complicated solution for KVM.
In all the branch commotion recently, these patches fell out of our dev branch.
So we just recently integrated them into a 6.13 kvm-coco-queue based branch. We
need to perform some regression tests based on the 6.13 TDP MMU changes.
Assuming no issues, we can post the 6.13 rebase to be included in
kvm-coco-queue, with instructions on which patches to remove from
kvm-coco-queue (i.e. the 16 retries).
We also briefly touched on the TDX module behavior where guest operations can
lock NP PTEs. The kick solution doesn't require changing this functionally, but
it should still be done to help with debugging issues related to KVM's
contention solution.
Thanks all for the discussion!
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
2024-12-17 17:00 ` Edgecombe, Rick P
@ 2024-12-17 23:18 ` Sean Christopherson
0 siblings, 0 replies; 49+ messages in thread
From: Sean Christopherson @ 2024-12-17 23:18 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Yan Y Zhao,
isaku.yamahata@gmail.com, x86@kernel.org,
dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
Xiaoyao Li, Reinette Chatre, Adrian Hunter, Tony Lindgren,
dmatlack@google.com, linux-kernel@vger.kernel.org, Isaku Yamahata,
Kai Huang, nik.borisov@suse.com
On Tue, Dec 17, 2024, Rick P Edgecombe wrote:
> On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> > This SEPT SEAMCALL retry proposal aims to remove patch
> > "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> > [1] at the tail of v2 series "TDX MMU Part 2".
>
> We discussed this on the PUCK call. A couple of alternatives were considered:
> - Avoiding 0-step. To handle signals kicking us out to userspace, we could try
> to add code to generate synthetic EPT violations if KVM thinks the 0-step
> mitigation might be active (i.e. the fault was not resolved). The consensus
> was that this would be a continuing battle and possibly impossible, due to
> normal guest behavior triggering the mitigation.
Specifically, the TDX Module takes its write-lock if the guest takes EPT
violation exits on the same RIP 6 times, i.e. it detects forward progress based
purely on the RIP at entry vs. exit. So a guest that is touching memory in a
loop could trigger zero-step checking even if KVM promptly fixes every EPT
violation.
> - Pre-faulting all S-EPT, such that contention with AUG won't happen. The
> discussion was that this would only be a temporary solution as the MMU
> operations get more complicated (huge pages, etc.). There are also
> private/shared conversions and memory hotplug to consider already.
>
> So we will proceed with this kick+lock+retry solution. The reasoning is to
> optimize for the normal non-contention path, without having an overly
> complicated solution for KVM.
>
> In all the branch commotion recently, these patches fell out of our dev branch.
> So we just recently integrated them into a 6.13 kvm-coco-queue based branch. We
> need to perform some regression tests based on the 6.13 TDP MMU changes.
> Assuming no issues, we can post the 6.13 rebase to be included in
> kvm-coco-queue, with instructions on which patches to remove from
> kvm-coco-queue (i.e. the 16 retries).
>
>
> We also briefly touched on the TDX module behavior where guest operations can
> lock NP PTEs. The kick solution doesn't require changing this functionally, but
> it should still be done to help with debugging issues related to KVM's
> contention solution.
And so that KVM developers don't have to deal with customer escalations due to
performance issues caused by known flaws in the TDX module.
>
> Thanks all for the discussion!
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-11-21 11:57 ` [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
2024-11-26 0:47 ` Edgecombe, Rick P
@ 2024-12-17 23:29 ` Sean Christopherson
2024-12-18 5:45 ` Yan Zhao
1 sibling, 1 reply; 49+ messages in thread
From: Sean Christopherson @ 2024-12-17 23:29 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Thu, Nov 21, 2024, Yan Zhao wrote:
> For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
>
> - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> - During the retry, kick off all vCPUs and prevent any vCPU from entering
> to avoid potential contentions.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++--------
> 2 files changed, 40 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 521c7cf725bc..bb7592110337 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -123,6 +123,8 @@
> #define KVM_REQ_HV_TLB_FLUSH \
> KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
> +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> + KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>
> #define CR0_RESERVED_BITS \
> (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 60d9e9d050ad..ed6b41bbcec6 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
> __mb();
> }
>
> +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> +{
> + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
And with appropriate helpers in common code. I could have sworn I floated this
idea in the past for something else, but apparently not. The only thing I can
find is an old arm64 version for pausing vCPUs for emulation. Hmm, maybe I was
thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
Anyways, I don't see any reason to make this an arch specific request.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-12-17 23:29 ` Sean Christopherson
@ 2024-12-18 5:45 ` Yan Zhao
2024-12-18 16:10 ` Sean Christopherson
0 siblings, 1 reply; 49+ messages in thread
From: Yan Zhao @ 2024-12-18 5:45 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> On Thu, Nov 21, 2024, Yan Zhao wrote:
> > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> >
> > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> > to avoid potential contentions.
> >
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++--------
> > 2 files changed, 40 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 521c7cf725bc..bb7592110337 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -123,6 +123,8 @@
> > #define KVM_REQ_HV_TLB_FLUSH \
> > KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
> > +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> > + KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >
> > #define CR0_RESERVED_BITS \
> > (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 60d9e9d050ad..ed6b41bbcec6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
> > __mb();
> > }
> >
> > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > +{
> > + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
>
> I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
the guest's pause state?
> And with appropriate helpers in common code. I could have sworn I floated this
> idea in the past for something else, but apparently not. The only thing I can
Yes, you suggested that I implement it via a request, similar to
KVM_REQ_MCLOCK_INPROGRESS. [1].
(I didn't add your suggested-by tag in this patch because it's just an RFC).
[1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/
> find is an old arm64 version for pausing vCPUs for emulation. Hmm, maybe I was
> thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode, it does not set
a bit in vcpu->requests to prevent later vCPUs entering.
> Anyways, I don't see any reason to make this an arch specific request.
After making it non-arch specific, probably we need an atomic counter for the
start/stop requests in the common helpers. So I just made it TDX-specific to
keep it simple in the RFC.
Will do it in a non-arch-specific way if you think it's worthwhile.
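For the sake of discussion, generic helpers with such a counter might look
roughly like the sketch below; KVM_REQ_PAUSE, the kvm->pause_count field and
the helper names are all assumptions for illustration, and a real version would
need to handle pausers racing with the first/last transition.

/* Hypothetical common request; the bit number is arbitrary here. */
#define KVM_REQ_PAUSE	(4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)

void kvm_pause_all_vcpus(struct kvm *kvm)
{
	/* The first pauser sets the request on every vCPU and kicks them. */
	if (atomic_inc_return(&kvm->pause_count) == 1)
		kvm_make_all_cpus_request(kvm, KVM_REQ_PAUSE);
}

void kvm_resume_all_vcpus(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	unsigned long i;

	/* The last pauser clears the request so vCPUs can re-enter. */
	if (atomic_dec_and_test(&kvm->pause_count))
		kvm_for_each_vcpu(i, vcpu, kvm)
			kvm_clear_request(KVM_REQ_PAUSE, vcpu);
}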
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-12-18 5:45 ` Yan Zhao
@ 2024-12-18 16:10 ` Sean Christopherson
2024-12-19 1:52 ` Yan Zhao
0 siblings, 1 reply; 49+ messages in thread
From: Sean Christopherson @ 2024-12-18 16:10 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Wed, Dec 18, 2024, Yan Zhao wrote:
> On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> > On Thu, Nov 21, 2024, Yan Zhao wrote:
> > > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> > >
> > > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> > > to avoid potential contentions.
> > >
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++--------
> > > 2 files changed, 40 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 521c7cf725bc..bb7592110337 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -123,6 +123,8 @@
> > > #define KVM_REQ_HV_TLB_FLUSH \
> > > KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > > #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
> > > +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> > > + KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > >
> > > #define CR0_RESERVED_BITS \
> > > (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 60d9e9d050ad..ed6b41bbcec6 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
> > > __mb();
> > > }
> > >
> > > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > > +{
> > > + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> >
> > I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
> KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
> the guest's pause state?
Maybe?
> > And with appropriate helpers in common code. I could have sworn I floated this
> > idea in the past for something else, but apparently not. The only thing I can
> Yes, you suggested that I implement it via a request, similar to
> KVM_REQ_MCLOCK_INPROGRESS. [1].
> (I didn't add your suggested-by tag in this patch because it's just an RFC).
>
> [1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/
>
> > find is an old arm64 version for pausing vCPUs for emulation. Hmm, maybe I was
> > thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
> KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode, it does not set
> a bit in vcpu->requests to prevent later vCPUs entering.
Yeah, I was mostly just talking to myself. :-)
> > Anyways, I don't see any reason to make this an arch specific request.
> After making it non-arch specific, probably we need an atomic counter for the
> start/stop requests in the common helpers. So I just made it TDX-specific to
> keep it simple in the RFC.
Oh, right. I didn't consider the complications with multiple users. Hrm.
Actually, this doesn't need to be a request. KVM_REQ_OUTSIDE_GUEST_MODE will
force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does. E.g.
have the zap side set wait_for_sept_zap before blasting the request to all
vCPUs, and then in tdx_vcpu_run():
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b49dcf32206b..508ad6462e6d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -921,6 +921,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
return EXIT_FASTPATH_NONE;
}
+ if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
+ return EXIT_FASTPATH_EXIT_HANDLED;
+
trace_kvm_entry(vcpu, force_immediate_exit);
if (pi_test_on(&tdx->pi_desc)) {
Ooh, but there's a subtle flaw with that approach. Unlike kvm_vcpu_exit_request(),
the above check would obviously happen before entry to the guest, which means that,
in theory, KVM needs to goto cancel_injection to re-request req_immediate_exit and
cancel injection:
if (req_immediate_exit)
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_x86_call(cancel_injection)(vcpu);
But! This is actually an opportunity to harden KVM. Because the TDX module
doesn't guarantee entry, it's already possible for KVM to _think_ it completed
entry to the guest without actually having done so. It just happens to work
because KVM never needs to force an immediate exit for TDX, and can't do direct
injection, i.e. it can "safely" skip the cancel_injection path.
So, I think we can and should go with the above suggestion, but also add a WARN
on req_immediate_exit being set, because TDX ignores it entirely.
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b49dcf32206b..e23cd8231144 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -914,6 +914,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
struct vcpu_tdx *tdx = to_tdx(vcpu);
struct vcpu_vt *vt = to_tdx(vcpu);
+ /* <comment goes here> */
+ WARN_ON_ONCE(force_immediate_exit);
+
/* TDX exit handle takes care of this error case. */
if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
tdx->vp_enter_ret = TDX_SW_ERROR;
@@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
return EXIT_FASTPATH_NONE;
}
+ if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
+ return EXIT_FASTPATH_EXIT_HANDLED;
+
trace_kvm_entry(vcpu, force_immediate_exit);
if (pi_test_on(&tdx->pi_desc)) {
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-12-18 16:10 ` Sean Christopherson
@ 2024-12-19 1:52 ` Yan Zhao
2024-12-19 2:39 ` Sean Christopherson
0 siblings, 1 reply; 49+ messages in thread
From: Yan Zhao @ 2024-12-19 1:52 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> On Wed, Dec 18, 2024, Yan Zhao wrote:
> > On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> > > On Thu, Nov 21, 2024, Yan Zhao wrote:
> > > > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> > > >
> > > > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > > > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> > > > to avoid potential contentions.
> > > >
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > > arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++--------
> > > > 2 files changed, 40 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 521c7cf725bc..bb7592110337 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -123,6 +123,8 @@
> > > > #define KVM_REQ_HV_TLB_FLUSH \
> > > > KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > > > #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
> > > > +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> > > > + KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > > >
> > > > #define CR0_RESERVED_BITS \
> > > > (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index 60d9e9d050ad..ed6b41bbcec6 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
> > > > __mb();
> > > > }
> > > >
> > > > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > > > +{
> > > > + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> > >
> > > I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
> > KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
> > the guest's pause state?
>
> Maybe?
>
> > > And with appropriate helpers in common code. I could have sworn I floated this
> > > idea in the past for something else, but apparently not. The only thing I can
> > Yes, you suggested that I implement it via a request, similar to
> > KVM_REQ_MCLOCK_INPROGRESS. [1].
> > (I didn't add your suggested-by tag in this patch because it's just an RFC).
> >
> > [1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/
> >
> > > find is an old arm64 version for pausing vCPUs for emulation. Hmm, maybe I was
> > > thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
> > KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode, it does not set
> > a bit in vcpu->requests to prevent later vCPUs entering.
>
> Yeah, I was mostly just talking to myself. :-)
>
> > > Anyways, I don't see any reason to make this an arch specific request.
> > After making it non-arch specific, probably we need an atomic counter for the
> > start/stop requests in the common helpers. So I just made it TDX-specific to
> > keep it simple in the RFC.
>
> Oh, right. I didn't consider the complications with multiple users. Hrm.
>
> Actually, this doesn't need to be a request. KVM_REQ_OUTSIDE_GUEST_MODE will
> force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does. E.g.
> have the zap side set wait_for_sept_zap before blasting the request to all
> vCPUs,
Hmm, the wait_for_sept_zap also needs to be set and unset in all vCPUs except
the current one.
> and then in tdx_vcpu_run():
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b49dcf32206b..508ad6462e6d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -921,6 +921,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> return EXIT_FASTPATH_NONE;
> }
>
> + if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
> + return EXIT_FASTPATH_EXIT_HANDLED;
> +
> trace_kvm_entry(vcpu, force_immediate_exit);
>
> if (pi_test_on(&tdx->pi_desc)) {
>
>
> Ooh, but there's a subtle flaw with that approach. Unlike kvm_vcpu_exit_request(),
> the above check would obviously happen before entry to the guest, which means that,
> in theory, KVM needs to goto cancel_injection to re-request req_immediate_exit and
> cancel injection:
>
> if (req_immediate_exit)
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> kvm_x86_call(cancel_injection)(vcpu);
>
> But! This is actually an opportunity to harden KVM. Because the TDX module
> doesn't guarantee entry, it's already possible for KVM to _think_ it completed
> entry to the guest without actually having done so. It just happens to work
> because KVM never needs to force an immediate exit for TDX, and can't do direct
> injection, i.e. it can "safely" skip the cancel_injection path.
>
> So, I think we can and should go with the above suggestion, but also add a WARN
> on req_immediate_exit being set, because TDX ignores it entirely.
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b49dcf32206b..e23cd8231144 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -914,6 +914,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> struct vcpu_vt *vt = to_tdx(vcpu);
>
> + /* <comment goes here> */
> + WARN_ON_ONCE(force_immediate_exit);
Better to put this hardening as a separate fix to
commit 37d3baf545cd ("KVM: TDX: Implement TDX vcpu enter/exit path")?
It's required no matter which approach is chosen for the SEPT SEAMCALL retry.
> /* TDX exit handle takes care of this error case. */
> if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> tdx->vp_enter_ret = TDX_SW_ERROR;
> @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> return EXIT_FASTPATH_NONE;
> }
>
> + if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> + return EXIT_FASTPATH_EXIT_HANDLED;
> +
> trace_kvm_entry(vcpu, force_immediate_exit);
>
> if (pi_test_on(&tdx->pi_desc)) {
Thanks for this suggestion.
But what's the advantage of this approach of checking wait_for_sept_zap?
Is it to avoid introducing an arch specific request?
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-12-19 1:52 ` Yan Zhao
@ 2024-12-19 2:39 ` Sean Christopherson
2024-12-19 3:03 ` Yan Zhao
0 siblings, 1 reply; 49+ messages in thread
From: Sean Christopherson @ 2024-12-19 2:39 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Thu, Dec 19, 2024, Yan Zhao wrote:
> On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> > On Wed, Dec 18, 2024, Yan Zhao wrote:
> > > > Anyways, I don't see any reason to make this an arch specific request.
> > > After making it non-arch specific, probably we need an atomic counter for the
> > > start/stop requests in the common helpers. So I just made it TDX-specific to
> > > keep it simple in the RFC.
> >
> > Oh, right. I didn't consider the complications with multiple users. Hrm.
> >
> > > Actually, this doesn't need to be a request. KVM_REQ_OUTSIDE_GUEST_MODE will
> > > force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> > > EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does. E.g.
> > > have the zap side set wait_for_sept_zap before blasting the request to all
> > > vCPUs,
> Hmm, the wait_for_sept_zap also needs to be set and unset in all vCPUs except
> the current one.
Why can't it be a VM-wide flag? The current vCPU isn't going to do VP.ENTER, is
it? If it is, I've definitely missed something :-)
> > /* TDX exit handle takes care of this error case. */
> > if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> > tdx->vp_enter_ret = TDX_SW_ERROR;
> > @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > return EXIT_FASTPATH_NONE;
> > }
> >
> > + if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> > + return EXIT_FASTPATH_EXIT_HANDLED;
> > +
> > trace_kvm_entry(vcpu, force_immediate_exit);
> >
> > if (pi_test_on(&tdx->pi_desc)) {
> Thanks for this suggestion.
> But what's the advantage of this approach of checking wait_for_sept_zap?
> Is it to avoid introducing an arch specific request?
Yes, and unless I've missed something, "releasing" vCPUs can be done by clearing
a single variable.
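Roughly, as an illustrative sketch (wait_for_sept_zap is assumed to be a new
bool in struct kvm_tdx, checked in tdx_vcpu_run() as in the diff above):

static void tdx_no_vcpus_enter_start(struct kvm *kvm)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	lockdep_assert_held_write(&kvm->mmu_lock);

	WRITE_ONCE(kvm_tdx->wait_for_sept_zap, true);

	/* Kick vCPUs out; tdx_vcpu_run() won't re-enter while the flag is set. */
	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
}

static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	lockdep_assert_held_write(&kvm->mmu_lock);

	/* "Releasing" the vCPUs is just clearing the single VM-wide flag. */
	WRITE_ONCE(kvm_tdx->wait_for_sept_zap, false);
}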
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
2024-12-19 2:39 ` Sean Christopherson
@ 2024-12-19 3:03 ` Yan Zhao
0 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2024-12-19 3:03 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Wed, Dec 18, 2024 at 06:39:29PM -0800, Sean Christopherson wrote:
> On Thu, Dec 19, 2024, Yan Zhao wrote:
> > On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> > > On Wed, Dec 18, 2024, Yan Zhao wrote:
> > > > > Anyways, I don't see any reason to make this an arch specific request.
> > > > After making it non-arch specific, probably we need an atomic counter for the
> > > > start/stop requests in the common helpers. So I just made it TDX-specific to
> > > > keep it simple in the RFC.
> > >
> > > Oh, right. I didn't consider the complications with multiple users. Hrm.
> > >
> > > Actually, this doesn't need to be a request. KVM_REQ_OUTSIDE_GUEST_MODE will
> > > force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> > > EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does. E.g.
> > > have the zap side set wait_for_sept_zap before blasting the request to all vCPUs,
> > Hmm, the wait_for_sept_zap also needs to be set and unset in all vCPUs except
> > the current one.
>
> Why can't it be a VM-wide flag? The current vCPU isn't going to do VP.ENTER, is
> it? If it is, I've definitely missed something :-)
Ah, right. You are not missing anything. I just forgot it can be a VM-wide flag....
Sorry.
>
> > > /* TDX exit handle takes care of this error case. */
> > > if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> > > tdx->vp_enter_ret = TDX_SW_ERROR;
> > > @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > > return EXIT_FASTPATH_NONE;
> > > }
> > >
> > > + if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> > > + return EXIT_FASTPATH_EXIT_HANDLED;
> > > +
> > > trace_kvm_entry(vcpu, force_immediate_exit);
> > >
> > > if (pi_test_on(&tdx->pi_desc)) {
> > Thanks for this suggestion.
> > But what's the advantage of this approach of checking wait_for_sept_zap?
> > Is it to avoid introducing an arch specific request?
>
> Yes, and unless I've missed something, "releasing" vCPUs can be done by clearing
> a single variable.
Right. Will do it this way in the next version.
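For reference, a minimal sketch of the VM-wide flag approach discussed above
(the helper names and exact placement are illustrative assumptions, not taken
from the posted patches):

/*
 * Zap side: set the VM-wide flag, then force every vCPU out of the guest.
 * The wait_for_sept_zap check already shown in tdx_vcpu_run() above then
 * makes each vCPU return EXIT_FASTPATH_EXIT_HANDLED instead of doing
 * TDH.VP.ENTER while the flag is set.
 */
static void tdx_no_vcpus_enter_start(struct kvm *kvm)
{
	WRITE_ONCE(to_kvm_tdx(kvm)->wait_for_sept_zap, true);
	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
}

static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
{
	/* "Releasing" the vCPUs is just clearing the single flag. */
	WRITE_ONCE(to_kvm_tdx(kvm)->wait_for_sept_zap, false);
}

The zap path would call the start helper before retrying the busy SEAMCALL and
the stop helper once the page removal has succeeded.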
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 22/24] KVM: TDX: Finalize VM initialization
2024-11-12 7:38 ` [PATCH v2 22/24] KVM: TDX: Finalize VM initialization Yan Zhao
@ 2024-12-24 14:31 ` Paolo Bonzini
2025-01-07 7:44 ` Yan Zhao
0 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2024-12-24 14:31 UTC (permalink / raw)
To: Yan Zhao, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
On 11/12/24 08:38, Yan Zhao wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Introduce a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
> KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
>
> The API documentation is provided in a separate patch:
> “Documentation/virt/kvm: Document on Trust Domain Extensions (TDX)”.
>
> Enhance TDX’s set_external_spte() hook to record the pre-mapping count
> instead of returning without action when the TD is not finalized.
>
> Adjust the pre-mapping count when pages are added or if the mapping is
> dropped.
>
> Set pre_fault_allowed to true after the finalization is complete.
>
> Note: TD Measurement Finalization is the process by which the initial state
> of the TDX VM is measured for attestation purposes. It uses the SEAMCALL
> TDH.MR.FINALIZE, after which:
> 1. The VMM can no longer add TD private pages with arbitrary content.
> 2. The TDX VM becomes runnable.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> TDX MMU part 2 v2
> - Merge changes from patch "KVM: TDX: Premap initial guest memory" into
> this patch (Paolo)
> - Consolidate nr_premapped counting into this patch (Paolo)
> - Page level check should be (and is) in tdx_sept_set_private_spte() in
> patch "KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror
> page table" not in tdx_mem_page_record_premap_cnt() (Paolo)
> - Protect finalization using kvm->slots_lock (Paolo)
> - Set kvm->arch.pre_fault_allowed to true after finalization is done
> (Paolo)
> - Add a memory barrier to ensure correct ordering of the updates to
> kvm_tdx->finalized and kvm->arch.pre_fault_allowed (Adrian)
> - pre_fault_allowed must not be true before finalization is done.
> Highlight that fact by checking it in tdx_mem_page_record_premap_cnt()
> (Adrian)
> - No need for is_td_finalized() (Rick)
> - Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
> wrappers (Kai)
> - Add nr_premapped where it's first used (Tao)
I have just a couple of imprecisions to note:
- stale reference to 'finalized'
- atomic64_read WARN should block the following atomic64_dec (there is still
a small race window but it's not worth using a dec-if-not-zero operation)
- rename tdx_td_finalizemr to tdx_td_finalize
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 61e4f126addd..eb0de85c3413 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -609,8 +609,8 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
get_page(pfn_to_page(pfn));
/*
- * To match ordering of 'finalized' and 'pre_fault_allowed' in
- * tdx_td_finalizemr().
+ * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
+ * barrier in tdx_td_finalize().
*/
smp_rmb();
if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
@@ -1397,7 +1397,7 @@ void tdx_flush_tlb_all(struct kvm_vcpu *vcpu)
ept_sync_global();
}
-static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1452,7 +1452,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
r = tdx_td_init(kvm, &tdx_cmd);
break;
case KVM_TDX_FINALIZE_VM:
- r = tdx_td_finalizemr(kvm, &tdx_cmd);
+ r = tdx_td_finalize(kvm, &tdx_cmd);
break;
default:
r = -EINVAL;
@@ -1715,8 +1715,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
goto out;
}
- WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
- atomic64_dec(&kvm_tdx->nr_premapped);
+ if (!WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped)))
+ atomic64_dec(&kvm_tdx->nr_premapped);
if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
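For context, the smp_rmb() in the hunk above pairs with a write-side barrier in
the finalize path. A minimal sketch of that ordering, assuming the finalize
ioctl publishes the runnable state before allowing pre-faults (illustrative
only; the authoritative code is in the patch itself):

	/* In tdx_td_finalize(), after TDH.MR.FINALIZE succeeds: */
	kvm_tdx->state = TD_STATE_RUNNABLE;
	/*
	 * Make TD_STATE_RUNNABLE visible before pre_fault_allowed, so a
	 * reader that sees pre_fault_allowed == true and then executes
	 * smp_rmb() is guaranteed to also see the RUNNABLE state.
	 */
	smp_wmb();
	kvm->arch.pre_fault_allowed = true;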
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/24] TDX MMU Part 2
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
` (25 preceding siblings ...)
2024-12-10 18:21 ` [PATCH v2 00/24] TDX MMU Part 2 Paolo Bonzini
@ 2024-12-24 14:33 ` Paolo Bonzini
26 siblings, 0 replies; 49+ messages in thread
From: Paolo Bonzini @ 2024-12-24 14:33 UTC (permalink / raw)
To: Yan Zhao, seanjc, kvm, dave.hansen
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
isaku.yamahata, nik.borisov, linux-kernel, x86
On 11/12/24 08:33, Yan Zhao wrote:
> Hi,
>
> Here is v2 of the TDX “MMU part 2” series.
> As discussed earlier, non-nit feedbacks from v1[0] have been applied.
> - Among them, patch "KVM: TDX: MTRR: implement get_mt_mask() for TDX" was
> dropped. The feature self-snoop was not made a dependency for enabling
> TDX since checking for the feature self-snoop was not included in
> kvm_mmu_may_ignore_guest_pat() in the base code. So, strickly speaking,
> current code would incorrectly zap the mirrored root if non-coherent DMA
> devices were hot-plugged.
>
> There were also a few minor issues noticed by me and fixed without internal
> discussion (noted in each patch's version log).
>
> It’s now ready to hand off to Paolo/kvm-coco-queue.
>
>
> One remaining item that requires further discussion is "How to handle
> the TDX module lock contention (i.e. SEAMCALL retry replacements)".
> The basis for future discussions includes:
> (1) TDH.MEM.TRACK can contend with TDH.VP.ENTER on the TD epoch lock.
> (2) TDH.VP.ENTER contends with TDH.MEM* on S-EPT tree lock when 0-stepping
> mitigation is triggered.
> - The threshold of zero-step mitigation is counted per-vCPU when the
> TDX module finds that EPT violations are caused by the same RIP as
> in the last TDH.VP.ENTER for 6 consecutive times.
> The threshold value 6 is explained as
> "There can be at most 2 mapping faults on instruction fetch
> (x86 macro-instructions length is at most 15 bytes) when the
> instruction crosses page boundary; then there can be at most 2
> mapping faults for each memory operand, when the operand crosses
> page boundary. For most of x86 macro-instructions, there are up to 2
> memory operands and each one of them is small, which brings us to
> maximum 2+2*2 = 6 legal mapping faults."
> - If the EPT violations received by KVM are caused by
> TDG.MEM.PAGE.ACCEPT, they will not trigger 0-stepping mitigation.
> Since a TD is required to call TDG.MEM.PAGE.ACCEPT before accessing a
> private memory when configured with pending_ve_disable=Y, 0-stepping
> mitigation is not expected to occur in such a TD.
> (3) TDG.MEM.PAGE.ACCEPT can contend with SEAMCALLs TDH.MEM*.
> (Actually, TDG.MEM.PAGE.ATTR.RD or TDG.MEM.PAGE.ATTR.WR can also
> contend with SEAMCALLs TDH.MEM*. Although we don't need to consider
> these two TDCALLs when enabling basic TDX, they are allowed by the
> TDX module, and we can't control whether a TD invokes a TDCALL or
> not).
>
> The "KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" is
> still in place in this series (at the tail), but we should drop it when we
> finalize on the real solution.
>
>
> This series has 5 commits intended to collect Acks from x86 maintainers.
> These commits introduce and export SEAMCALL wrappers to allow KVM to manage
> the S-EPT (the EPT that maps private memory and is protected by the TDX
> module):
>
> x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
> pages
> x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
> x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
> x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
> x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
> contents
Apart from the possible changes to the SEAMCALL wrappers, this is in
good shape.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 22/24] KVM: TDX: Finalize VM initialization
2024-12-24 14:31 ` Paolo Bonzini
@ 2025-01-07 7:44 ` Yan Zhao
2025-01-07 14:02 ` Paolo Bonzini
0 siblings, 1 reply; 49+ messages in thread
From: Yan Zhao @ 2025-01-07 7:44 UTC (permalink / raw)
To: Paolo Bonzini
Cc: seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Tue, Dec 24, 2024 at 03:31:19PM +0100, Paolo Bonzini wrote:
> On 11/12/24 08:38, Yan Zhao wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> >
> > Introduce a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
> > KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
> >
> > The API documentation is provided in a separate patch:
> > “Documentation/virt/kvm: Document on Trust Domain Extensions (TDX)”.
> >
> > Enhance TDX’s set_external_spte() hook to record the pre-mapping count
> > instead of returning without action when the TD is not finalized.
> >
> > Adjust the pre-mapping count when pages are added or if the mapping is
> > dropped.
> >
> > Set pre_fault_allowed to true after the finalization is complete.
> >
> > Note: TD Measurement Finalization is the process by which the initial state
> > of the TDX VM is measured for attestation purposes. It uses the SEAMCALL
> > TDH.MR.FINALIZE, after which:
> > 1. The VMM can no longer add TD private pages with arbitrary content.
> > 2. The TDX VM becomes runnable.
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
> > Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > TDX MMU part 2 v2
> > - Merge changes from patch "KVM: TDX: Premap initial guest memory" into
> > this patch (Paolo)
> > - Consolidate nr_premapped counting into this patch (Paolo)
> > - Page level check should be (and is) in tdx_sept_set_private_spte() in
> > patch "KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror
> > page table" not in tdx_mem_page_record_premap_cnt() (Paolo)
> > - Protect finalization using kvm->slots_lock (Paolo)
> > - Set kvm->arch.pre_fault_allowed to true after finalization is done
> > (Paolo)
> > - Add a memory barrier to ensure correct ordering of the updates to
> > kvm_tdx->finalized and kvm->arch.pre_fault_allowed (Adrian)
> > - pre_fault_allowed must not be true before finalization is done.
> > Highlight that fact by checking it in tdx_mem_page_record_premap_cnt()
> > (Adrian)
> > - No need for is_td_finalized() (Rick)
> > - Fixup SEAMCALL call sites due to function parameter changes to SEAMCALL
> > wrappers (Kai)
> > - Add nr_premapped where it's first used (Tao)
>
> I have just a couple of imprecisions to note:
> - stale reference to 'finalized'
Thanks for catching that!
> - atomic64_read WARN should block the following atomic64_dec (there is still
> a small race window but it's not worth using a dec-if-not-zero operation)
> - rename tdx_td_finalizemr to tdx_td_finalize
Right. I didn't notice it because of the mr in tdh_mr_finalize(). :)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 61e4f126addd..eb0de85c3413 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -609,8 +609,8 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> get_page(pfn_to_page(pfn));
> /*
> - * To match ordering of 'finalized' and 'pre_fault_allowed' in
> - * tdx_td_finalizemr().
> + * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> + * barrier in tdx_td_finalize().
> */
> smp_rmb();
> if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> @@ -1397,7 +1397,7 @@ void tdx_flush_tlb_all(struct kvm_vcpu *vcpu)
> ept_sync_global();
> }
> -static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> +static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -1452,7 +1452,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> r = tdx_td_init(kvm, &tdx_cmd);
> break;
> case KVM_TDX_FINALIZE_VM:
> - r = tdx_td_finalizemr(kvm, &tdx_cmd);
> + r = tdx_td_finalize(kvm, &tdx_cmd);
> break;
> default:
> r = -EINVAL;
> @@ -1715,8 +1715,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> goto out;
> }
> - WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
> - atomic64_dec(&kvm_tdx->nr_premapped);
> + if (!WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped)))
> + atomic64_dec(&kvm_tdx->nr_premapped);
One concern here.
If tdx_gmem_post_populate() is called when kvm_tdx->nr_premapped is 0, it will
trigger the WARN_ON here, indicating that something has gone wrong.
Should KVM refuse to start the TD in this case?
If we don't decrease kvm_tdx->nr_premapped in that case, it will remain 0,
allowing the TD to pass the check in tdx_td_finalize().
> if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 22/24] KVM: TDX: Finalize VM initialization
2025-01-07 7:44 ` Yan Zhao
@ 2025-01-07 14:02 ` Paolo Bonzini
2025-01-08 2:18 ` Yan Zhao
0 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2025-01-07 14:02 UTC (permalink / raw)
To: Yan Zhao
Cc: seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Tue, Jan 7, 2025 at 8:45 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > @@ -1715,8 +1715,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > goto out;
> > }
> > - WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
> > - atomic64_dec(&kvm_tdx->nr_premapped);
> > + if (!WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped)))
> > + atomic64_dec(&kvm_tdx->nr_premapped);
> One concern here.
> If tdx_gmem_post_populate() is called when kvm_tdx->nr_premapped is 0, it will
> trigger the WARN_ON here, indicating that something has gone wrong.
> Should KVM refuse to start the TD in this case?
>
> If we don't decrease kvm_tdx->nr_premapped in that case, it will remain 0,
> allowing the TD to pass the check in tdx_td_finalize().
Let's make it a KVM_BUG_ON then.
Paolo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 22/24] KVM: TDX: Finalize VM initialization
2025-01-07 14:02 ` Paolo Bonzini
@ 2025-01-08 2:18 ` Yan Zhao
0 siblings, 0 replies; 49+ messages in thread
From: Yan Zhao @ 2025-01-08 2:18 UTC (permalink / raw)
To: Paolo Bonzini
Cc: seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata, nik.borisov,
linux-kernel, x86
On Tue, Jan 07, 2025 at 03:02:28PM +0100, Paolo Bonzini wrote:
> On Tue, Jan 7, 2025 at 8:45 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > @@ -1715,8 +1715,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > goto out;
> > > }
> > > - WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
> > > - atomic64_dec(&kvm_tdx->nr_premapped);
> > > + if (!WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped)))
> > > + atomic64_dec(&kvm_tdx->nr_premapped);
> > One concern here.
> > If tdx_gmem_post_populate() is called when kvm_tdx->nr_premapped is 0, it will
> > trigger the WARN_ON here, indicating that something has gone wrong.
> > Should KVM refuse to start the TD in this case?
> >
> > If we don't decrease kvm_tdx->nr_premapped in that case, it will remain 0,
> > allowing the TD to pass the check in tdx_td_finalize().
>
> Let's make it a KVM_BUG_ON then.
Ok. Fair enough.
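For reference, a possible shape of that change in tdx_gmem_post_populate()
(a sketch of the agreed direction only; the error value and the 'out' label
are assumptions based on the surrounding code):

	/*
	 * A zero premap count here means KVM's bookkeeping is broken, so
	 * mark the VM as bugged and fail the ioctl instead of letting
	 * tdx_td_finalize() succeed later.
	 */
	if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
		ret = -EIO;
		goto out;
	}
	atomic64_dec(&kvm_tdx->nr_premapped);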
^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2025-01-08 2:20 UTC | newest]
Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-12 7:33 [PATCH v2 00/24] TDX MMU Part 2 Yan Zhao
2024-11-12 7:34 ` [PATCH v2 01/24] KVM: x86/mmu: Implement memslot deletion for TDX Yan Zhao
2024-11-12 7:34 ` [PATCH v2 02/24] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU Yan Zhao
2024-11-12 7:35 ` [PATCH v2 03/24] KVM: x86/mmu: Do not enable page track for TD guest Yan Zhao
2024-11-12 7:35 ` [PATCH v2 04/24] KVM: VMX: Split out guts of EPT violation to common/exposed function Yan Zhao
2024-11-12 7:35 ` [PATCH v2 05/24] KVM: VMX: Teach EPT violation helper about private mem Yan Zhao
2024-11-12 7:35 ` [PATCH v2 06/24] KVM: TDX: Add accessors VMX VMCS helpers Yan Zhao
2024-11-12 7:36 ` [PATCH v2 07/24] KVM: TDX: Add load_mmu_pgd method for TDX Yan Zhao
2024-11-12 7:36 ` [PATCH v2 08/24] KVM: TDX: Set gfn_direct_bits to shared bit Yan Zhao
2024-11-12 7:36 ` [PATCH v2 09/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT pages Yan Zhao
2024-11-12 7:36 ` [PATCH v2 10/24] x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages Yan Zhao
2024-11-12 7:36 ` [PATCH v2 11/24] x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking Yan Zhao
2024-11-12 7:36 ` [PATCH v2 12/24] x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page Yan Zhao
2024-11-12 7:37 ` [PATCH v2 13/24] x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial contents Yan Zhao
2024-11-12 7:37 ` [PATCH v2 14/24] KVM: TDX: Require TDP MMU and mmio caching for TDX Yan Zhao
2024-11-12 7:37 ` [PATCH v2 15/24] KVM: x86/mmu: Add setter for shadow_mmio_value Yan Zhao
2024-11-12 7:37 ` [PATCH v2 16/24] KVM: TDX: Set per-VM shadow_mmio_value to 0 Yan Zhao
2024-11-12 7:37 ` [PATCH v2 17/24] KVM: TDX: Handle TLB tracking for TDX Yan Zhao
2024-11-12 7:38 ` [PATCH v2 18/24] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Yan Zhao
2024-11-12 7:38 ` [PATCH v2 19/24] KVM: TDX: Implement hook to get max mapping level of private pages Yan Zhao
2024-11-12 7:38 ` [PATCH v2 20/24] KVM: x86/mmu: Export kvm_tdp_map_page() Yan Zhao
2024-11-12 7:38 ` [PATCH v2 21/24] KVM: TDX: Add an ioctl to create initial guest memory Yan Zhao
2024-11-27 18:08 ` Nikolay Borisov
2024-11-28 2:20 ` Yan Zhao
2024-11-12 7:38 ` [PATCH v2 22/24] KVM: TDX: Finalize VM initialization Yan Zhao
2024-12-24 14:31 ` Paolo Bonzini
2025-01-07 7:44 ` Yan Zhao
2025-01-07 14:02 ` Paolo Bonzini
2025-01-08 2:18 ` Yan Zhao
2024-11-12 7:38 ` [PATCH v2 23/24] KVM: TDX: Handle vCPU dissociation Yan Zhao
2024-11-12 7:39 ` [PATCH v2 24/24] [HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT Yan Zhao
2024-11-21 11:51 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Yan Zhao
2024-11-21 11:56 ` [RFC PATCH 1/2] KVM: TDX: Retry in TDX when installing TD private/sept pages Yan Zhao
2024-11-21 11:57 ` [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
2024-11-26 0:47 ` Edgecombe, Rick P
2024-11-26 6:39 ` Yan Zhao
2024-12-17 23:29 ` Sean Christopherson
2024-12-18 5:45 ` Yan Zhao
2024-12-18 16:10 ` Sean Christopherson
2024-12-19 1:52 ` Yan Zhao
2024-12-19 2:39 ` Sean Christopherson
2024-12-19 3:03 ` Yan Zhao
2024-11-26 0:46 ` [RFC PATCH 0/2] SEPT SEAMCALL retry proposal Edgecombe, Rick P
2024-11-26 6:24 ` Yan Zhao
2024-12-13 1:01 ` Yan Zhao
2024-12-17 17:00 ` Edgecombe, Rick P
2024-12-17 23:18 ` Sean Christopherson
2024-12-10 18:21 ` [PATCH v2 00/24] TDX MMU Part 2 Paolo Bonzini
2024-12-24 14:33 ` Paolo Bonzini