* [PATCH 01/18] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 02/18] KVM: TDX: Handle EPT violation/misconfig exit Binbin Wu
` (17 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest
DRs.
TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter,
and resets DRs to architectural INIT state on TD exit. Use the new
flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs. KVM still needs to restore host DRs after TD
exit if there are active breakpoints in the host, which is covered by
the existing code.
MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add
a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH
are set.
Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
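For illustration, the gating logic this patch adds to vcpu_enter_guest() can be
sketched as a standalone predicate (flag values copied from the patch; the
helper name is made up, not KVM code):

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined by this patch via BIT(). */
#define KVM_DEBUGREG_BP_ENABLED  (1u << 0)
#define KVM_DEBUGREG_WONT_EXIT   (1u << 1)
#define KVM_DEBUGREG_AUTO_SWITCH (1u << 2)

/*
 * Guest DRs are loaded before entry only if some switch_db_regs flag is
 * set AND the hardware (here, the TDX module) does not auto-switch them.
 */
static bool should_load_guest_drs(unsigned int switch_db_regs)
{
	return switch_db_regs &&
	       !(switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
}
```

Note that KVM_DEBUGREG_BP_ENABLED set together with KVM_DEBUGREG_AUTO_SWITCH
still skips the guest-side load, which is why the explicit check was added.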
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Update the comment about KVM_DEBUGREG_AUTO_SWITCH.
- Explicitly check that KVM_DEBUGREG_AUTO_SWITCH is not set in switch_db_regs
before restoring guest DRs, because KVM_DEBUGREG_BP_ENABLED could be set
by userspace. (Paolo)
https://lore.kernel.org/lkml/ea136ac6-53cf-cdc5-a741-acfb437819b1@redhat.com/
- Fix the issue that host DRs are not restored in v19 (Binbin)
https://lore.kernel.org/kvm/20240413002026.GP3039520@ls.amr.corp.intel.com/
- Update the changelog a bit.
---
arch/x86/include/asm/kvm_host.h | 11 +++++++++--
arch/x86/kvm/vmx/tdx.c | 1 +
arch/x86/kvm/x86.c | 4 +++-
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index df535f08e004..a0814079777f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -604,8 +604,15 @@ struct kvm_pmu {
struct kvm_pmu_ops;
enum {
- KVM_DEBUGREG_BP_ENABLED = 1,
- KVM_DEBUGREG_WONT_EXIT = 2,
+ KVM_DEBUGREG_BP_ENABLED = BIT(0),
+ KVM_DEBUGREG_WONT_EXIT = BIT(1),
+ /*
* Guest debug registers (DR0-3, DR6 and DR7) are saved/restored by
* hardware on TD entry/exit, so KVM doesn't need to switch them.
* DR0-3, DR6 and DR7 are reset to their architectural INIT value on
* TD exit; host values need to be restored.
+ */
+ KVM_DEBUGREG_AUTO_SWITCH = BIT(2),
};
struct kvm_mtrr {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3cf8a4e1fc29..b87daa643e6e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -727,6 +727,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+ vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
vcpu->arch.cr0_guest_owned_bits = -1ul;
vcpu->arch.cr4_guest_owned_bits = -1ul;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e155ae90e9fa..2b4bd56e9fb4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10965,7 +10965,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_fpu.xfd_err)
wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
- if (unlikely(vcpu->arch.switch_db_regs)) {
+ if (unlikely(vcpu->arch.switch_db_regs &&
+ !(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);
set_debugreg(vcpu->arch.eff_db[1], 1);
@@ -11012,6 +11013,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
+ WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
kvm_x86_call(sync_dirty_debug_regs)(vcpu);
kvm_update_dr0123(vcpu);
kvm_update_dr7(vcpu);
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread

* [PATCH 02/18] KVM: TDX: Handle EPT violation/misconfig exit
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
2024-12-10 0:49 ` [PATCH 01/18] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 03/18] KVM: TDX: Detect unexpected SEPT violations due to pending SPTEs Binbin Wu
` (16 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
For TDX, on EPT violation, call common __vmx_handle_ept_violation() to
trigger x86 MMU code; on EPT misconfiguration, bug the VM since it
shouldn't happen.
An EPT violation due to an instruction fetch should never be triggered
from shared memory in a TDX guest. If such an EPT violation occurs,
treat it as broken hardware.
EPT misconfiguration shouldn't happen on either shared or secure EPT for
TDX guests.
- TDX module guarantees no EPT misconfiguration on secure EPT. Per TDX
module v1.5 spec section 9.4 "Secure EPT Induced TD Exits":
"By design, since secure EPT is fully controlled by the TDX module, an
EPT misconfiguration on a private GPA indicates a TDX module bug and is
handled as a fatal error."
- For shared EPT, the MMIO caching optimization, which is the only case
where current KVM configures EPT entries to generate EPT misconfiguration,
is implemented in a different way for TDX guests. KVM configures EPT
entries to a non-present value without the suppress #VE bit set. This
causes a #VE in the TDX guest, and the guest then calls TDG.VP.VMCALL
to request MMIO emulation.
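As a rough sketch of the exit-qualification selection described above
(hypothetical helper, bit values taken from asm/vmx.h; in KVM this lives in
tdx_handle_ept_violation()):

```c
#include <assert.h>
#include <stdbool.h>

#define EPT_VIOLATION_ACC_WRITE (1ul << 1)
#define EPT_VIOLATION_ACC_INSTR (1ul << 2)

/*
 * Private GPA: synthesize a write fault regardless of what hardware
 * reported, to avoid COW aliasing of a single HPA. Shared GPA: use the
 * reported qualification, but an instruction fetch from shared memory
 * is treated as broken hardware.
 */
static int pick_ept_exit_qual(bool private_gpa, unsigned long reported,
			      unsigned long *exit_qual)
{
	if (private_gpa) {
		*exit_qual = EPT_VIOLATION_ACC_WRITE;
		return 0;
	}
	if (reported & EPT_VIOLATION_ACC_INSTR)
		return -1;	/* caller would KVM_BUG_ON() and return -EIO */
	*exit_qual = reported;
	return 0;
}
```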
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Renamed from "KVM: TDX: Handle ept violation/misconfig exit" to
"KVM: TDX: Handle EPT violation/misconfig exit" (Reinette)
- Removed WARN_ON_ONCE(1) in tdx_handle_ept_misconfig(). (Rick)
- Add comment above EPT_VIOLATION_ACC_INSTR check. (Chao)
https://lore.kernel.org/lkml/Zgoz0sizgEZhnQ98@chao-email/
https://lore.kernel.org/lkml/ZjiE+O9fct5zI4Sf@chao-email/
- Remove unnecessary define of TDX_SEPT_VIOLATION_EXIT_QUAL. (Sean)
- Replace pr_warn() and KVM_EXIT_EXCEPTION with KVM_BUG_ON(). (Sean)
- KVM_BUG_ON() for EPT misconfig. (Sean)
- Rework changelog.
v14 -> v15:
- use PFERR_GUEST_ENC_MASK to tell the fault is private
---
arch/x86/kvm/vmx/tdx.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b87daa643e6e..aecf52dda00d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1770,6 +1770,36 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}
+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
+{
+ unsigned long exit_qual;
+
+ if (vt_is_tdx_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
+ /*
+ * Always treat SEPT violations as write faults. Ignore the
+ * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
+ * TD private pages are always RWX in the SEPT tables,
+ * i.e. they're always mapped writable. Just as importantly,
+ * treating SEPT violations as write faults is necessary to
+ * avoid COW allocations, which will cause TDAUGPAGE failures
+ * due to aliasing a single HPA to multiple GPAs.
+ */
+ exit_qual = EPT_VIOLATION_ACC_WRITE;
+ } else {
+ exit_qual = tdexit_exit_qual(vcpu);
+ /*
+ * EPT violation due to instruction fetch should never be
+ * triggered from shared memory in TDX guest. If such EPT
+ * violation occurs, treat it as broken hardware.
+ */
+ if (KVM_BUG_ON(exit_qual & EPT_VIOLATION_ACC_INSTR, vcpu->kvm))
+ return -EIO;
+ }
+
+ trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
+ return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+}
+
int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1814,6 +1844,11 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return tdx_handle_external_interrupt(vcpu);
case EXIT_REASON_TDCALL:
return handle_tdvmcall(vcpu);
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_handle_ept_violation(vcpu);
+ case EXIT_REASON_EPT_MISCONFIG:
+ KVM_BUG_ON(1, vcpu->kvm);
+ return -EIO;
case EXIT_REASON_OTHER_SMI:
/*
* Unlike VMX, SMI in SEAM non-root mode (i.e. when
--
2.46.0
* [PATCH 03/18] KVM: TDX: Detect unexpected SEPT violations due to pending SPTEs
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
2024-12-10 0:49 ` [PATCH 01/18] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior Binbin Wu
2024-12-10 0:49 ` [PATCH 02/18] KVM: TDX: Handle EPT violation/misconfig exit Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 04/18] KVM: TDX: Handle TDX PV CPUID hypercall Binbin Wu
` (15 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Yan Zhao <yan.y.zhao@intel.com>
Detect SEPT violations that occur when an SEPT entry is in PENDING state
while the TD is configured not to receive #VE on SEPT violations.
A TD guest can be configured not to receive #VE by setting SEPT_VE_DISABLE
to 1 in tdh_mng_init() or modifying pending_ve_disable to 1 in TDCS when
flexible_pending_ve is permitted. In such cases, the TDX module will not
inject #VE into the TD upon encountering an EPT violation caused by an SEPT
entry in the PENDING state. Instead, the TDX module will exit to the VMM
and set the extended exit qualification type to PENDING_EPT_VIOLATION and
exit qualification bits 6:3 to 0.
Since #VE will not be injected into such TDs, they cannot be notified
to accept a GPA. A TD accessing a private GPA before accepting it is
regarded as an error within the guest.
Detect such guest errors by inspecting the (extended) exit qualification
bits and mark the VM as dead.
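The detection condition can be sketched as a standalone predicate (constants
copied from the patch's tdx_arch.h and vmx.h additions; the function mirrors
the patch's tdx_is_sept_violation_unexpected_pending()):

```c
#include <assert.h>
#include <stdbool.h>

#define TDX_EXT_EXIT_QUAL_TYPE_MASK			0xfull
#define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION	6
#define EPT_VIOLATION_RWX_MASK				(0x7ull << 3)
#define EPT_VIOLATION_EXEC_FOR_RING3_LIN		(1ull << 6)

/*
 * A SEPT violation is an unexpected "pending" one when the extended
 * exit qualification type is PENDING_EPT_VIOLATION and exit
 * qualification bits 6:3 (RWX plus ring-3 linear execute) are all 0.
 */
static bool sept_violation_unexpected_pending(unsigned long long ext_qual,
					      unsigned long long qual)
{
	if ((ext_qual & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
	    TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION)
		return false;

	return !(qual & EPT_VIOLATION_RWX_MASK) &&
	       !(qual & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
}
```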
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- New patch
---
arch/x86/include/asm/vmx.h | 2 ++
arch/x86/kvm/vmx/tdx.c | 20 +++++++++++++++++++-
arch/x86/kvm/vmx/tdx_arch.h | 2 ++
3 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 9298fb9d4bb3..028f3b8db2af 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -585,12 +585,14 @@ enum vm_entry_failure_code {
#define EPT_VIOLATION_ACC_WRITE_BIT 1
#define EPT_VIOLATION_ACC_INSTR_BIT 2
#define EPT_VIOLATION_RWX_SHIFT 3
+#define EPT_VIOLATION_EXEC_R3_LIN_BIT 6
#define EPT_VIOLATION_GVA_IS_VALID_BIT 7
#define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
#define EPT_VIOLATION_ACC_READ (1 << EPT_VIOLATION_ACC_READ_BIT)
#define EPT_VIOLATION_ACC_WRITE (1 << EPT_VIOLATION_ACC_WRITE_BIT)
#define EPT_VIOLATION_ACC_INSTR (1 << EPT_VIOLATION_ACC_INSTR_BIT)
#define EPT_VIOLATION_RWX_MASK (VMX_EPT_RWX_MASK << EPT_VIOLATION_RWX_SHIFT)
+#define EPT_VIOLATION_EXEC_FOR_RING3_LIN (1 << EPT_VIOLATION_EXEC_R3_LIN_BIT)
#define EPT_VIOLATION_GVA_IS_VALID (1 << EPT_VIOLATION_GVA_IS_VALID_BIT)
#define EPT_VIOLATION_GVA_TRANSLATED (1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index aecf52dda00d..96b05e445837 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1770,11 +1770,29 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}
+static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcpu)
+{
+ u64 eeq_type = tdexit_ext_exit_qual(vcpu) & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+ u64 eq = tdexit_exit_qual(vcpu);
+
+ if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION)
+ return false;
+
+ return !(eq & EPT_VIOLATION_RWX_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
+}
+
static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
+ gpa_t gpa = tdexit_gpa(vcpu);
unsigned long exit_qual;
- if (vt_is_tdx_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
+ if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
+ if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
+ pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
+ gpa, vcpu->vcpu_id);
+ kvm_vm_dead(vcpu->kvm);
+ return -EIO;
+ }
/*
* Always treat SEPT violations as write faults. Ignore the
* EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 289728f1611f..2f9e88f497bc 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -104,6 +104,8 @@ struct tdx_cpuid_value {
#define TDX_TD_ATTR_KL BIT_ULL(31)
#define TDX_TD_ATTR_PERFMON BIT_ULL(63)
+#define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
+#define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
/*
* TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
*/
--
2.46.0
* [PATCH 04/18] KVM: TDX: Handle TDX PV CPUID hypercall
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (2 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 03/18] KVM: TDX: Detect unexpected SEPT violations due to pending SPTEs Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 05/18] KVM: TDX: Handle TDX PV HLT hypercall Binbin Wu
` (14 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Handle the TDX PV CPUID hypercall for the CPUID leaves virtualized by the
VMM according to the TDX Guest-Host Communication Interface (GHCI).
For TDX, most CPUID leaf/sub-leaf combinations are virtualized by
the TDX module, while some trigger #VE. On #VE, the TDX guest can issue
TDG.VP.VMCALL<INSTRUCTION.CPUID> (which uses the same value as
EXIT_REASON_CPUID) to request that the VMM emulate the CPUID operation.
Wire up TDX PV CPUID hypercall to the KVM backend function.
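The register marshalling follows the GHCI convention: leaf in R12 (a0),
sub-leaf in R13 (a1), results returned in R12-R15 (a0-a3). A rough host-side
sketch, where the struct names and the callback are hypothetical stand-ins for
KVM's tdvmcall accessors and kvm_cpuid():

```c
#include <assert.h>

struct tdvmcall_args {
	unsigned int a0, a1, a2, a3;	/* R12..R15 */
};

struct cpuid_out {
	unsigned int eax, ebx, ecx, edx;
};

/* cpuid_fn stands in for kvm_cpuid(); it emulates CPUID(leaf, subleaf). */
static void pv_cpuid(struct tdvmcall_args *args,
		     struct cpuid_out (*cpuid_fn)(unsigned int, unsigned int))
{
	struct cpuid_out out = cpuid_fn(args->a0, args->a1);

	/* Results go back in a0..a3 in EAX/EBX/ECX/EDX order. */
	args->a0 = out.eax;
	args->a1 = out.ebx;
	args->a2 = out.ecx;
	args->a3 = out.edx;
}

/* Toy emulator for testing: returns recognizable values. */
static struct cpuid_out toy_cpuid(unsigned int leaf, unsigned int subleaf)
{
	struct cpuid_out out = { leaf + 1, leaf + 2, subleaf + 3, subleaf + 4 };
	return out;
}
```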
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rewrite changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Rewrite changelog.
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
---
arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 96b05e445837..62dbb47ead21 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1274,6 +1274,26 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
return 0;
}
+static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+ u32 eax, ebx, ecx, edx;
+
+ /* EAX and ECX for cpuid is stored in R12 and R13. */
+ eax = tdvmcall_a0_read(vcpu);
+ ecx = tdvmcall_a1_read(vcpu);
+
+ kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);
+
+ tdvmcall_a0_write(vcpu, eax);
+ tdvmcall_a1_write(vcpu, ebx);
+ tdvmcall_a2_write(vcpu, ecx);
+ tdvmcall_a3_write(vcpu, edx);
+
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
+
+ return 1;
+}
+
static int tdx_complete_pio_out(struct kvm_vcpu *vcpu)
{
vcpu->arch.pio.count = 0;
@@ -1455,6 +1475,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_map_gpa(vcpu);
case TDVMCALL_REPORT_FATAL_ERROR:
return tdx_report_fatal_error(vcpu);
+ case EXIT_REASON_CPUID:
+ return tdx_emulate_cpuid(vcpu);
case EXIT_REASON_IO_INSTRUCTION:
return tdx_emulate_io(vcpu);
case EXIT_REASON_EPT_VIOLATION:
--
2.46.0
* [PATCH 05/18] KVM: TDX: Handle TDX PV HLT hypercall
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (3 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 04/18] KVM: TDX: Handle TDX PV CPUID hypercall Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 06/18] KVM: x86: Move KVM_MAX_MCE_BANKS to header file Binbin Wu
` (13 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Handle the TDX PV HLT hypercall and the interrupt status due to it.
Since TDX guest state is protected, KVM can't get the interrupt status
of a TDX guest, so it assumes interrupts are always allowed unless the
TDX guest calls TDVMCALL with HLT, which passes the interrupt blocked
flag.
Update vt_interrupt_allowed() for TDX based on the interrupt blocked
flag passed by the HLT TDVMCALL. Do not wake up a TD vCPU for VT-d PI
if interrupts are blocked.
For NMIs, KVM cannot determine the NMI blocking status for TDX guests,
so KVM always assumes NMIs are not blocked. In the unlikely scenario
where a guest invokes the PV HLT hypercall within an NMI handler, this
could result in a spurious wakeup. The guest should implement the PV
HLT hypercall within a loop if it truly requires no interruptions, since
NMI could be unblocked by an IRET due to an exception occurring before
the PV HLT is executed in the NMI handler.
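A minimal sketch of the resulting decisions, under the assumption stated above
(hypothetical helpers; in KVM this logic is split across
tdx_interrupt_allowed() and tdx_protected_apic_has_interrupt()):

```c
#include <assert.h>
#include <stdbool.h>

#define TDX_VCPU_STATE_DETAILS_INTR_PENDING (1ull << 0)

/*
 * Interrupts are assumed allowed unless the vCPU's last exit was a
 * TDVMCALL(HLT) with the interrupt-blocked flag (a0/R12) set.
 */
static bool tdx_intr_allowed(bool exited_on_hlt_tdvmcall, bool intr_blocked)
{
	if (!exited_on_hlt_tdvmcall)
		return true;
	return !intr_blocked;
}

/*
 * An interrupt is pending if the posted-interrupt descriptor says so,
 * or the TDX module reports it via TD_VCPU_STATE_DETAILS.
 */
static bool tdx_has_pending_intr(bool pi_pending,
				 unsigned long long vcpu_state_details)
{
	return pi_pending ||
	       (vcpu_state_details & TDX_VCPU_STATE_DETAILS_INTR_PENDING);
}
```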
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Update the changelog.
- Remove the interrupt_disabled_hlt field (Sean)
https://lore.kernel.org/kvm/Zg1seIaTmM94IyR8@google.com/
- Move the logic of interrupt status to vt_interrupt_allowed() (Chao)
https://lore.kernel.org/kvm/ZhIX7K0WK+gYtcan@chao-email/
- Add suggested-by tag.
- Use tdx_check_exit_reason()
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
v19:
- move tdvps_state_non_arch_check() to this patch
v18:
- drop buggy_hlt_workaround and use TDH.VP.RD(TD_VCPU_STATE_DETAILS)
---
arch/x86/kvm/vmx/main.c | 2 +-
arch/x86/kvm/vmx/posted_intr.c | 3 ++-
arch/x86/kvm/vmx/tdx.c | 32 +++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 6 ++++++
arch/x86/kvm/vmx/tdx_arch.h | 11 +++++++++++
5 files changed, 51 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 305425b19cb5..bfe848083eb9 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -424,7 +424,7 @@ static void vt_cancel_injection(struct kvm_vcpu *vcpu)
static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (is_td_vcpu(vcpu))
- return true;
+ return tdx_interrupt_allowed(vcpu);
return vmx_interrupt_allowed(vcpu, for_injection);
}
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 87a6964c662a..1ce9b9e93a26 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -223,7 +223,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
return;
if (kvm_vcpu_is_blocking(vcpu) &&
- (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
+ ((is_td_vcpu(vcpu) && tdx_interrupt_allowed(vcpu)) ||
+ (!is_td_vcpu(vcpu) && !vmx_interrupt_blocked(vcpu))))
pi_enable_wakeup_handler(vcpu);
/*
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 62dbb47ead21..2b64652a0d05 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -771,9 +771,31 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
local_irq_enable();
}
+bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu)
+{
+ /*
* KVM can't get the interrupt status of a TDX guest, so it assumes
* interrupts are always allowed unless the TDX guest calls TDVMCALL
* with HLT, which passes the interrupt blocked flag.
+ */
+ if (!tdx_check_exit_reason(vcpu, EXIT_REASON_TDCALL) ||
+ tdvmcall_exit_type(vcpu) || tdvmcall_leaf(vcpu) != EXIT_REASON_HLT)
+ return true;
+
+ return !tdvmcall_a0_read(vcpu);
+}
+
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
- return pi_has_pending_interrupt(vcpu);
+ u64 vcpu_state_details;
+
+ if (pi_has_pending_interrupt(vcpu))
+ return true;
+
+ vcpu_state_details =
+ td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
+
+ return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
}
/*
@@ -1294,6 +1316,12 @@ static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
return 1;
}
+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
+ return kvm_emulate_halt_noskip(vcpu);
+}
+
static int tdx_complete_pio_out(struct kvm_vcpu *vcpu)
{
vcpu->arch.pio.count = 0;
@@ -1477,6 +1505,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_report_fatal_error(vcpu);
case EXIT_REASON_CPUID:
return tdx_emulate_cpuid(vcpu);
+ case EXIT_REASON_HLT:
+ return tdx_emulate_hlt(vcpu);
case EXIT_REASON_IO_INSTRUCTION:
return tdx_emulate_io(vcpu);
case EXIT_REASON_EPT_VIOLATION:
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index b553dd9b0b06..008180c0c30f 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -152,6 +152,7 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
}
static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {}
#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
@@ -199,11 +200,15 @@ static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx, \
tdh_vp_wr_failed(tdx, #uclass, " &= ~", field, bit, err);\
}
+
+bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu);
+
TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch);
#else
static inline void tdx_bringup(void) {}
@@ -223,6 +228,7 @@ static inline bool is_td(struct kvm *kvm) { return false; }
static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+static inline bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu) { return false; }
#endif
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 2f9e88f497bc..861c0f649b69 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -71,6 +71,17 @@ enum tdx_tdcs_execution_control {
TD_TDCS_EXEC_TSC_OFFSET = 10,
};
+enum tdx_vcpu_guest_other_state {
+ TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
+};
+
+#define TDX_VCPU_STATE_DETAILS_INTR_PENDING BIT_ULL(0)
+
+static inline bool tdx_vcpu_state_details_intr_pending(u64 vcpu_state_details)
+{
+ return !!(vcpu_state_details & TDX_VCPU_STATE_DETAILS_INTR_PENDING);
+}
+
/* @field is any of enum tdx_tdcs_execution_control */
#define TDCS_EXEC(field) BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field))
--
2.46.0
* [PATCH 06/18] KVM: x86: Move KVM_MAX_MCE_BANKS to header file
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (4 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 05/18] KVM: TDX: Handle TDX PV HLT hypercall Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 07/18] KVM: TDX: Implement callbacks for MSR operations Binbin Wu
` (12 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Move KVM_MAX_MCE_BANKS to a header file so that it can be used for TDX
in a future patch.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- New patch. Split from "KVM: TDX: Implement callbacks for MSR operations for TDX". (Sean)
https://lore.kernel.org/kvm/Zg1yPIV6cVJrwGxX@google.com/
---
arch/x86/kvm/x86.c | 1 -
arch/x86/kvm/x86.h | 2 ++
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2b4bd56e9fb4..5eacdb5b9737 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -90,7 +90,6 @@
#include "trace.h"
#define MAX_IO_MSRS 256
-#define KVM_MAX_MCE_BANKS 32
/*
* Note, kvm_caps fields should *never* have default values, all fields must be
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d69074a05779..0a1946368439 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -9,6 +9,8 @@
#include "kvm_cache_regs.h"
#include "kvm_emulate.h"
+#define KVM_MAX_MCE_BANKS 32
+
struct kvm_caps {
/* control of guest tsc rate supported? */
bool has_tsc_control;
--
2.46.0
* [PATCH 07/18] KVM: TDX: Implement callbacks for MSR operations
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (5 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 06/18] KVM: x86: Move KVM_MAX_MCE_BANKS to header file Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2025-01-15 11:26 ` Huang, Kai
2024-12-10 0:49 ` [PATCH 08/18] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall Binbin Wu
` (11 subsequent siblings)
18 siblings, 1 reply; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add functions to implement MSR related callbacks, .set_msr(), .get_msr(),
and .has_emulated_msr(), for preparation of handling hypercalls from TDX
guest for para-virtualized RDMSR and WRMSR. Ignore KVM_REQ_MSR_FILTER_CHANGED
for TDX.
There are three classes of MSR virtualization for TDX:
- Non-configurable: The TDX module directly virtualizes the MSR. The VMM
can't configure it; the value set by KVM_SET_MSRS is ignored.
- Configurable: The TDX module directly virtualizes the MSR. The VMM can
configure it at VM creation time; the value set by KVM_SET_MSRS is used.
- #VE case: The TDX guest issues TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>
and the VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used.
For the MSRs belonging to the #VE case, the TDX module injects a #VE into
the TDX guest upon RDMSR or WRMSR. The exact list of such MSRs is defined
in the TDX module ABI spec.
Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>,
which are defined in GHCI (Guest-Host Communication Interface) so that the
host VMM (e.g. KVM) can virtualize the MSRs.
TDX doesn't allow VMM to configure interception of MSR accesses. Ignore
KVM_REQ_MSR_FILTER_CHANGED for TDX guest. If the userspace has set any
MSR filters, it will be applied when handling
TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch.
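The resulting read/write decision order can be sketched with a toy classifier.
The MSR indices below are real, but the emulated set is deliberately truncated
for illustration; the real tdx_has_emulated_msr() covers many more MSRs:

```c
#include <assert.h>
#include <stdbool.h>

#define MSR_IA32_UCODE_REV	0x0000008b
#define MSR_IA32_APICBASE	0x0000001b
#define MSR_EFER		0xc0000080

/* MSRs that a TDX guest may read but whose writes must fail. */
static bool tdx_is_read_only_msr(unsigned int index)
{
	return index == MSR_IA32_APICBASE || index == MSR_EFER;
}

/* Truncated stand-in for tdx_has_emulated_msr(). */
static bool tdx_has_emulated_msr(unsigned int index)
{
	switch (index) {
	case MSR_IA32_UCODE_REV:
	case MSR_IA32_APICBASE:
	case MSR_EFER:
		return true;
	default:
		return false;
	}
}

/* KVM MSR handler convention: 0 = handled, 1 = fault the access. */
static int tdx_set_msr_check(unsigned int index)
{
	if (tdx_is_read_only_msr(index))
		return 1;
	if (!tdx_has_emulated_msr(index))
		return 1;
	return 0;	/* proceed to kvm_set_msr_common() */
}

static int tdx_get_msr_check(unsigned int index)
{
	return tdx_has_emulated_msr(index) ? 0 : 1;
}
```

A read-only MSR such as MSR_EFER is readable but rejects writes, matching the
patch's "check read-only first" ordering in tdx_set_msr().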
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX "the rest" breakout:
- Renamed from "KVM: TDX: Implement callbacks for MSR operations for TDX"
to "KVM: TDX: Implement callbacks for MSR operations"
- Update changelog.
- Remove the @write parameter of tdx_has_emulated_msr(), check whether the
MSR is readonly or not in tdx_set_msr(). (Sean)
https://lore.kernel.org/kvm/Zg1yPIV6cVJrwGxX@google.com/
- Change the code align with the patten "make the happy path the
not-taken path". (Sean)
- Open code the handling for an emulated KVM MSR MSR_KVM_POLL_CONTROL,
and let others go through the default statement. (Sean)
- Split macro KVM_MAX_MCE_BANKS move to a separate patch. (Sean)
- Add comments in vt_msr_filter_changed().
- Add Suggested-by tag.
---
arch/x86/kvm/vmx/main.c | 50 +++++++++++++++++++++++++---
arch/x86/kvm/vmx/tdx.c | 67 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 6 ++++
3 files changed, 119 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index bfe848083eb9..a224e9d32701 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -191,6 +191,48 @@ static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
vmx_handle_exit_irqoff(vcpu);
}
+static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_set_msr(vcpu, msr_info);
+
+ return vmx_set_msr(vcpu, msr_info);
+}
+
+/*
+ * The kvm parameter can be NULL (module initialization, or invocation before
+ * VM creation). Be sure to check the kvm parameter before using it.
+ */
+static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
+{
+ if (kvm && is_td(kvm))
+ return tdx_has_emulated_msr(index);
+
+ return vmx_has_emulated_msr(kvm, index);
+}
+
+static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_get_msr(vcpu, msr_info);
+
+ return vmx_get_msr(vcpu, msr_info);
+}
+
+static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
+{
+ /*
+ * TDX doesn't allow VMM to configure interception of MSR accesses.
+ * TDX guest requests MSR accesses by calling TDVMCALL. The MSR
+ * filters will be applied when handling the TDVMCALL for RDMSR/WRMSR
+ * if the userspace has set any.
+ */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_msr_filter_changed(vcpu);
+}
+
#ifdef CONFIG_KVM_SMM
static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
@@ -510,7 +552,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.disable_virtualization_cpu = vt_disable_virtualization_cpu,
.emergency_disable_virtualization_cpu = vmx_emergency_disable_virtualization_cpu,
- .has_emulated_msr = vmx_has_emulated_msr,
+ .has_emulated_msr = vt_has_emulated_msr,
.vm_size = sizeof(struct kvm_vmx),
@@ -529,8 +571,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.update_exception_bitmap = vmx_update_exception_bitmap,
.get_feature_msr = vmx_get_feature_msr,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
+ .get_msr = vt_get_msr,
+ .set_msr = vt_set_msr,
.get_segment_base = vmx_get_segment_base,
.get_segment = vmx_get_segment,
.set_segment = vmx_set_segment,
@@ -637,7 +679,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.apic_init_signal_blocked = vt_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,
- .msr_filter_changed = vmx_msr_filter_changed,
+ .msr_filter_changed = vt_msr_filter_changed,
.complete_emulated_msr = kvm_complete_insn_gp,
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2b64652a0d05..770e3b847cd6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1984,6 +1984,73 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
*error_code = 0;
}
+bool tdx_has_emulated_msr(u32 index)
+{
+ switch (index) {
+ case MSR_IA32_UCODE_REV:
+ case MSR_IA32_ARCH_CAPABILITIES:
+ case MSR_IA32_POWER_CTL:
+ case MSR_IA32_CR_PAT:
+ case MSR_IA32_TSC_DEADLINE:
+ case MSR_IA32_MISC_ENABLE:
+ case MSR_PLATFORM_INFO:
+ case MSR_MISC_FEATURES_ENABLES:
+ case MSR_IA32_APICBASE:
+ case MSR_EFER:
+ case MSR_IA32_MCG_CAP:
+ case MSR_IA32_MCG_STATUS:
+ case MSR_IA32_MCG_CTL:
+ case MSR_IA32_MCG_EXT_CTL:
+ case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
+ case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
+ /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
+ case MSR_KVM_POLL_CONTROL:
+ return true;
+ case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
+ /*
+ * x2APIC registers that are virtualized by the CPU can't be
+ * emulated, KVM doesn't have access to the virtual APIC page.
+ */
+ switch (index) {
+ case X2APIC_MSR(APIC_TASKPRI):
+ case X2APIC_MSR(APIC_PROCPRI):
+ case X2APIC_MSR(APIC_EOI):
+ case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
+ case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
+ case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
+ return false;
+ default:
+ return true;
+ }
+ default:
+ return false;
+ }
+}
+
+static bool tdx_is_read_only_msr(u32 index)
+{
+ return index == MSR_IA32_APICBASE || index == MSR_EFER;
+}
+
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ if (!tdx_has_emulated_msr(msr->index))
+ return 1;
+
+ return kvm_get_msr_common(vcpu, msr);
+}
+
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ if (tdx_is_read_only_msr(msr->index))
+ return 1;
+
+ if (!tdx_has_emulated_msr(msr->index))
+ return 1;
+
+ return kvm_set_msr_common(vcpu, msr);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 28dbef77700c..7fb1bbf12b39 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -142,6 +142,9 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
void tdx_inject_nmi(struct kvm_vcpu *vcpu);
void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+bool tdx_has_emulated_msr(u32 index);
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -186,6 +189,9 @@ static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mo
static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
u64 *info2, u32 *intr_info, u32 *error_code) {}
+static inline bool tdx_has_emulated_msr(u32 index) { return false; }
+static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 07/18] KVM: TDX: Implement callbacks for MSR operations
2024-12-10 0:49 ` [PATCH 07/18] KVM: TDX: Implement callbacks for MSR operations Binbin Wu
@ 2025-01-15 11:26 ` Huang, Kai
0 siblings, 0 replies; 26+ messages in thread
From: Huang, Kai @ 2025-01-15 11:26 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
binbin.wu@linux.intel.com
Cc: Gao, Chao, Edgecombe, Rick P, Li, Xiaoyao, Chatre, Reinette,
Zhao, Yan Y, Hunter, Adrian, tony.lindgren@linux.intel.com,
Yamahata, Isaku, linux-kernel@vger.kernel.org
On Tue, 2024-12-10 at 08:49 +0800, Binbin Wu wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Add functions to implement MSR related callbacks, .set_msr(), .get_msr(),
> and .has_emulated_msr(), for preparation of handling hypercalls from TDX
> guest for para-virtualized RDMSR and WRMSR. Ignore KVM_REQ_MSR_FILTER_CHANGED
> for TDX.
>
> There are three classes of MSRs virtualization for TDX.
^
MSR
> - Non-configurable: TDX module directly virtualizes it. VMM can't configure
> it, the value set by KVM_SET_MSRS is ignored.
> - Configurable: TDX module directly virtualizes it. VMM can configure at the
> VM creation time. The value set by KVM_SET_MSRS is used.
> - #VE case: TDX guest would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>
> and VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used.
>
> For the MSRs belonging to the #VE case, the TDX module injects #VE to the
> TDX guest upon RDMSR or WRMSR. The exact list of such MSRs are defined in
^
is
> TDX Module ABI Spec.
>
* [PATCH 08/18] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (6 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 07/18] KVM: TDX: Implement callbacks for MSR operations Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 09/18] KVM: TDX: Enable guest access to LMCE related MSRs Binbin Wu
` (10 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Wire up TDX PV rdmsr/wrmsr hypercall to the KVM backend function.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX "the rest" breakout:
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
---
arch/x86/kvm/vmx/tdx.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 770e3b847cd6..d5343e2ba509 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1493,6 +1493,41 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
return 1;
}
+static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_a0_read(vcpu);
+ u64 data;
+
+ if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ) ||
+ kvm_get_msr(vcpu, index, &data)) {
+ trace_kvm_msr_read_ex(index);
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+ return 1;
+ }
+ trace_kvm_msr_read(index, data);
+
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
+ tdvmcall_set_return_val(vcpu, data);
+ return 1;
+}
+
+static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_a0_read(vcpu);
+ u64 data = tdvmcall_a1_read(vcpu);
+
+ if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE) ||
+ kvm_set_msr(vcpu, index, data)) {
+ trace_kvm_msr_write_ex(index, data);
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+ return 1;
+ }
+
+ trace_kvm_msr_write(index, data);
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1511,6 +1546,10 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_io(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_emulate_mmio(vcpu);
+ case EXIT_REASON_MSR_READ:
+ return tdx_emulate_rdmsr(vcpu);
+ case EXIT_REASON_MSR_WRITE:
+ return tdx_emulate_wrmsr(vcpu);
default:
break;
}
--
2.46.0
* [PATCH 09/18] KVM: TDX: Enable guest access to LMCE related MSRs
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (7 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 08/18] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 10/18] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall Binbin Wu
` (9 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Allow a TDX guest to configure LMCE (Local Machine Check Exception) by
handling MSRs IA32_FEAT_CTL and IA32_MCG_EXT_CTL.
MCE and MCA are advertised via CPUID per the TDX module spec. The guest
kernel can read IA32_FEAT_CTL to check whether the platform has opted in
to LMCE; if it has, the guest kernel can use IA32_MCG_EXT_CTL to
enable/disable LMCE.
Handle MSRs IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid
failures when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE, e.g.,
a Linux guest will treat such a failure as a #GP(0).
A userspace VMM may not opt in to LMCE by default; e.g., QEMU disables it
unless "-cpu lmce=on" is passed on its command line.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Renamed from "KVM: TDX: Handle MSR IA32_FEAT_CTL MSR and IA32_MCG_EXT_CTL"
to "KVM: TDX: Enable guest access to LMCE related MSRs".
- Update changelog.
- Check that reserved bits are not set when setting MSR_IA32_MCG_EXT_CTL.
---
arch/x86/kvm/vmx/tdx.c | 46 +++++++++++++++++++++++++++++++++---------
1 file changed, 37 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d5343e2ba509..b5aae9d784f7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2036,6 +2036,7 @@ bool tdx_has_emulated_msr(u32 index)
case MSR_MISC_FEATURES_ENABLES:
case MSR_IA32_APICBASE:
case MSR_EFER:
+ case MSR_IA32_FEAT_CTL:
case MSR_IA32_MCG_CAP:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MCG_CTL:
@@ -2068,26 +2069,53 @@ bool tdx_has_emulated_msr(u32 index)
static bool tdx_is_read_only_msr(u32 index)
{
- return index == MSR_IA32_APICBASE || index == MSR_EFER;
+ return index == MSR_IA32_APICBASE || index == MSR_EFER ||
+ index == MSR_IA32_FEAT_CTL;
}
int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
- if (!tdx_has_emulated_msr(msr->index))
- return 1;
+ switch (msr->index) {
+ case MSR_IA32_FEAT_CTL:
+ /*
+ * MCE and MCA are advertised via cpuid. Guest kernel could
+ * check if LMCE is enabled or not.
+ */
+ msr->data = FEAT_CTL_LOCKED;
+ if (vcpu->arch.mcg_cap & MCG_LMCE_P)
+ msr->data |= FEAT_CTL_LMCE_ENABLED;
+ return 0;
+ case MSR_IA32_MCG_EXT_CTL:
+ if (!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P))
+ return 1;
+ msr->data = vcpu->arch.mcg_ext_ctl;
+ return 0;
+ default:
+ if (!tdx_has_emulated_msr(msr->index))
+ return 1;
- return kvm_get_msr_common(vcpu, msr);
+ return kvm_get_msr_common(vcpu, msr);
+ }
}
int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
- if (tdx_is_read_only_msr(msr->index))
- return 1;
+ switch (msr->index) {
+ case MSR_IA32_MCG_EXT_CTL:
+ if ((!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P)) ||
+ (msr->data & ~MCG_EXT_CTL_LMCE_EN))
+ return 1;
+ vcpu->arch.mcg_ext_ctl = msr->data;
+ return 0;
+ default:
+ if (tdx_is_read_only_msr(msr->index))
+ return 1;
- if (!tdx_has_emulated_msr(msr->index))
- return 1;
+ if (!tdx_has_emulated_msr(msr->index))
+ return 1;
- return kvm_set_msr_common(vcpu, msr);
+ return kvm_set_msr_common(vcpu, msr);
+ }
}
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
--
2.46.0
* [PATCH 10/18] KVM: TDX: Handle TDG.VP.VMCALL&lt;GetTdVmCallInfo&gt; hypercall
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (8 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 09/18] KVM: TDX: Enable guest access to LMCE related MSRs Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 11/18] KVM: TDX: Add methods to ignore accesses to CPU state Binbin Wu
` (8 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Implement the TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input value
is zero, return the success code and zero in the output registers.
TDG.VP.VMCALL<GetTdVmCallInfo> is a subleaf of TDG.VP.VMCALL that
enumerates which TDG.VP.VMCALL subleaves are supported. This hypercall
exists for future enhancement of the Guest-Host-Communication Interface
(GHCI) specification. GHCI version 344426-001US requires input R12 to be
zero and returns zero in output registers R11, R12, R13, and R14, so that
the guest TD enumerates no enhancements.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
v19:
- rename TDG_VP_VMCALL_GET_TD_VM_CALL_INFO => TDVMCALL_GET_TD_VM_CALL_INFO
---
arch/x86/include/asm/shared/tdx.h | 1 +
arch/x86/kvm/vmx/tdx.c | 16 ++++++++++++++++
2 files changed, 17 insertions(+)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index a602d081cf1c..192ae798b214 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -22,6 +22,7 @@
#define TDCS_NOTIFY_ENABLES 0x9100000000000010
/* TDX hypercall Leaf IDs */
+#define TDVMCALL_GET_TD_VM_CALL_INFO 0x10000
#define TDVMCALL_MAP_GPA 0x10001
#define TDVMCALL_GET_QUOTE 0x10002
#define TDVMCALL_REPORT_FATAL_ERROR 0x10003
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b5aae9d784f7..413359741085 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1528,6 +1528,20 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
return 1;
}
+static int tdx_get_td_vm_call_info(struct kvm_vcpu *vcpu)
+{
+ if (tdvmcall_a0_read(vcpu))
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+ else {
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
+ kvm_r11_write(vcpu, 0);
+ tdvmcall_a0_write(vcpu, 0);
+ tdvmcall_a1_write(vcpu, 0);
+ tdvmcall_a2_write(vcpu, 0);
+ }
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1550,6 +1564,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_rdmsr(vcpu);
case EXIT_REASON_MSR_WRITE:
return tdx_emulate_wrmsr(vcpu);
+ case TDVMCALL_GET_TD_VM_CALL_INFO:
+ return tdx_get_td_vm_call_info(vcpu);
default:
break;
}
--
2.46.0
* [PATCH 11/18] KVM: TDX: Add methods to ignore accesses to CPU state
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (9 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 10/18] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2025-01-15 6:51 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 12/18] KVM: TDX: Add method to ignore guest instruction emulation Binbin Wu
` (7 subsequent siblings)
18 siblings, 1 reply; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
TDX protects TDX guest state from the VMM. Implement the access methods
for TDX guest state to ignore writes and return zero for reads. Because
these methods can be called by KVM ioctls that set/get CPU registers,
they don't have KVM_BUG_ON().
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Dropped KVM_BUG_ON() in vt_sync_dirty_debug_regs(). (Rick)
Since the KVM_BUG_ON() is removed, change the changelog accordingly.
- Remove unnecessary wrappers. (Binbin)
---
arch/x86/kvm/vmx/main.c | 288 +++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 28 +++-
arch/x86/kvm/vmx/x86_ops.h | 4 +
3 files changed, 291 insertions(+), 29 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a224e9d32701..f6b449ae1ef7 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -333,6 +333,200 @@ static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
}
+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_update_exception_bitmap(vcpu);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ if (is_td_vcpu(vcpu)) {
+ memset(var, 0, sizeof(*var));
+ return;
+ }
+
+ vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_get_cpl(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+ if (is_td_vcpu(vcpu)) {
+ *db = 0;
+ *l = 0;
+ return;
+ }
+
+ vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static bool vt_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_is_valid_cr0(vcpu, cr0);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_cr0(vcpu, cr0);
+}
+
+static bool vt_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_is_valid_cr4(vcpu, cr4);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu)) {
+ memset(dt, 0, sizeof(*dt));
+ return;
+ }
+
+ vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu)) {
+ memset(dt, 0, sizeof(*dt));
+ return;
+ }
+
+ vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+ /*
+ * MOV-DR exiting is always cleared for TD guest, even in debug mode.
+ * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
+ * reach here for TD vcpu.
+ */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_cache_reg(vcpu, reg);
+ return;
+ }
+
+ vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_rflags(vcpu, rflags);
+}
+
+static bool vt_get_if_flag(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_get_if_flag(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -455,6 +649,14 @@ static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
vmx_inject_irq(vcpu, reinjected);
}
+static void vt_inject_exception(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_inject_exception(vcpu);
+}
+
static void vt_cancel_injection(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -491,6 +693,14 @@ static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
}
+static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_update_cr8_intercept(vcpu, tpr, irr);
+}
+
static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -507,6 +717,30 @@ static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
vmx_refresh_apicv_exec_ctrl(vcpu);
}
+static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
+}
+
+static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_set_tss_addr(kvm, addr);
+}
+
+static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_set_identity_map_addr(kvm, ident_addr);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -569,30 +803,30 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_load = vt_vcpu_load,
.vcpu_put = vt_vcpu_put,
- .update_exception_bitmap = vmx_update_exception_bitmap,
+ .update_exception_bitmap = vt_update_exception_bitmap,
.get_feature_msr = vmx_get_feature_msr,
.get_msr = vt_get_msr,
.set_msr = vt_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .is_valid_cr0 = vmx_is_valid_cr0,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
+ .get_segment_base = vt_get_segment_base,
+ .get_segment = vt_get_segment,
+ .set_segment = vt_set_segment,
+ .get_cpl = vt_get_cpl,
+ .get_cs_db_l_bits = vt_get_cs_db_l_bits,
+ .is_valid_cr0 = vt_is_valid_cr0,
+ .set_cr0 = vt_set_cr0,
+ .is_valid_cr4 = vt_is_valid_cr4,
+ .set_cr4 = vt_set_cr4,
+ .set_efer = vt_set_efer,
+ .get_idt = vt_get_idt,
+ .set_idt = vt_set_idt,
+ .get_gdt = vt_get_gdt,
+ .set_gdt = vt_set_gdt,
+ .set_dr7 = vt_set_dr7,
+ .sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
+ .cache_reg = vt_cache_reg,
+ .get_rflags = vt_get_rflags,
+ .set_rflags = vt_set_rflags,
+ .get_if_flag = vt_get_if_flag,
.flush_tlb_all = vt_flush_tlb_all,
.flush_tlb_current = vt_flush_tlb_current,
@@ -609,7 +843,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.patch_hypercall = vmx_patch_hypercall,
.inject_irq = vt_inject_irq,
.inject_nmi = vt_inject_nmi,
- .inject_exception = vmx_inject_exception,
+ .inject_exception = vt_inject_exception,
.cancel_injection = vt_cancel_injection,
.interrupt_allowed = vt_interrupt_allowed,
.nmi_allowed = vt_nmi_allowed,
@@ -617,13 +851,13 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_nmi_mask = vt_set_nmi_mask,
.enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
+ .update_cr8_intercept = vt_update_cr8_intercept,
.x2apic_icr_is_split = false,
.set_virtual_apic_mode = vt_set_virtual_apic_mode,
.set_apic_access_page_addr = vt_set_apic_access_page_addr,
.refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .load_eoi_exitmap = vt_load_eoi_exitmap,
.apicv_pre_state_restore = vt_apicv_pre_state_restore,
.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
.hwapic_irr_update = vt_hwapic_irr_update,
@@ -633,13 +867,13 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt,
- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
+ .set_tss_addr = vt_set_tss_addr,
+ .set_identity_map_addr = vt_set_identity_map_addr,
.get_mt_mask = vmx_get_mt_mask,
.get_exit_info = vt_get_exit_info,
- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+ .vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,
.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 413359741085..4bf3a6dc66fc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -733,8 +733,15 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
- vcpu->arch.guest_state_protected =
- !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTR_DEBUG);
+ /*
+ * TODO: support off-TD debug. If TD DEBUG is enabled, guest state
+ * can be accessed. guest_state_protected = false. and kvm ioctl to
+ * access CPU states should be usable for user space VMM (e.g. qemu).
+ *
+ * vcpu->arch.guest_state_protected =
+ * !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTR_DEBUG);
+ */
+ vcpu->arch.guest_state_protected = true;
if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
vcpu->arch.xfd_no_write_intercept = true;
@@ -2134,6 +2141,23 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
}
}
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ kvm_register_mark_available(vcpu, reg);
+ switch (reg) {
+ case VCPU_REGS_RSP:
+ case VCPU_REGS_RIP:
+ case VCPU_EXREG_PDPTR:
+ case VCPU_EXREG_CR0:
+ case VCPU_EXREG_CR3:
+ case VCPU_EXREG_CR4:
+ break;
+ default:
+ KVM_BUG_ON(1, vcpu->kvm);
+ break;
+ }
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7fb1bbf12b39..5b79bb72b13a 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -146,6 +146,8 @@ bool tdx_has_emulated_msr(u32 index);
int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
@@ -193,6 +195,8 @@ static inline bool tdx_has_emulated_msr(u32 index) { return false; }
static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
+
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
--
2.46.0
* Re: [PATCH 11/18] KVM: TDX: Add methods to ignore accesses to CPU state
2024-12-10 0:49 ` [PATCH 11/18] KVM: TDX: Add methods to ignore accesses to CPU state Binbin Wu
@ 2025-01-15 6:51 ` Binbin Wu
0 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2025-01-15 6:51 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel
On 12/10/2024 8:49 AM, Binbin Wu wrote:
[...]
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 413359741085..4bf3a6dc66fc 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -733,8 +733,15 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>
> vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
> vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
> - vcpu->arch.guest_state_protected =
> - !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTR_DEBUG);
> + /*
> + * TODO: support off-TD debug. If TD DEBUG is enabled, guest state
> + * can be accessed. guest_state_protected = false. and kvm ioctl to
> + * access CPU states should be usable for user space VMM (e.g. qemu).
> + *
> + * vcpu->arch.guest_state_protected =
> + * !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTR_DEBUG);
> + */
> + vcpu->arch.guest_state_protected = true;
This was mistakenly introduced during re-base and should be discarded.
>
> if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
> vcpu->arch.xfd_no_write_intercept = true;
> @@ -2134,6 +2141,23 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> }
> }
>
[...]
* [PATCH 12/18] KVM: TDX: Add method to ignore guest instruction emulation
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (10 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 11/18] KVM: TDX: Add methods to ignore accesses to CPU state Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 13/18] KVM: TDX: Add methods to ignore VMX preemption timer Binbin Wu
` (6 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Skip instruction emulation and let the TDX guest retry for MMIO emulation
after installing the MMIO SPTE with suppress #VE bit cleared.
TDX protects TDX guest state from the VMM, so instructions in guest memory
cannot be emulated. MMIO emulation is the only case that triggers the
instruction emulation code path for a TDX guest.
The MMIO emulation handling flow is as follows:
- The TDX guest issues a vMMIO instruction. (The GPA must be shared and is
not covered by KVM memory slot.)
- The default SPTE entry for shared-EPT by KVM has suppress #VE bit set. So
EPT violation causes TD exit to KVM.
- Trigger KVM page fault handler and install a new SPTE with suppress #VE
bit cleared.
- Skip instruction emulation and return X86EMUL_RETRY_INSTR to let the vCPU
retry.
- TDX guest re-executes the vMMIO instruction.
- TDX guest gets #VE because KVM has cleared #VE suppress bit.
- The TDX guest's #VE handler converts the MMIO access into TDG.VP.VMCALL<MMIO>.
Return X86EMUL_RETRY_INSTR in the callback check_emulate_instruction() for
TDX guests to retry the MMIO instruction. Instruction emulation handling
is also skipped, so the callback check_intercept() will never be called
for a TDX guest.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Dropped vt_check_intercept().
- Add a comment in vt_check_emulate_instruction().
- Update the changelog.
---
arch/x86/kvm/vmx/main.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f6b449ae1ef7..c97d0540a385 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -268,6 +268,22 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif
+static int vt_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
+{
+ /*
+ * For TDX, this can only be triggered for MMIO emulation. Let the
+ * guest retry after installing the SPTE with the suppress #VE bit
+ * cleared, so that the guest will receive #VE on the retry. The
+ * guest is expected to call TDG.VP.VMCALL<MMIO> to request MMIO
+ * emulation from the VMM on #VE.
+ */
+ if (is_td_vcpu(vcpu))
+ return X86EMUL_RETRY_INSTR;
+
+ return vmx_check_emulate_instruction(vcpu, emul_type, insn, insn_len);
+}
+
static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
/*
@@ -909,7 +925,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.enable_smi_window = vt_enable_smi_window,
#endif
- .check_emulate_instruction = vmx_check_emulate_instruction,
+ .check_emulate_instruction = vt_check_emulate_instruction,
.apic_init_signal_blocked = vt_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 13/18] KVM: TDX: Add methods to ignore VMX preemption timer
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (11 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 12/18] KVM: TDX: Add method to ignore guest instruction emulation Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 14/18] KVM: TDX: Add methods to ignore accesses to TSC Binbin Wu
` (5 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
TDX doesn't support the VMX-preemption timer. Implement the access
methods so that KVM ignores the VMX-preemption timer for TDX guests.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Dropped KVM_BUG_ON() in vt_cancel_hv_timer(). (Rick)
---
arch/x86/kvm/vmx/main.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c97d0540a385..4a9b176b8a36 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -757,6 +757,27 @@ static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
return vmx_set_identity_map_addr(kvm, ident_addr);
}
+#ifdef CONFIG_X86_64
+static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
+{
+ /* VMX-preemption timer isn't available for TDX. */
+ if (is_td_vcpu(vcpu))
+ return -EINVAL;
+
+ return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
+}
+
+static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+ /* VMX-preemption timer can't be set. See vt_set_hv_timer(). */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_cancel_hv_timer(vcpu);
+}
+#endif
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -912,8 +933,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.pi_start_assignment = vmx_pi_start_assignment,
#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
+ .set_hv_timer = vt_set_hv_timer,
+ .cancel_hv_timer = vt_cancel_hv_timer,
#endif
.setup_mce = vmx_setup_mce,
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 14/18] KVM: TDX: Add methods to ignore accesses to TSC
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (12 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 13/18] KVM: TDX: Add methods to ignore VMX preemption timer Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 15/18] KVM: TDX: Ignore setting up mce Binbin Wu
` (4 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
TDX protects the TDX guest's TSC state from the VMM. Implement access
methods that ignore guest TSC accesses for TDX guests.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Dropped KVM_BUG_ON() in vt_get_l2_tsc_offset(). (Rick)
---
arch/x86/kvm/vmx/main.c | 44 +++++++++++++++++++++++++++++++++++++----
1 file changed, 40 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4a9b176b8a36..81ca5acb9964 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -757,6 +757,42 @@ static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
return vmx_set_identity_map_addr(kvm, ident_addr);
}
+static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 guests at the moment. */
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_get_l2_tsc_offset(vcpu);
+}
+
+static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 guests at the moment. */
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_get_l2_tsc_multiplier(vcpu);
+}
+
+static void vt_write_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ /* In TDX, the TSC offset can't be changed. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_write_tsc_offset(vcpu);
+}
+
+static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ /* In TDX, the TSC multiplier can't be changed. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_write_tsc_multiplier(vcpu);
+}
+
#ifdef CONFIG_X86_64
static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
bool *expired)
@@ -914,10 +950,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
+ .get_l2_tsc_offset = vt_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier,
+ .write_tsc_offset = vt_write_tsc_offset,
+ .write_tsc_multiplier = vt_write_tsc_multiplier,
.load_mmu_pgd = vt_load_mmu_pgd,
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 15/18] KVM: TDX: Ignore setting up mce
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (13 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 14/18] KVM: TDX: Add methods to ignore accesses to TSC Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 16/18] KVM: TDX: Add a method to ignore hypercall patching Binbin Wu
` (3 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
The vmx_setup_mce() function is VMX specific and cannot be used for TDX.
Add a vt stub that ignores MCE setup for TDX guests.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
arch/x86/kvm/vmx/main.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 81ca5acb9964..01ad3865d54f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -814,6 +814,14 @@ static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
}
#endif
+static void vt_setup_mce(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_setup_mce(vcpu);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -973,7 +981,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.cancel_hv_timer = vt_cancel_hv_timer,
#endif
- .setup_mce = vmx_setup_mce,
+ .setup_mce = vt_setup_mce,
#ifdef CONFIG_KVM_SMM
.smi_allowed = vt_smi_allowed,
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 16/18] KVM: TDX: Add a method to ignore hypercall patching
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (14 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 15/18] KVM: TDX: Ignore setting up mce Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 17/18] KVM: TDX: Make TDX VM type supported Binbin Wu
` (2 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Because guest TD memory is protected, the VMM cannot patch the guest
binary to insert the hypercall instruction. Add a method to ignore
hypercall patching. Note: the guest TD kernel must be modified to use
TDG.VP.VMCALL for hypercalls.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Renamed from
"KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch"
to "KVM: TDX: Add a method to ignore hypercall patching".
- Dropped KVM_BUG_ON() in vt_patch_hypercall(). (Rick)
- Remove "with a warning" from "Add a method to ignore hypercall
patching with a warning." in changelog to reflect code change.
---
arch/x86/kvm/vmx/main.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 01ad3865d54f..81b9d2379a74 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -657,6 +657,19 @@ static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
return vmx_get_interrupt_shadow(vcpu);
}
+static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
+ unsigned char *hypercall)
+{
+ /*
+ * Because guest memory is protected, the guest can't be patched. The
+ * TD kernel is modified to use TDG.VP.VMCALL for hypercalls.
+ */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_patch_hypercall(vcpu, hypercall);
+}
+
static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
{
if (is_td_vcpu(vcpu))
@@ -921,7 +934,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.update_emulated_instruction = vmx_update_emulated_instruction,
.set_interrupt_shadow = vt_set_interrupt_shadow,
.get_interrupt_shadow = vt_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
+ .patch_hypercall = vt_patch_hypercall,
.inject_irq = vt_inject_irq,
.inject_nmi = vt_inject_nmi,
.inject_exception = vt_inject_exception,
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 17/18] KVM: TDX: Make TDX VM type supported
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (15 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 16/18] KVM: TDX: Add a method to ignore hypercall patching Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2024-12-10 0:49 ` [PATCH 18/18] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX) Binbin Wu
2024-12-10 18:25 ` [PATCH 00/18] KVM: TDX: TDX "the rest" part Paolo Bonzini
18 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Now that all the necessary code for TDX is in place, KVM is ready to run
TDX guests. Advertise the KVM_X86_TDX_VM VM type so that a user space
VMM such as QEMU can start to use it.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Move down to the end of patch series.
---
arch/x86/kvm/vmx/main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 81b9d2379a74..d1f58f9552ea 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -1060,6 +1060,7 @@ static int __init vt_init(void)
sizeof(struct vcpu_tdx));
vcpu_align = max_t(unsigned, vcpu_align,
__alignof__(struct vcpu_tdx));
+ kvm_caps.supported_vm_types |= BIT(KVM_X86_TDX_VM);
}
/*
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* [PATCH 18/18] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (16 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 17/18] KVM: TDX: Make TDX VM type supported Binbin Wu
@ 2024-12-10 0:49 ` Binbin Wu
2025-02-19 10:23 ` Huang, Kai
2024-12-10 18:25 ` [PATCH 00/18] KVM: TDX: TDX "the rest" part Paolo Bonzini
18 siblings, 1 reply; 26+ messages in thread
From: Binbin Wu @ 2024-12-10 0:49 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add documentation for Intel Trust Domain Extensions (TDX) support.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" breakout:
- Updates to match code changes (Tony)
---
Documentation/virt/kvm/api.rst | 9 +-
Documentation/virt/kvm/x86/index.rst | 1 +
Documentation/virt/kvm/x86/intel-tdx.rst | 357 +++++++++++++++++++++++
3 files changed, 366 insertions(+), 1 deletion(-)
create mode 100644 Documentation/virt/kvm/x86/intel-tdx.rst
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bb39da72c647..c5da37565e1e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1394,6 +1394,9 @@ the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately. Another
example is madvise(MADV_DROP).
+For a TDX guest, deleting or moving a memory region loses the guest memory
+contents. Read-only regions aren't supported. Only as-id 0 is supported.
+
Note: On arm64, a write generated by the page-table walker (to update
the Access and Dirty flags, for example) never results in a
KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This
@@ -4758,7 +4761,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.
:Capability: basic
:Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
:Parameters: an opaque platform specific structure (in/out)
:Returns: 0 on success; -1 on error
@@ -4770,6 +4773,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
Documentation/virt/kvm/x86/amd-memory-encryption.rst.
+Currently, this ioctl is also used for issuing Trust Domain Extensions
+(TDX) commands on Intel Processors. The TDX commands are defined in
+Documentation/virt/kvm/x86/intel-tdx.rst.
+
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------
diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..851e99174762 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -11,6 +11,7 @@ KVM for x86 systems
cpuid
errata
hypercalls
+ intel-tdx
mmu
msr
nested-vmx
diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
new file mode 100644
index 000000000000..12531c4c09e1
--- /dev/null
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -0,0 +1,357 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Overview
+========
+TDX stands for Trust Domain Extensions, which isolates VMs from
+the virtual-machine manager (VMM)/hypervisor and any other software on
+the platform. For details, see the specifications [1]_, whitepaper [2]_,
+architectural extensions specification [3]_, module documentation [4]_,
+loader interface specification [5]_, guest-hypervisor communication
+interface [6]_, virtual firmware design guide [7]_, and other resources
+([8]_, [9]_, [10]_, [11]_, and [12]_).
+
+
+API description
+===============
+
+KVM_MEMORY_ENCRYPT_OP
+---------------------
+:Type: vm ioctl, vcpu ioctl
+
+For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be a generic
+ioctl with TDX-specific sub-ioctl commands.
+
+::
+
+ /* Trust Domain Extensions sub-ioctl() commands. */
+ enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
+ KVM_TDX_GET_CPUID,
+
+ KVM_TDX_CMD_NR_MAX,
+ };
+
+ struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-command. If the sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 hw_error;
+ };
+
+KVM_TDX_CAPABILITIES
+--------------------
+:Type: vm ioctl
+
+A subset of the TDSYSINFO_STRUCT retrieved by the TDH.SYS.INFO TDX SEAM call
+is returned. It describes the Intel TDX module.
+
+- id: KVM_TDX_CAPABILITIES
+- flags: must be 0
+- data: pointer to struct kvm_tdx_capabilities
+- error: must be 0
+- unused: must be 0
+
+::
+
+ struct kvm_tdx_capabilities {
+ __u64 supported_attrs;
+ __u64 supported_xfam;
+ __u64 reserved[254];
+ struct kvm_cpuid2 cpuid;
+ };
+
+
+KVM_TDX_INIT_VM
+---------------
+:Type: vm ioctl
+
+Does additional VM initialization specific to TDX, which corresponds to
+the TDH.MNG.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VM
+- flags: must be 0
+- data: pointer to struct kvm_tdx_init_vm
+- error: must be 0
+- unused: must be 0
+
+::
+
+ struct kvm_tdx_init_vm {
+ __u64 attributes;
+ __u64 xfam;
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+
+ /* The total space for TD_PARAMS before the CPUIDs is 256 bytes */
+ __u64 reserved[12];
+
+ /*
+ * Call KVM_TDX_INIT_VM before vcpu creation, thus before
+ * KVM_SET_CPUID2.
+ * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the
+ * TDX module directly virtualizes those CPUIDs without VMM. The user
+ * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with
+ * those values. If it doesn't, KVM may have wrong idea of vCPUIDs of
+ * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX
+ * module doesn't virtualize.
+ */
+ struct kvm_cpuid2 cpuid;
+ };
+
+
+KVM_TDX_INIT_VCPU
+-----------------
+:Type: vcpu ioctl
+
+Does additional VCPU initialization specific to TDX, which corresponds to
+the TDH.VP.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VCPU
+- flags: must be 0
+- data: initial value of the guest TD VCPU RCX
+- error: must be 0
+- unused: must be 0
+
+KVM_TDX_INIT_MEM_REGION
+-----------------------
+:Type: vcpu ioctl
+
+Encrypt a contiguous memory region, which corresponds to the
+TDH.MEM.PAGE.ADD TDX SEAM call.
+If the KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends the
+measurement, which corresponds to the TDH.MR.EXTEND TDX SEAM call.
+
+- id: KVM_TDX_INIT_MEM_REGION
+- flags: flags
+ currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
+- data: pointer to struct kvm_tdx_init_mem_region
+- error: must be 0
+- unused: must be 0
+
+::
+
+ #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+ struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+ };
+
+
+KVM_TDX_FINALIZE_VM
+-------------------
+:Type: vm ioctl
+
+Complete measurement of the initial TD contents and mark the TD ready to run,
+which corresponds to the TDH.MR.FINALIZE TDX SEAM call.
+
+- id: KVM_TDX_FINALIZE_VM
+- flags: must be 0
+- data: must be 0
+- error: must be 0
+- unused: must be 0
+
+KVM TDX creation flow
+=====================
+In addition to the normal KVM flow, new TDX ioctls need to be called. The
+control flow looks as follows.
+
+#. system wide capability check
+
+ * KVM_CAP_VM_TYPES: check if VM type is supported and if KVM_X86_TDX_VM
+ is supported.
+
+#. creating VM
+
+ * KVM_CREATE_VM
+ * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
+ * KVM_ENABLE_CAP_VM(KVM_CAP_MAX_VCPUS): set max_vcpus, which defaults to
+ KVM_MAX_VCPUS. KVM_MAX_VCPUS is not part of the ABI, but a kernel-internal
+ constant that is subject to change. Because the maximum number of vCPUs is
+ part of attestation, it should be set explicitly.
+ * KVM_SET_TSC_KHZ for vm. optional
+ * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
+
+#. creating VCPU
+
+ * KVM_CREATE_VCPU
+ * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
+ * KVM_SET_CPUID2: Enable CPUID[0x1].ECX.X2APIC(bit 21)=1 so that the following
+ setting of MSR_IA32_APIC_BASE succeeds. Without this,
+ KVM_SET_MSRS(MSR_IA32_APIC_BASE) fails.
+ * KVM_SET_MSRS: Set the initial reset value of MSR_IA32_APIC_BASE to
+ APIC_DEFAULT_ADDRESS(0xfee00000) | XAPIC_ENABLE(bit 10) |
+ X2APIC_ENABLE(bit 11) [| MSR_IA32_APICBASE_BSP(bit 8) optional]
+
+#. initializing guest memory
+
+ * Allocate guest memory and initialize pages the same as in the normal KVM
+ case. In the TDX case, additionally parse and load TDVF into guest memory.
+ * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
+ If the pages have initial contents as above, those pages need to be added;
+ otherwise the contents will be lost and the guest sees zero pages.
+ * KVM_TDX_FINALIZE_VM: finalize the VM and the measurement.
+ This must come after KVM_TDX_INIT_MEM_REGION.
+
+#. run vcpu
+
+Design discussion
+=================
+
+Coexistence of normal(VMX) VM and TD VM
+---------------------------------------
+It's required to allow both legacy (normal VMX) VMs and new TD VMs to
+coexist; otherwise the benefits of VM flexibility would be eliminated.
+The main issue is that the logic of the kvm_x86_ops callbacks for
+TDX is different from VMX. On the other hand, kvm_x86_ops is a single
+global variable, not per-VM or per-vCPU.
+
+Several points to be considered:
+
+ * No or minimal overhead when TDX is disabled (CONFIG_INTEL_TDX_HOST=n).
+ * Avoid the overhead of indirect calls via function pointers.
+ * Contain the changes under the arch/x86/kvm/vmx directory and share logic
+ with VMX for maintenance.
+ Even though the ways to operate on a VM (VMX instructions vs. TDX
+ SEAM calls) are different, the basic idea remains the same, so much
+ of the logic can be shared.
+ * Future maintenance:
+ A huge change to kvm_x86_ops isn't expected in the (near) future, so
+ a centralized file is acceptable.
+
+- Wrapping kvm x86_ops: The current choice
+
+ Introduce a dedicated file, arch/x86/kvm/vmx/main.c (the name, main.c,
+ is chosen to show the main entry points for the callbacks), and wrapper
+ functions around all the callbacks of the form
+ "if (is-tdx) tdx-callback() else vmx-callback()".
+
+ Pros:
+
+ - No major change in common x86 KVM code. The change is (mostly)
+ contained under arch/x86/kvm/vmx/.
+ - When TDX is disabled (CONFIG_INTEL_TDX_HOST=n), the overhead is
+ optimized out.
+ - Micro-optimization by avoiding function pointers.
+
+ Cons:
+
+ - A lot of boilerplate in arch/x86/kvm/vmx/main.c.
+
+KVM MMU Changes
+---------------
+KVM MMU needs to be enhanced to handle the Secure/Shared EPT. The
+high-level execution flow is mostly the same as the normal EPT case:
+EPT violation/misconfiguration -> invoke TDP fault handler ->
+resolve TDP fault -> resume execution (or emulate MMIO).
+The difference is that the S-EPT is operated on (read/write) via
+expensive TDX SEAM calls instead of direct reads/writes of EPT entries.
+One bit of the GPA (bit 51 or 47) is repurposed to mean shared with the
+host (if set to 1) or private to the TD (if cleared to 0).
+
+- The current implementation
+
+ * Reuse the existing MMU code with minimal updates, because the
+ execution flow is mostly the same. An additional operation, the TDX
+ call for the S-EPT, is needed, so add hooks for it to kvm_x86_ops.
+ * For performance, minimize the TDX SEAM calls used to operate on the
+ S-EPT. When getting the S-EPT pages/entry corresponding to a faulting
+ GPA, don't use a TDX SEAM call to read the S-EPT entry. Instead,
+ create a shadow copy in host memory.
+ Repurpose the existing kvm_mmu_page as a shadow copy of the S-EPT and
+ associate the S-EPT with it.
+ * Treat the shared bit as an attribute. Mask/unmask the bit where
+ necessary to keep the existing traversal code working.
+ Introduce kvm.arch.gfn_shared_mask and use "if (gfn_shared_mask)"
+ for the special case:
+
+ * 0 for the non-TDX case
+ * bit 51 or 47 set for the TDX case
+
+ Pros:
+
+ - Large code reuse with minimal new hooks.
+ - Execution path is same.
+
+ Cons:
+
+ - Complicates the existing code.
+ - Repurposing kvm_mmu_page as a shadow of the Secure EPT can be confusing.
+
+New KVM API, ioctl (sub)command, to manage TD VMs
+-------------------------------------------------
+Additional KVM APIs are needed to control TD VMs. The operations on TD
+VMs are specific to TDX.
+
+- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
+
+ Although operations for TD VMs aren't necessarily related to memory
+ encryption, define sub-operations of KVM_MEMORY_ENCRYPT_OP for
+ TDX-specific ioctls.
+
+ Pros:
+
+ - No major change in common x86 KVM code.
+ - Follows the SEV case.
+
+ Cons:
+
+ - The sub-operations of KVM_MEMORY_ENCRYPT_OP aren't necessarily about
+ memory encryption; they are operations on TD VMs.
+
+References
+==========
+
+.. [1] TDX specification
+ https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
+.. [2] Intel Trust Domain Extensions (Intel TDX)
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
+.. [3] Intel CPU Architectural Extensions Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
+.. [4] Intel TDX Module 1.0 EAS
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
+.. [5] Intel TDX Loader Interface Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
+.. [6] Intel TDX Guest-Hypervisor Communication Interface
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
+.. [7] Intel TDX Virtual Firmware Design Guide
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
+.. [8] intel public github
+
+ * kvm TDX branch: https://github.com/intel/tdx/tree/kvm
+ * TDX guest branch: https://github.com/intel/tdx/tree/guest
+
+.. [9] tdvf
+ https://github.com/tianocore/edk2-staging/tree/TDVF
+.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
+ Enable Hardware Isolated VMs
+ https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
+.. [11] Linux Security Summit EU 2020:
+ Architectural Extensions for Hardware Virtual Machine Isolation
+ to Advance Confidential Computing in Public Clouds - Ravi Sahita
+ & Jun Nakajima, Intel Corporation
+ https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
+.. [12] [RFCv2,00/16] KVM protected memory extension
+ https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com/
--
2.46.0
^ permalink raw reply related [flat|nested] 26+ messages in thread* Re: [PATCH 18/18] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)
2024-12-10 0:49 ` [PATCH 18/18] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX) Binbin Wu
@ 2025-02-19 10:23 ` Huang, Kai
2025-02-20 5:50 ` Xiaoyao Li
0 siblings, 1 reply; 26+ messages in thread
From: Huang, Kai @ 2025-02-19 10:23 UTC (permalink / raw)
To: Binbin Wu, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, adrian.hunter, reinette.chatre, xiaoyao.li,
tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao, linux-kernel
On 10/12/2024 1:49 pm, Binbin Wu wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Add documentation to Intel Trusted Domain Extensions(TDX) support.
Add a whitespace before (TDX).
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> TDX "the rest" breakout:
> - Updates to match code changes (Tony)
> ---
> Documentation/virt/kvm/api.rst | 9 +-
> Documentation/virt/kvm/x86/index.rst | 1 +
> Documentation/virt/kvm/x86/intel-tdx.rst | 357 +++++++++++++++++++++++
> 3 files changed, 366 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/virt/kvm/x86/intel-tdx.rst
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index bb39da72c647..c5da37565e1e 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1394,6 +1394,9 @@ the memory region are automatically reflected into the guest. For example, an
> mmap() that affects the region will be made visible immediately. Another
> example is madvise(MADV_DROP).
>
> +For TDX guest, deleting/moving memory region loses guest memory contents.
> +Read only region isn't supported. Only as-id 0 is supported.
> +
> Note: On arm64, a write generated by the page-table walker (to update
> the Access and Dirty flags, for example) never results in a
> KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This
> @@ -4758,7 +4761,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.
>
> :Capability: basic
> :Architectures: x86
> -:Type: vm
> +:Type: vm ioctl, vcpu ioctl
> :Parameters: an opaque platform specific structure (in/out)
> :Returns: 0 on success; -1 on error
>
> @@ -4770,6 +4773,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
> (SEV) commands on AMD Processors. The SEV commands are defined in
> Documentation/virt/kvm/x86/amd-memory-encryption.rst.
>
> +Currently, this ioctl is used for issuing Trusted Domain Extensions
> +(TDX) commands on Intel Processors. The TDX commands are defined in
> +Documentation/virt/kvm/x86/intel-tdx.rst.
> +
This paragraph duplicates the SEV paragraph right above. I would
consolidate them together as:
Currently, this ioctl is used for issuing both Secure Encrypted
Virtualization (SEV) commands on AMD platforms and Trusted Domain
Extensions (TDX) commands on Intel Processors. The detailed commands
are defined in Documentation/virt/kvm/x86/amd-memory-encryption.rst and
Documentation/virt/kvm/x86/intel-tdx.rst respectively.
> 4.111 KVM_MEMORY_ENCRYPT_REG_REGION
> -----------------------------------
>
> diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
> index 9ece6b8dc817..851e99174762 100644
> --- a/Documentation/virt/kvm/x86/index.rst
> +++ b/Documentation/virt/kvm/x86/index.rst
> @@ -11,6 +11,7 @@ KVM for x86 systems
> cpuid
> errata
> hypercalls
> + intel-tdx
> mmu
> msr
> nested-vmx
> diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
> new file mode 100644
> index 000000000000..12531c4c09e1
> --- /dev/null
> +++ b/Documentation/virt/kvm/x86/intel-tdx.rst
> @@ -0,0 +1,357 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===================================
> +Intel Trust Domain Extensions (TDX)
> +===================================
> +
> +Overview
> +========
> +TDX stands for Trust Domain Extensions which isolates VMs from
> +the virtual-machine manager (VMM)/hypervisor and any other software on
> +the platform. For details, see the specifications [1]_, whitepaper [2]_,
> +architectural extensions specification [3]_, module documentation [4]_,
> +loader interface specification [5]_, guest-hypervisor communication
> +interface [6]_, virtual firmware design guide [7]_, and other resources
> +([8]_, [9]_, [10]_, [11]_, and [12]_).
Some links (e.g., [4]) are not accessible anymore.
I didn't check all links, but I don't think it's necessary to put links
of all specs here. We can have a link which contains all the specs
(assuming this link doesn't change at least for quite a long time).
And there's already a Documentation/arch/x86/tdx.rst which describes
TDX support in the host core-kernel and the guest. We can mention it
here, and it also has some introductory sentences that we can borrow.
So how about:
Overview
========
Intel's Trust Domain Extensions (TDX) protect confidential guest VMs
from the host and physical attacks. A CPU-attested software module
called 'the TDX module' runs inside a new CPU isolated range to provide
the functionalities to manage and run protected VMs, a.k.a, TDX guests
or TDs.
Please refer to [1] for the whitepaper, specifications and other resources.
This documentation describes TDX-specific KVM ABIs. The TDX module
needs to be initialized before it can be used by KVM to run any TDX
guests. The host core-kernel provides the support of initializing the
TDX module, which is described in the Documentation/arch/x86/tdx.rst.
......
[1]:
https://www.intel.com/content/www/us/en/developer/tools/trust-domain-extensions/documentation.html
> +
> +
> +API description
> +===============
> +
> +KVM_MEMORY_ENCRYPT_OP
> +---------------------
> +:Type: vm ioctl, vcpu ioctl
> +
> +For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be generic
> +ioctl with TDX specific sub ioctl command.
command -> commands.
> +
> +::
> +
> + /* Trust Domain eXtension sub-ioctl() commands. */
I think "Extensions" is used in every place in the kernel, so
"eXtension" -> "Extensions".
And lack of consistency between "sub ioctl commands" and "sub-ioctl()
commands". Perhaps just use "sub-commands" for all the places.
> + enum kvm_tdx_cmd_id {
> + KVM_TDX_CAPABILITIES = 0,
> + KVM_TDX_INIT_VM,
> + KVM_TDX_INIT_VCPU,
> + KVM_TDX_INIT_MEM_REGION,
> + KVM_TDX_FINALIZE_VM,
> + KVM_TDX_GET_CPUID,
> +
> + KVM_TDX_CMD_NR_MAX,
> + };
> +
> + struct kvm_tdx_cmd {
> + /* enum kvm_tdx_cmd_id */
> + __u32 id;
> + /* flags for sub-commend. If sub-command doesn't use this, set zero. */
commend -> command.
> + __u32 flags;
> + /*
> + * data for each sub-command. An immediate or a pointer to the actual
> + * data in process virtual address. If sub-command doesn't use it,
> + * set zero.
> + */
> + __u64 data;
> + /*
> + * Auxiliary error code. The sub-command may return TDX SEAMCALL
> + * status code in addition to -Exxx.
> + * Defined for consistency with struct kvm_sev_cmd.
> + */
"Defined for consistency with struct kvm_sev_cmd" got removed in the
code. It should be removed here too.
> + __u64 hw_error;
> + };
> +
> +KVM_TDX_CAPABILITIES
> +--------------------
> +:Type: vm ioctl
> +
> +Subset of TDSYSINFO_STRUCT retrieved by TDH.SYS.INFO TDX SEAM call will be
> +returned. It describes the Intel TDX module.
We are not using TDH.SYS.INFO and TDSYSINFO_STRUCT anymore. Perhaps:
Retrieve TDX module global capabilities for running TDX guests.
> +
> +- id: KVM_TDX_CAPABILITIES
> +- flags: must be 0
> +- data: pointer to struct kvm_tdx_capabilities
> +- error: must be 0
> +- unused: must be 0
@error should be @hw_error.
And I don't see @unused anywhere in the 'struct kvm_tdx_cmd'. Should be
removed.
The same to all below sub-commands.
> +
> +::
> +
> + struct kvm_tdx_capabilities {
> + __u64 supported_attrs;
> + __u64 supported_xfam;
> + __u64 reserved[254];
> + struct kvm_cpuid2 cpuid;
> + };
> +
> +
> +KVM_TDX_INIT_VM
> +---------------
> +:Type: vm ioctl
I would add description for return values:
:Returns: 0 on success, <0 on error.
We can also repeat this in all sub-commands, but I am not sure it's
necessary.
> +
> +Does additional VM initialization specific to TDX which corresponds to
> +TDH.MNG.INIT TDX SEAM call.
"SEAM call" -> "SEAMCALL" for consistency. And the same to below all
sub-commands.
Nit:
I am not sure whether we need to, or should, mention the detailed
SEAMCALL here. To me the ABI doesn't need to document the detailed
implementation (i.e., which SEAMCALL is used) in the underneath kernel.
> +
> +- id: KVM_TDX_INIT_VM
> +- flags: must be 0
> +- data: pointer to struct kvm_tdx_init_vm
> +- error: must be 0
> +- unused: must be 0
> +
> +::
> +
> + struct kvm_tdx_init_vm {
> + __u64 attributes;
> + __u64 xfam;
> + __u64 mrconfigid[6]; /* sha384 digest */
> + __u64 mrowner[6]; /* sha384 digest */
> + __u64 mrownerconfig[6]; /* sha384 digest */
> +
> + /* The total space for TD_PARAMS before the CPUIDs is 256 bytes */
> + __u64 reserved[12];
> +
> + /*
> + * Call KVM_TDX_INIT_VM before vcpu creation, thus before
> + * KVM_SET_CPUID2.
> + * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the
> + * TDX module directly virtualizes those CPUIDs without VMM. The user
> + * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with
> + * those values. If it doesn't, KVM may have wrong idea of vCPUIDs of
> + * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX
> + * module doesn't virtualize.
> + */
> + struct kvm_cpuid2 cpuid;
> + };
> +
> +
> +KVM_TDX_INIT_VCPU
> +-----------------
> +:Type: vcpu ioctl
> +
> +Does additional VCPU initialization specific to TDX which corresponds to
> +TDH.VP.INIT TDX SEAM call.
> +
> +- id: KVM_TDX_INIT_VCPU
> +- flags: must be 0
> +- data: initial value of the guest TD VCPU RCX
> +- error: must be 0
> +- unused: must be 0
> +
> +KVM_TDX_INIT_MEM_REGION
> +-----------------------
> +:Type: vcpu ioctl
> +
> +Encrypt a memory continuous region which corresponding to TDH.MEM.PAGE.ADD
> +TDX SEAM call.
"a contiguous guest memory region"?
And "which corresponding to .." has grammar issue.
How about:
Load and encrypt a contiguous memory region from the source
memory which corresponds to the TDH.MEM.PAGE.ADD TDX SEAMCALL.
> +If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends measurement
> +which corresponds to TDH.MR.EXTEND TDX SEAM call.
> +
> +- id: KVM_TDX_INIT_MEM_REGION
> +- flags: flags
> + currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
> +- data: pointer to struct kvm_tdx_init_mem_region
> +- error: must be 0
> +- unused: must be 0
> +
> +::
> +
> + #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
> +
> + struct kvm_tdx_init_mem_region {
> + __u64 source_addr;
> + __u64 gpa;
> + __u64 nr_pages;
> + };
> +
> +
> +KVM_TDX_FINALIZE_VM
> +-------------------
> +:Type: vm ioctl
> +
> +Complete measurement of the initial TD contents and mark it ready to run
> +which corresponds to TDH.MR.FINALIZE
Missing period at the end of the sentence.
And nit again: I don't like the "which corresponds to TDH.MR.FINALIZE".
> +
> +- id: KVM_TDX_FINALIZE_VM
> +- flags: must be 0
> +- data: must be 0
> +- error: must be 0
> +- unused: must be 0
This patch doesn't contain KVM_TDX_GET_CPUID. I saw in internal dev
branch we have it.
> +
> +KVM TDX creation flow
> +=====================
> +In addition to KVM normal flow, new TDX ioctls need to be called. The control flow
> +looks like as follows.
> +
> +#. system wide capability check
To make all consistent:
Check system wide capability
> +
> + * KVM_CAP_VM_TYPES: check if VM type is supported and if KVM_X86_TDX_VM
> + is supported.
> +
> +#. creating VM
Create VM
> +
> + * KVM_CREATE_VM
> + * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
"TDX is supported or not" is already checked in step 1.
I think we should say:
query TDX global capabilities for creating TDX guests.
> + * KVM_ENABLE_CAP_VM(KVM_CAP_MAX_VCPUS): set max_vcpus. KVM_MAX_VCPUS by
> + default. KVM_MAX_VCPUS is not a part of ABI, but kernel internal constant
> + that is subject to change. Because max vcpus is a part of attestation, max
> + vcpus should be explicitly set.
This is out-of-date.
* KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS): query maximum vcpus the
TDX guest can support (TDX has its own limitation on this).
> + * KVM_SET_TSC_KHZ for vm. optional
For consistency:
* KVM_SET_TSC_KHZ: optional
> + * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
> +
> +#. creating VCPU
Create vCPUs
> +
> + * KVM_CREATE_VCPU
> + * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
> + * KVM_SET_CPUID2: Enable CPUID[0x1].ECX.X2APIC(bit 21)=1 so that the following
> + setting of MSR_IA32_APIC_BASE success. Without this,
> + KVM_SET_MSRS(MSR_IA32_APIC_BASE) fails.
I would prefer to put X2APIC specific to a note:
* KVM_SET_CPUID2: configure guest's CPUIDs. Note: Enable ...
> + * KVM_SET_MSRS: Set the initial reset value of MSR_IA32_APIC_BASE to
> + APIC_DEFAULT_ADDRESS(0xfee00000) | XAPIC_ENABLE(bit 10) |
> + X2APIC_ENABLE(bit 11) [| MSR_IA32_APICBASE_BSP(bit 8) optional]
Ditto, I believe there are other MSRs to be set too.
> +
> +#. initializing guest memory
Initialize initial guest memory
> +
> + * allocate guest memory and initialize page same to normal KVM case
Cannot parse this.
> + In TDX case, parse and load TDVF into guest memory in addition.
Don't understand "parse TDVF" either.
> + * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
> + If the pages has contents above, those pages need to be added.
> + Otherwise the contents will be lost and guest sees zero pages.
> + * KVM_TDX_FINALIAZE_VM: Finalize VM and measurement
> + This must be after KVM_TDX_INIT_MEM_REGION.
Perhaps refine the above to:
* Allocate guest memory in the same way as allocating memory for
normal VMs.
* KVM_TDX_INIT_MEM_REGION to add initial guest memory. Note for
now TDX guests only works with TDVF, thus the TDVF needs to be
included in the initial guest memory.
* KVM_TDX_FINALIZE_VM: Finalize the measurement of the TDX guest.
> +
> +#. run vcpu
Run vCPUs
> +
> +Design discussion
> +=================
"discussion" won't be appropriate after merge. Let's just use "Design
details".
> +
> +Coexistence of normal(VMX) VM and TD VM
normal (VMX) VM
> +---------------------------------------
> +It's required to allow both legacy(normal VMX) VMs and new TD VMs to
> +coexist. Otherwise the benefits of VM flexibility would be eliminated.
> +The main issue for it is that the logic of kvm_x86_ops callbacks for
> +TDX is different from VMX. On the other hand, the variable,
> +kvm_x86_ops, is global single variable. Not per-VM, not per-vcpu.
> +
> +Several points to be considered:
> +
> + * No or minimal overhead when TDX is disabled(CONFIG_INTEL_TDX_HOST=n).
> + * Avoid overhead of indirect call via function pointers.
> + * Contain the changes under arch/x86/kvm/vmx directory and share logic
> + with VMX for maintenance.
> + Even though the ways to operation on VM (VMX instruction vs TDX
> + SEAM call) are different, the basic idea remains the same. So, many
> + logic can be shared.
> + * Future maintenance
> + The huge change of kvm_x86_ops in (near) future isn't expected.
> + a centralized file is acceptable.
> +
> +- Wrapping kvm x86_ops: The current choice
> +
> + Introduce dedicated file for arch/x86/kvm/vmx/main.c (the name,
> + main.c, is just chosen to show main entry points for callbacks.) and
> + wrapper functions around all the callbacks with
> + "if (is-tdx) tdx-callback() else vmx-callback()".
> +
> + Pros:
> +
> + - No major change in common x86 KVM code. The change is (mostly)
> + contained under arch/x86/kvm/vmx/.
> + - When TDX is disabled(CONFIG_INTEL_TDX_HOST=n), the overhead is
> + optimized out.
> + - Micro optimization by avoiding function pointer.
> +
> + Cons:
> +
> + - Many boiler plates in arch/x86/kvm/vmx/main.c.
> +
> +KVM MMU Changes
> +---------------
> +KVM MMU needs to be enhanced to handle Secure/Shared-EPT. The
> +high-level execution flow is mostly same to normal EPT case.
> +EPT violation/misconfiguration -> invoke TDP fault handler ->
> +resolve TDP fault -> resume execution. (or emulate MMIO)
> +The difference is, that S-EPT is operated(read/write) via TDX SEAM
> +call which is expensive instead of direct read/write EPT entry.
> +One bit of GPA (51 or 47 bit) is repurposed so that it means shared
> +with host(if set to 1) or private to TD(if cleared to 0).
> +
> +- The current implementation
> +
> + * Reuse the existing MMU code with minimal update. Because the
> + execution flow is mostly same. But additional operation, TDX call
> + for S-EPT, is needed. So add hooks for it to kvm_x86_ops.
> + * For performance, minimize TDX SEAM call to operate on S-EPT. When
> + getting corresponding S-EPT pages/entry from faulting GPA, don't
> + use TDX SEAM call to read S-EPT entry. Instead create shadow copy
> + in host memory.
> + Repurpose the existing kvm_mmu_page as shadow copy of S-EPT and
> + associate S-EPT to it.
> + * Treats share bit as attributes. mask/unmask the bit where
> + necessary to keep the existing traversing code works.
> + Introduce kvm.arch.gfn_shared_mask and use "if (gfn_share_mask)"
> + for special case.
> +
> + * 0 : for non-TDX case
> + * 51 or 47 bit set for TDX case.
> +
> + Pros:
> +
> + - Large code reuse with minimal new hooks.
> + - Execution path is same.
> +
> + Cons:
> +
> + - Complicates the existing code.
> + - Repurpose kvm_mmu_page as shadow of Secure-EPT can be confusing.
> +
> +New KVM API, ioctl (sub)command, to manage TD VMs
> +-------------------------------------------------
> +Additional KVM APIs are needed to control TD VMs. The operations on TD
> +VMs are specific to TDX.
> +
> +- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
> +
> + Although operations for TD VMs aren't necessarily related to memory
> + encryption, define sub operations of KVM_MEMORY_ENCRYPT_OP for TDX specific
> + ioctls.
> +
> + Pros:
> +
> + - No major change in common x86 KVM code.
> + - Follows the SEV case.
> +
> + Cons:
> +
> + - The sub operations of KVM_MEMORY_ENCRYPT_OP aren't necessarily memory
> + encryption, but operations on TD VMs.
I vote to get rid of the above "design discussion" completely.
amd-memory-encryption.rst doesn't seem to have such details.
> +
> +References
> +==========
> +
> +.. [1] TDX specification
> + https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
> +.. [2] Intel Trust Domain Extensions (Intel TDX)
> + https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
> +.. [3] Intel CPU Architectural Extensions Specification
> + https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
> +.. [4] Intel TDX Module 1.0 EAS
> + https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
> +.. [5] Intel TDX Loader Interface Specification
> + https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
> +.. [6] Intel TDX Guest-Hypervisor Communication Interface
> + https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> +.. [7] Intel TDX Virtual Firmware Design Guide
> + https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
As said above, I think we can just provide a link which contains all --
whitepaper, specs, and other things like source code.
> +.. [8] intel public github
> +
> + * kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> + * TDX guest branch: https://github.com/intel/tdx/tree/guest
I don't think this should be included.
> +
> +.. [9] tdvf
> + https://github.com/tianocore/edk2-staging/tree/TDVF
> +.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
> + Enable Hardware Isolated VMs
> + https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
> +.. [11] Linux Security Summit EU 2020:
> + Architectural Extensions for Hardware Virtual Machine Isolation
> + to Advance Confidential Computing in Public Clouds - Ravi Sahita
> + & Jun Nakajima, Intel Corporation
> + https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
[...]
> +.. [12] [RFCv2,00/16] KVM protected memory extension
> + https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com/
Why put this RFC here?
* Re: [PATCH 18/18] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)
2025-02-19 10:23 ` Huang, Kai
@ 2025-02-20 5:50 ` Xiaoyao Li
2025-02-20 23:45 ` Huang, Kai
0 siblings, 1 reply; 26+ messages in thread
From: Xiaoyao Li @ 2025-02-20 5:50 UTC (permalink / raw)
To: Huang, Kai, Binbin Wu, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, adrian.hunter, reinette.chatre, tony.lindgren,
isaku.yamahata, yan.y.zhao, chao.gao, linux-kernel
On 2/19/2025 6:23 PM, Huang, Kai wrote:
>
>
> On 10/12/2024 1:49 pm, Binbin Wu wrote:
...
>> +
>> +
>> +API description
>> +===============
>> +
>> +KVM_MEMORY_ENCRYPT_OP
>> +---------------------
>> +:Type: vm ioctl, vcpu ioctl
>> +
>> +For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be generic
>> +ioctl with TDX specific sub ioctl command.
>
> command -> commands.
>
>> +
>> +::
>> +
>> + /* Trust Domain eXtension sub-ioctl() commands. */
>
> I think "Extensions" is used in every place in the kernel, so
> "eXtension" -> "Extensions".
>
> And lack of consistency between "sub ioctl commands" and "sub-ioctl()
> commands". Perhaps just use "sub-commands" for all the places.
It's copied from the kernel header file; we need to fix it there first.
>> + enum kvm_tdx_cmd_id {
>> + KVM_TDX_CAPABILITIES = 0,
>> + KVM_TDX_INIT_VM,
>> + KVM_TDX_INIT_VCPU,
>> + KVM_TDX_INIT_MEM_REGION,
>> + KVM_TDX_FINALIZE_VM,
>> + KVM_TDX_GET_CPUID,
>> +
>> + KVM_TDX_CMD_NR_MAX,
>> + };
>> +
>> + struct kvm_tdx_cmd {
>> + /* enum kvm_tdx_cmd_id */
>> + __u32 id;
>> + /* flags for sub-commend. If sub-command doesn't use this,
>> set zero. */
>
> commend -> command.
Ditto.
>> + __u32 flags;
>> + /*
>> + * data for each sub-command. An immediate or a pointer to
>> the actual
>> + * data in process virtual address. If sub-command doesn't
>> use it,
>> + * set zero.
>> + */
>> + __u64 data;
>> + /*
>> + * Auxiliary error code. The sub-command may return TDX
>> SEAMCALL
>> + * status code in addition to -Exxx.
>> + * Defined for consistency with struct kvm_sev_cmd.
>> + */
>
> "Defined for consistency with struct kvm_sev_cmd" got removed in the
> code. It should be removed here too.
>
>> + __u64 hw_error;
>> + };
>> +
>> +KVM_TDX_CAPABILITIES
>> +--------------------
>> +:Type: vm ioctl
>> +
>> +Subset of TDSYSINFO_STRUCT retrieved by TDH.SYS.INFO TDX SEAM call
>> will be
>> +returned. It describes the Intel TDX module.
>
> We are not using TDH.SYS.INFO and TDSYSINFO_STRUCT anymore. Perhaps:
>
> Retrive TDX moduel global capabilities for running TDX guests.
(some typos: Retrive => Retrieve, moduel => module)
It's not the TDX module global capabilities. It has been adjusted by KVM
to mask off the bits that are not supported by KVM.
How about:
Return the TDX capabilities that the current KVM supports with the
specific TDX module loaded in the system. It reports what
features/capabilities are allowed to be configured to the TDX guest.
>> +
>> +- id: KVM_TDX_CAPABILITIES
>> +- flags: must be 0
>> +- data: pointer to struct kvm_tdx_capabilities
>> +- error: must be 0
>> +- unused: must be 0
>
> @error should be @hw_error.
>
> And I don't see @unused anyware in the 'struct kvm_tdx_cmd'. Should be
> removed.
>
> The same to all below sub-commands.
>
>> +
>> +::
>> +
>> + struct kvm_tdx_capabilities {
>> + __u64 supported_attrs;
>> + __u64 supported_xfam;
>> + __u64 reserved[254];
>> + struct kvm_cpuid2 cpuid;
>> + };
>> +
>> +
>> +KVM_TDX_INIT_VM
>> +---------------
>> +:Type: vm ioctl
>
> I would add description for return values:
>
> :Returns: 0 on success, <0 on error.
+1,
I have added it for KVM_TDX_GET_CPUID internally.
> We can also repeat this in all sub-commands, but I am not sure it's
> necessary.
>
>> +
>> +Does additional VM initialization specific to TDX which corresponds to
>> +TDH.MNG.INIT TDX SEAM call.
>
> "SEAM call" -> "SEAMCALL" for consistency. And the same to below all
> sub-commands.
>
> Nit:
>
> I am not sure whether we need to, or should, mention the detailed
> SEAMCALL here. To me the ABI doesn't need to document the detailed
> implementation (i.e., which SEAMCALL is used) in the underneath kernel.
Agreed to drop mentioning the detailed SEAMCALL.
Maybe we can also mention the sequence requirement: it needs to be
called after KVM_CREATE_VM and before creating any vCPUs.
>> +
>> +- id: KVM_TDX_INIT_VM
>> +- flags: must be 0
>> +- data: pointer to struct kvm_tdx_init_vm
>> +- error: must be 0
>> +- unused: must be 0
>> +
>> +::
>> +
>> + struct kvm_tdx_init_vm {
>> + __u64 attributes;
>> + __u64 xfam;
>> + __u64 mrconfigid[6]; /* sha384 digest */
>> + __u64 mrowner[6]; /* sha384 digest */
>> + __u64 mrownerconfig[6]; /* sha384 digest */
>> +
>> + /* The total space for TD_PARAMS before the CPUIDs is 256
>> bytes */
>> + __u64 reserved[12];
>> +
>> + /*
>> + * Call KVM_TDX_INIT_VM before vcpu creation, thus before
>> + * KVM_SET_CPUID2.
>> + * This configuration supersedes KVM_SET_CPUID2s for VCPUs
>> because the
>> + * TDX module directly virtualizes those CPUIDs without VMM.
>> The user
>> + * space VMM, e.g. qemu, should make KVM_SET_CPUID2
>> consistent with
>> + * those values. If it doesn't, KVM may have wrong idea of
>> vCPUIDs of
>> + * the guest, and KVM may wrongly emulate CPUIDs or MSRs that
>> the TDX
>> + * module doesn't virtualize.
>> + */
>> + struct kvm_cpuid2 cpuid;
>> + };
>> +
>> +
>> +KVM_TDX_INIT_VCPU
>> +-----------------
>> +:Type: vcpu ioctl
>> +
>> +Does additional VCPU initialization specific to TDX which corresponds to
>> +TDH.VP.INIT TDX SEAM call.
Same for the TDH.VP.INIT SEAMCALL, I don't think we need to mention it.
And KVM_TDX_INIT_VCPU does more things other than TDH.VP.INIT.
How about just:
Perform TDX-specific vCPU initialization.
>> +- id: KVM_TDX_INIT_VCPU
>> +- flags: must be 0
>> +- data: initial value of the guest TD VCPU RCX
>> +- error: must be 0
>> +- unused: must be 0
>> +
>> +KVM_TDX_INIT_MEM_REGION
>> +-----------------------
>> +:Type: vcpu ioctl
>> +
>> +Encrypt a memory continuous region which corresponding to
>> TDH.MEM.PAGE.ADD
>> +TDX SEAM call.
>
> "a contiguous guest memory region"?
>
> And "which corresponding to .." has grammar issue.
>
> How about:
>
> Load and encrypt a contiguous memory region from the source
> memory which corresponds to the TDH.MEM.PAGE.ADD TDX SEAMCALL.
As suggested above, I prefer to drop mentioning the detailed SEAMCALL.
Besides, it's better to call out the dependency that the gpa needs to be
set to the private attribute before calling KVM_TDX_INIT_MEM_REGION to
initialize the private memory.
How about:
Initialize @nr_pages of TDX guest private memory starting from @gpa with
userspace-provided data from @source_addr.
Note, before calling this sub-command, the memory attribute of the range
[gpa, gpa + nr_pages] needs to be private. Userspace can use
KVM_SET_MEMORY_ATTRIBUTES to set the attribute.
>> +If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends
>> measurement
>> +which corresponds to TDH.MR.EXTEND TDX SEAM call.
>> +
>> +- id: KVM_TDX_INIT_MEM_REGION
>> +- flags: flags
>> + currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
>> +- data: pointer to struct kvm_tdx_init_mem_region
>> +- error: must be 0
>> +- unused: must be 0
>> +
>> +::
>> +
>> + #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
>> +
>> + struct kvm_tdx_init_mem_region {
>> + __u64 source_addr;
>> + __u64 gpa;
>> + __u64 nr_pages;
>> + };
>> +
>> +
>> +KVM_TDX_FINALIZE_VM
>> +-------------------
>> +:Type: vm ioctl
>> +
>> +Complete measurement of the initial TD contents and mark it ready to run
>> +which corresponds to TDH.MR.FINALIZE
>
> Missing period at the end of the sentence.
>
> And nit again: I don't like the "which corresponds to TDH.MR.FINALIZE".
>
>> +
>> +- id: KVM_TDX_FINALIZE_VM
>> +- flags: must be 0
>> +- data: must be 0
>> +- error: must be 0
>> +- unused: must be 0
>
>
> This patch doesn't contain KVM_TDX_GET_CPUID. I saw in internal dev
> branch we have it.
>
I added it in the internal branch. The next version should have it.
>> +
>> +KVM TDX creation flow
>> +=====================
>> +In addition to KVM normal flow, new TDX ioctls need to be called.
>> The control flow
>> +looks like as follows.
>> +
>> +#. system wide capability check
>
> To make all consistent:
>
> Check system wide capability
>
>> +
>> + * KVM_CAP_VM_TYPES: check if VM type is supported and if
>> KVM_X86_TDX_VM
>> + is supported.
>> +
>> +#. creating VM
>
> Create VM
>> +
>> + * KVM_CREATE_VM
>> + * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
>
> "TDX is supported or not" is already checked in step 1.
>
> I think we should say:
>
> query TDX global capabilities for creating TDX guests.
I don't like the "global" term and I don't think it's correct.
KVM_TDX_CAPABILITIES was requested to be changed from a platform-scope
ioctl to VM-scope. It should only report the per-TD capabilities, though
they are identical for all TDs currently.
>> + * KVM_ENABLE_CAP_VM(KVM_CAP_MAX_VCPUS): set max_vcpus.
>> KVM_MAX_VCPUS by
>> + default. KVM_MAX_VCPUS is not a part of ABI, but kernel
>> internal constant
>> + that is subject to change. Because max vcpus is a part of
>> attestation, max
>> + vcpus should be explicitly set.
>
> This is out-of-date.
>
> * KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS): query maximum vcpus the
> TDX guest can support (TDX has its own limitation on this).
More precisely, KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS) at VM level.
>> + * KVM_SET_TSC_KHZ for vm. optional
>
> For consistency:
>
> * KVM_SET_TSC_KHZ: optional
More precisely, KVM_SET_TSC_KHZ at VM level if userspace desires a
different TSC frequency than the host's. Otherwise, the host's TSC
frequency will be configured for the TD.
>
>> + * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
>> +
>> +#. creating VCPU
>
> Create vCPUs
>
>> +
>> + * KVM_CREATE_VCPU
>> + * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
>> + * KVM_SET_CPUID2: Enable CPUID[0x1].ECX.X2APIC(bit 21)=1 so that
>> the following
>> + setting of MSR_IA32_APIC_BASE success. Without this,
>> + KVM_SET_MSRS(MSR_IA32_APIC_BASE) fails.
>
> I would prefer to put X2APIC specific to a note:
>
> * KVM_SET_CPUID2: configure guest's CPUIDs. Note: Enable ...
>
>> + * KVM_SET_MSRS: Set the initial reset value of MSR_IA32_APIC_BASE to
>> + APIC_DEFAULT_ADDRESS(0xfee00000) | XAPIC_ENABLE(bit 10) |
>> + X2APIC_ENABLE(bit 11) [| MSR_IA32_APICBASE_BSP(bit 8) optional]
This is not true now. MSR_IA32_APICBASE is a read-only MSR for TDX. KVM
disallows userspace from setting MSR_IA32_APICBASE, and
ioctl(KVM_TDX_INIT_VCPU) initializes a correct value for
MSR_IA32_APICBASE internally.
> Ditto, I believe there are other MSRs to be set too.
It seems there is no must-be-set MSR for TDX guests.
>> +
>> +#. initializing guest memory
>
> Initialize initial guest memory
>
>> +
>> + * allocate guest memory and initialize page same to normal KVM case
>
> Cannot parse this.
>
>> + In TDX case, parse and load TDVF into guest memory in addition.
>
> Don't understand "parse TDVF" either.
>
>> + * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
>> + If the pages has contents above, those pages need to be added.
>> + Otherwise the contents will be lost and guest sees zero pages.
>> + * KVM_TDX_FINALIAZE_VM: Finalize VM and measurement
>> + This must be after KVM_TDX_INIT_MEM_REGION.
>
> Perhaps refine the above to:
>
>
> * Allocate guest memory in the same way as allocating memory for
> normal VMs.
> * KVM_TDX_INIT_MEM_REGION to add initial guest memory. Note for
> now TDX guests only works with TDVF, thus the TDVF needs to be
> included in the initial guest memory.
> * KVM_TDX_FINALIZE_VM: Finalize the measurement of the TDX guest.
I'm not sure if we need to mention TDVF here. I think KVM doesn't care
about it. People can always create a new virtual BIOS for TDX guests
themselves.
>> +
>> +#. run vcpu
>
> Run vCPUs
>
>> +
>> +Design discussion
>> +=================
>
> "discussion" won't be appropriate after merge. Let's just use "Design
> details".
>
>> +
>> +Coexistence of normal(VMX) VM and TD VM
>
> normal (VMX) VM
>
>> +---------------------------------------
>> +It's required to allow both legacy(normal VMX) VMs and new TD VMs to
>> +coexist. Otherwise the benefits of VM flexibility would be eliminated.
>> +The main issue for it is that the logic of kvm_x86_ops callbacks for
>> +TDX is different from VMX. On the other hand, the variable,
>> +kvm_x86_ops, is global single variable. Not per-VM, not per-vcpu.
>> +
>> +Several points to be considered:
>> +
>> + * No or minimal overhead when TDX is
>> disabled(CONFIG_INTEL_TDX_HOST=n).
>> + * Avoid overhead of indirect call via function pointers.
>> + * Contain the changes under arch/x86/kvm/vmx directory and share logic
>> + with VMX for maintenance.
>> + Even though the ways to operation on VM (VMX instruction vs TDX
>> + SEAM call) are different, the basic idea remains the same. So, many
>> + logic can be shared.
>> + * Future maintenance
>> + The huge change of kvm_x86_ops in (near) future isn't expected.
>> + a centralized file is acceptable.
>> +
>> +- Wrapping kvm x86_ops: The current choice
>> +
>> + Introduce dedicated file for arch/x86/kvm/vmx/main.c (the name,
>> + main.c, is just chosen to show main entry points for callbacks.) and
>> + wrapper functions around all the callbacks with
>> + "if (is-tdx) tdx-callback() else vmx-callback()".
>> +
>> + Pros:
>> +
>> + - No major change in common x86 KVM code. The change is (mostly)
>> + contained under arch/x86/kvm/vmx/.
>> + - When TDX is disabled(CONFIG_INTEL_TDX_HOST=n), the overhead is
>> + optimized out.
>> + - Micro optimization by avoiding function pointer.
>> +
>> + Cons:
>> +
>> + - Many boiler plates in arch/x86/kvm/vmx/main.c.
>> +
>> +KVM MMU Changes
>> +---------------
>> +KVM MMU needs to be enhanced to handle Secure/Shared-EPT. The
>> +high-level execution flow is mostly same to normal EPT case.
>> +EPT violation/misconfiguration -> invoke TDP fault handler ->
>> +resolve TDP fault -> resume execution. (or emulate MMIO)
>> +The difference is, that S-EPT is operated(read/write) via TDX SEAM
>> +call which is expensive instead of direct read/write EPT entry.
>> +One bit of GPA (51 or 47 bit) is repurposed so that it means shared
>> +with host(if set to 1) or private to TD(if cleared to 0).
>> +
>> +- The current implementation
>> +
>> + * Reuse the existing MMU code with minimal update. Because the
>> + execution flow is mostly same. But additional operation, TDX call
>> + for S-EPT, is needed. So add hooks for it to kvm_x86_ops.
>> + * For performance, minimize TDX SEAM call to operate on S-EPT. When
>> + getting corresponding S-EPT pages/entry from faulting GPA, don't
>> + use TDX SEAM call to read S-EPT entry. Instead create shadow copy
>> + in host memory.
>> + Repurpose the existing kvm_mmu_page as shadow copy of S-EPT and
>> + associate S-EPT to it.
>> + * Treats share bit as attributes. mask/unmask the bit where
>> + necessary to keep the existing traversing code works.
>> + Introduce kvm.arch.gfn_shared_mask and use "if (gfn_share_mask)"
>> + for special case.
>> +
>> + * 0 : for non-TDX case
>> + * 51 or 47 bit set for TDX case.
>> +
>> + Pros:
>> +
>> + - Large code reuse with minimal new hooks.
>> + - Execution path is same.
>> +
>> + Cons:
>> +
>> + - Complicates the existing code.
>> + - Repurpose kvm_mmu_page as shadow of Secure-EPT can be confusing.
>> +
>> +New KVM API, ioctl (sub)command, to manage TD VMs
>> +-------------------------------------------------
>> +Additional KVM APIs are needed to control TD VMs. The operations on TD
>> +VMs are specific to TDX.
>> +
>> +- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
>> +
>> + Although operations for TD VMs aren't necessarily related to memory
>> + encryption, define sub-operations of KVM_MEMORY_ENCRYPT_OP for
>> + TDX-specific ioctls.
>> +
>> + Pros:
>> +
>> + - No major change in common x86 KVM code.
>> + - Follows the SEV case.
>> +
>> + Cons:
>> +
>> + - The sub-operations of KVM_MEMORY_ENCRYPT_OP aren't necessarily
>> + memory encryption, but operations on TD VMs.
>
> I vote to get rid of the above "design discussion" completely.
>
> amd-memory-encryption.rst doesn't seem to have such details.
>
>> +
>> +References
>> +==========
>> +
>> +.. [1] TDX specification
>> +   https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
>> +.. [2] Intel Trust Domain Extensions (Intel TDX)
>> +   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
>> +.. [3] Intel CPU Architectural Extensions Specification
>> +   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
>> +.. [4] Intel TDX Module 1.0 EAS
>> +   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
>> +.. [5] Intel TDX Loader Interface Specification
>> +   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
>> +.. [6] Intel TDX Guest-Hypervisor Communication Interface
>> +   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
>> +.. [7] Intel TDX Virtual Firmware Design Guide
>> +   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
>
> As said above, I think we can just provide a link which contains all --
> whitepaper, specs, and other things like source code.
>
>> +.. [8] intel public github
>> +
>> + * kvm TDX branch: https://github.com/intel/tdx/tree/kvm
>> + * TDX guest branch: https://github.com/intel/tdx/tree/guest
>
> I don't think this should be included.
>
>> +
>> +.. [9] tdvf
>> + https://github.com/tianocore/edk2-staging/tree/TDVF
>> +.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
>> +   Enable Hardware Isolated VMs
>> +   https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
>> +.. [11] Linux Security Summit EU 2020:
>> +   Architectural Extensions for Hardware Virtual Machine Isolation
>> +   to Advance Confidential Computing in Public Clouds - Ravi Sahita
>> +   & Jun Nakajima, Intel Corporation
>> +   https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
>
> [...]
>
>> +.. [12] [RFCv2,00/16] KVM protected memory extension
>> +   https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com/
>
> Why put this RFC here?
>
^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 18/18] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)
2025-02-20 5:50 ` Xiaoyao Li
@ 2025-02-20 23:45 ` Huang, Kai
0 siblings, 0 replies; 26+ messages in thread
From: Huang, Kai @ 2025-02-20 23:45 UTC (permalink / raw)
To: Li, Xiaoyao, kvm@vger.kernel.org, pbonzini@redhat.com,
seanjc@google.com, binbin.wu@linux.intel.com
Cc: Gao, Chao, Edgecombe, Rick P, Chatre, Reinette,
linux-kernel@vger.kernel.org, Zhao, Yan Y, Hunter, Adrian,
tony.lindgren@linux.intel.com, Yamahata, Isaku
On Thu, 2025-02-20 at 13:50 +0800, Xiaoyao Li wrote:
> On 2/19/2025 6:23 PM, Huang, Kai wrote:
> >
> >
> > On 10/12/2024 1:49 pm, Binbin Wu wrote:
> ...
> > > +
> > > +
> > > +API description
> > > +===============
> > > +
> > > +KVM_MEMORY_ENCRYPT_OP
> > > +---------------------
> > > +:Type: vm ioctl, vcpu ioctl
> > > +
> > > +For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be generic
> > > +ioctl with TDX specific sub ioctl command.
> >
> > command -> commands.
> >
> > > +
> > > +::
> > > +
> > > + /* Trust Domain eXtension sub-ioctl() commands. */
> >
> > I think "Extensions" is used in every place in the kernel, so
> > "eXtension" -> "Extensions".
> >
> > And lack of consistency between "sub ioctl commands" and "sub-ioctl()
> > commands". Perhaps just use "sub-commands" for all the places.
>
> It's copied from the kernel header file; we need to fix it there first.
I see. So let's keep the code and the doc synced.
I'll leave it to Rick to decide whether we should fix it in the code.
>
> > > + enum kvm_tdx_cmd_id {
> > > + KVM_TDX_CAPABILITIES = 0,
> > > + KVM_TDX_INIT_VM,
> > > + KVM_TDX_INIT_VCPU,
> > > + KVM_TDX_INIT_MEM_REGION,
> > > + KVM_TDX_FINALIZE_VM,
> > > + KVM_TDX_GET_CPUID,
> > > +
> > > + KVM_TDX_CMD_NR_MAX,
> > > + };
> > > +
> > > + struct kvm_tdx_cmd {
> > > + /* enum kvm_tdx_cmd_id */
> > > + __u32 id;
> > > + /* flags for sub-commend. If sub-command doesn't use this,
> > > set zero. */
> >
> > commend -> command.
>
> Ditto.
Ditto.
>
> > > + __u32 flags;
> > > + /*
> > > + * data for each sub-command. An immediate or a pointer to
> > > the actual
> > > + * data in process virtual address. If sub-command doesn't
> > > use it,
> > > + * set zero.
> > > + */
> > > + __u64 data;
> > > + /*
> > > + * Auxiliary error code. The sub-command may return TDX
> > > SEAMCALL
> > > + * status code in addition to -Exxx.
> > > + * Defined for consistency with struct kvm_sev_cmd.
> > > + */
> >
> > "Defined for consistency with struct kvm_sev_cmd" got removed in the
> > code. It should be removed here too.
This can be done since I see it got removed in the code.
> >
> > > + __u64 hw_error;
> > > + };
> > > +
> > > +KVM_TDX_CAPABILITIES
> > > +--------------------
> > > +:Type: vm ioctl
> > > +
> > > +Subset of TDSYSINFO_STRUCT retrieved by TDH.SYS.INFO TDX SEAM call
> > > will be
> > > +returned. It describes the Intel TDX module.
> >
> > We are not using TDH.SYS.INFO and TDSYSINFO_STRUCT anymore. Perhaps:
> >
> > Retrive TDX moduel global capabilities for running TDX guests.
>
> (some typos: Retrive => Retrieve, moduel => module)
>
> It's not TDX module global capabilities. It has been adjusted by KVM to
> mask off the bits that are not supported by KVM.
>
> How about:
>
> Return the TDX capabilities that the current KVM supports with the
> specific TDX module loaded in the system. It reports what
> features/capabilities are allowed to be configured for the TDX guest.
Fine to me. Thanks.
>
> > > +
> > > +- id: KVM_TDX_CAPABILITIES
> > > +- flags: must be 0
> > > +- data: pointer to struct kvm_tdx_capabilities
> > > +- error: must be 0
> > > +- unused: must be 0
> >
> > @error should be @hw_error.
> >
> > And I don't see @unused anywhere in the 'struct kvm_tdx_cmd'. It should
> > be removed.
> >
> > The same to all below sub-commands.
> >
> > > +
> > > +::
> > > +
> > > + struct kvm_tdx_capabilities {
> > > + __u64 supported_attrs;
> > > + __u64 supported_xfam;
> > > + __u64 reserved[254];
> > > + struct kvm_cpuid2 cpuid;
> > > + };
> > > +
> > > +
> > > +KVM_TDX_INIT_VM
> > > +---------------
> > > +:Type: vm ioctl
> >
> > I would add description for return values:
> >
> > :Returns: 0 on success, <0 on error.
>
> +1,
>
> I have added it for KVM_TDX_GET_CPUID internally.
Let's add it for all sub-commands then.
>
> > We can also repeat this in all sub-commands, but I am not sure it's
> > necessary.
> >
> > > +
> > > +Does additional VM initialization specific to TDX which corresponds to
> > > +TDH.MNG.INIT TDX SEAM call.
> >
> > "SEAM call" -> "SEAMCALL" for consistency. And the same to below all
> > sub-commands.
> >
> > Nit:
> >
> > I am not sure whether we need to, or should, mention the detailed
> > SEAMCALL here. To me the ABI doesn't need to document the detailed
> > implementation (i.e., which SEAMCALL is used) in the underneath kernel.
>
> Agreed to drop mentioning the detailed SEAMCALL.
>
> Maybe, we can mention the sequence requirement as well, that it needs to
> be called after KVM_CREATE_VM and before creating any VCPUs.
Yeah agreed.
Then perhaps we can drop the "KVM TDX creation flow" section entirely to avoid
the potential debate.
>
> > > +
> > > +- id: KVM_TDX_INIT_VM
> > > +- flags: must be 0
> > > +- data: pointer to struct kvm_tdx_init_vm
> > > +- error: must be 0
> > > +- unused: must be 0
> > > +
> > > +::
> > > +
> > > + struct kvm_tdx_init_vm {
> > > + __u64 attributes;
> > > + __u64 xfam;
> > > + __u64 mrconfigid[6]; /* sha384 digest */
> > > + __u64 mrowner[6]; /* sha384 digest */
> > > + __u64 mrownerconfig[6]; /* sha384 digest */
> > > +
> > > + /* The total space for TD_PARAMS before the CPUIDs is 256
> > > bytes */
> > > + __u64 reserved[12];
> > > +
> > > + /*
> > > + * Call KVM_TDX_INIT_VM before vcpu creation, thus before
> > > + * KVM_SET_CPUID2.
> > > + * This configuration supersedes KVM_SET_CPUID2s for VCPUs
> > > because the
> > > + * TDX module directly virtualizes those CPUIDs without VMM.
> > > The user
> > > + * space VMM, e.g. qemu, should make KVM_SET_CPUID2
> > > consistent with
> > > + * those values. If it doesn't, KVM may have wrong idea of
> > > vCPUIDs of
> > > + * the guest, and KVM may wrongly emulate CPUIDs or MSRs that
> > > the TDX
> > > + * module doesn't virtualize.
> > > + */
> > > + struct kvm_cpuid2 cpuid;
> > > + };
> > > +
> > > +
> > > +KVM_TDX_INIT_VCPU
> > > +-----------------
> > > +:Type: vcpu ioctl
> > > +
> > > +Does additional VCPU initialization specific to TDX which corresponds to
> > > +TDH.VP.INIT TDX SEAM call.
>
> Same for the TDH.VP.INIT SEAMCALL, I don't think we need to mention it.
> And KVM_TDX_INIT_VCPU does more things other than TDH.VP.INIT.
>
> How about just:
>
> Perform TDX specific VCPU initialization
Fine to me.
>
> > > +- id: KVM_TDX_INIT_VCPU
> > > +- flags: must be 0
> > > +- data: initial value of the guest TD VCPU RCX
> > > +- error: must be 0
> > > +- unused: must be 0
> > > +
> > > +KVM_TDX_INIT_MEM_REGION
> > > +-----------------------
> > > +:Type: vcpu ioctl
> > > +
> > > +Encrypt a memory continuous region which corresponding to
> > > TDH.MEM.PAGE.ADD
> > > +TDX SEAM call.
> >
> > "a contiguous guest memory region"?
> >
> > And "which corresponding to .." has grammar issue.
> >
> > How about:
> >
> > Load and encrypt a contiguous memory region from the source
> > memory which corresponds to the TDH.MEM.PAGE.ADD TDX SEAMCALL.
>
> As suggested above, I prefer to drop mentioning the detailed SEAMCALL.
>
> Besides, it's better to call out the dependency that the gpa range needs
> to have the private attribute before calling KVM_TDX_INIT_MEM_REGION to
> initialize the private memory.
>
> How about:
>
> Initialize @nr_pages TDX guest private memory starting from @gpa with
> userspace provided data from @source_addr.
>
> Note, before calling this sub-command, the memory attribute of the range
> [gpa, gpa + nr_pages] needs to be private. Userspace can use
> KVM_SET_MEMORY_ATTRIBUTES to set the attribute.
Fine to me.
>
>
> > > +If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends
> > > measurement
> > > +which corresponds to TDH.MR.EXTEND TDX SEAM call.
> > > +
> > > +- id: KVM_TDX_INIT_MEM_REGION
> > > +- flags: flags
> > > + currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
> > > +- data: pointer to struct kvm_tdx_init_mem_region
> > > +- error: must be 0
> > > +- unused: must be 0
> > > +
> > > +::
> > > +
> > > + #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
> > > +
> > > + struct kvm_tdx_init_mem_region {
> > > + __u64 source_addr;
> > > + __u64 gpa;
> > > + __u64 nr_pages;
> > > + };
> > > +
> > > +
> > > +KVM_TDX_FINALIZE_VM
> > > +-------------------
> > > +:Type: vm ioctl
> > > +
> > > +Complete measurement of the initial TD contents and mark it ready to run
> > > +which corresponds to TDH.MR.FINALIZE
> >
> > Missing period at the end of the sentence.
> >
> > And nit again: I don't like the "which corresponds to TDH.MR.FINALIZE".
> >
> > > +
> > > +- id: KVM_TDX_FINALIZE_VM
> > > +- flags: must be 0
> > > +- data: must be 0
> > > +- error: must be 0
> > > +- unused: must be 0
> >
> >
> > This patch doesn't contain KVM_TDX_GET_CPUID. I saw in internal dev
> > branch we have it.
> >
>
> I added it in the internal branch. Next version post should have it.
Great. Perhaps you can send out a fixup patch internally.
>
> > > +
> > > +KVM TDX creation flow
> > > +=====================
> > > +In addition to KVM normal flow, new TDX ioctls need to be called.
> > > The control flow
> > > +looks like as follows.
> > > +
> > > +#. system wide capability check
> >
> > To make all consistent:
> >
> > Check system wide capability
> >
> > > +
> > > + * KVM_CAP_VM_TYPES: check if VM type is supported and if
> > > KVM_X86_TDX_VM
> > > + is supported.
> > > +
> > > +#. creating VM
> >
> > Create VM
> > > +
> > > + * KVM_CREATE_VM
> > > + * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
> >
> > "TDX is supported or not" is already checked in step 1.
> >
> > I think we should say:
> >
> > query TDX global capabilities for creating TDX guests.
>
> I don't like the "global" term and I don't think it's correct.
>
> KVM_TDX_CAPABILITIES was requested to change from the platform-scope
> ioctl to VM-scope. It should only report the per-TD capabilities, though
> it's identical for all TDs currently.
Oh, I didn't know this. Thanks for pointing it out. Could you suggest the
preferred wording?
(Btw, as mentioned above, if we document the calling sequence of the commands in
each command, then perhaps we can get rid of this entire "KVM TDX creation flow"
section, in which case we won't need to fix those up.)
>
> > > + * KVM_ENABLE_CAP_VM(KVM_CAP_MAX_VCPUS): set max_vcpus.
> > > KVM_MAX_VCPUS by
> > > + default. KVM_MAX_VCPUS is not a part of ABI, but kernel
> > > internal constant
> > > + that is subject to change. Because max vcpus is a part of
> > > attestation, max
> > > + vcpus should be explicitly set.
> >
> > This is out-of-date.
> >
> > * KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS): query maximum vcpus the
> > TDX guest can support (TDX has its own limitation on this).
>
> More precisely, KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS) at VM level
Sure:
KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS): Query the maximum number of vCPUs the
TD can support at VM level (TDX has its own limitation on this).
>
> > > + * KVM_SET_TSC_KHZ for vm. optional
> >
> > For consistency:
> >
> > * KVM_SET_TSC_KHZ: optional
>
> More precisely, KVM_SET_TSC_KHZ at VM-level if userspace desires a
> different TSC frequency than the host. Otherwise, host's TSC frequency
> will be configured for the TD.
Sure:
KVM_SET_TSC_KHZ: Configure the TD's TSC frequency if a TSC frequency different
from the host's is desired. This is optional.
>
> >
> > > + * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
> > > +
> > > +#. creating VCPU
> >
> > Create vCPUs
> >
> > > +
> > > + * KVM_CREATE_VCPU
> > > + * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
> > > + * KVM_SET_CPUID2: Enable CPUID[0x1].ECX.X2APIC(bit 21)=1 so that
> > > the following
> > > + setting of MSR_IA32_APIC_BASE success. Without this,
> > > + KVM_SET_MSRS(MSR_IA32_APIC_BASE) fails.
> >
> > I would prefer to put X2APIC specific to a note:
> >
> > * KVM_SET_CPUID2: configure guest's CPUIDs. Note: Enable ...
> >
> > > + * KVM_SET_MSRS: Set the initial reset value of MSR_IA32_APIC_BASE to
> > > + APIC_DEFAULT_ADDRESS(0xfee00000) | XAPIC_ENABLE(bit 10) |
> > > + X2APIC_ENABLE(bit 11) [| MSR_IA32_APICBASE_BSP(bit 8) optional]
>
> This is not true now. MSR_IA32_APICBASE is read-only MSR for TDX. KVM
> disallows userspace to set MSR_IA32_APICBASE and
> ioctl(KVM_TDX_INIT_VCPU) initialize a correct value for
> MSR_IA32_APICBASE internally.
Perhaps we can just say:
Configure TD's CPUIDs.
>
> > Ditto, I believe there are other MSRs to be set too.
>
> It seems no must-to-be-set MSR for TDX guest.
We can just say: Configure TD's MSRs.
>
> > > +
> > > +#. initializing guest memory
> >
> > Initialize initial guest memory
> >
> > > +
> > > + * allocate guest memory and initialize page same to normal KVM case
> >
> > Cannot parse this.
> >
> > > + In TDX case, parse and load TDVF into guest memory in addition.
> >
> > Don't understand "parse TDVF" either.
> >
> > > + * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
> > > + If the pages has contents above, those pages need to be added.
> > > + Otherwise the contents will be lost and guest sees zero pages.
> > > + * KVM_TDX_FINALIAZE_VM: Finalize VM and measurement
> > > + This must be after KVM_TDX_INIT_MEM_REGION.
> >
> > Perhaps refine the above to:
> >
> >
> > * Allocate guest memory in the same way as allocating memory for
> > normal VMs.
> > * KVM_TDX_INIT_MEM_REGION to add initial guest memory. Note for
> > now TDX guests only works with TDVF, thus the TDVF needs to be
> > included in the initial guest memory.
> > * KVM_TDX_FINALIZE_VM: Finalize the measurement of the TDX guest.
>
> I'm not sure if we need to mention TDVF here. I think KVM doesn't care
> about it. People can always create a new virtual bios for TDX guest
> themselves.
Fine to remove it.
* Re: [PATCH 00/18] KVM: TDX: TDX "the rest" part
2024-12-10 0:49 [PATCH 00/18] KVM: TDX: TDX "the rest" part Binbin Wu
` (17 preceding siblings ...)
2024-12-10 0:49 ` [PATCH 18/18] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX) Binbin Wu
@ 2024-12-10 18:25 ` Paolo Bonzini
2024-12-11 1:31 ` Binbin Wu
18 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2024-12-10 18:25 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
Applied to kvm-coco-queue, thanks. For now I used v1 of "TDX vCPU
enter/exit" as it was posted, but I will check out the review comments
later.
Paolo
* Re: [PATCH 00/18] KVM: TDX: TDX "the rest" part
2024-12-10 18:25 ` [PATCH 00/18] KVM: TDX: TDX "the rest" part Paolo Bonzini
@ 2024-12-11 1:31 ` Binbin Wu
0 siblings, 0 replies; 26+ messages in thread
From: Binbin Wu @ 2024-12-11 1:31 UTC (permalink / raw)
To: Paolo Bonzini
Cc: seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On 12/11/2024 2:25 AM, Paolo Bonzini wrote:
> Applied to kvm-coco-queue, thanks. For now I used v1 of "TDX vCPU
> enter/exit" as it was posted, but I will check out the review comments
> later.
>
> Paolo
Hi Paolo,
The following two fixup patches to v1 of "TDX vCPU enter/exit" are related
to the later sections.
One is https://github.com/intel/tdx/commit/22b7001fbb58771bf133a64e1b22fb9e47d8a11f
, make tdx_vcpu_enter_exit() noinstr based on the discussion:
https://lore.kernel.org/kvm/Z0SVf8bqGej_-7Sj@google.com/
The other is https://github.com/intel/tdx/commit/13828e0b586eed6618ccdef9e4f58b09358564d2
, move the check of VCPU_TD_STATE_INITIALIZED from tdx_vcpu_run() to
tdx_vcpu_pre_run() based on the discussion:
https://lore.kernel.org/kvm/837bbbc7-e7f3-4362-a745-310fe369f43d@intel.com/
So the check for VCPU_TD_STATE_INITIALIZED in tdx_handle_exit() is dropped in
"TDX hypercalls may exit to userspace"