kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
@ 2024-11-21 20:14 Adrian Hunter
  2024-11-21 20:14 ` [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest Adrian Hunter
                   ` (8 more replies)
  0 siblings, 9 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-21 20:14 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, x86, yan.y.zhao, chao.gao,
	weijiang.yang

Hi

This patch series introduces callbacks to facilitate the entry of a TD VCPU
and the corresponding save/restore of host state.

There are some outstanding things still to do (see below), so we expect to
post future revisions of this patch set, but please do review the current
patches so that they can be made ready for hand off to Paolo.  Also
direction is needed for "x86/virt/tdx: Add SEAMCALL wrapper to enter/exit
TDX guest" because it will affect KVM.

This patch set is one of several patch sets that are all needed to provide
the ability to run a functioning TD VM.  They have been split from the
"marker" sections of patch set "[PATCH v19 000/130] KVM TDX basic feature
support":

  https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/

The recent patch sets are:

  TDX host: metadata reading
  TDX vCPU/VM creation
  TDX KVM MMU part 2
  TD vcpu enter/exit			<- this one
  TD vcpu exits/interrupts/hypercalls   <- still to come

Notably, a later patch sets deal with VCPU exits, interrupts and
hypercalls.

For x86 maintainers

This series has 1 commit that is an RFC that needs input from x86
maintainers:

  x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest

This is because wrapping TDH.VP.ENTER means dealing with multiple input and
output formats for the data in argument registers. We would like
maintainers to comment on the discussion that we will have on it.

Overview

A TD VCPU is entered via the SEAMCALL TDH.VP.ENTER. The TDX Module manages
the save/restore of guest state and, in conjunction with the SEAMCALL
interface, handles certain aspects of host state. However, there are
specific elements of the host state that require additional attention, as
detailed in the Intel TDX ABI documentation for TDH.VP.ENTER.

TDX is quite different from VMX in this regard.  For VMX, the host VMM is
heavily involved in restoring, managing and saving guest CPU state, whereas
for TDX this is handled by the TDX Module.  In that way, the TDX Module can
protect the confidentiality and integrity of TD CPU state.

The TDX Module does not save/restore all host CPU state because the host
VMM can do it more efficiently and selectively.  CPU state referred to
below is host CPU state.  Often values are already held in memory so no
explicit save is needed, and restoration may not be needed if the kernel
is not using a feature.

Outstanding things still to do:

  - how to wrap TDH.VP.ENTER SEAMCALL, refer to patch "x86/virt/tdx:
    Add SEAMCALL wrapper to enter/exit TDX guest"
  - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
    and guest_state_exit_irqoff() which comments say should be
    called from non-instrumentable code but noinst was removed
    at Sean's suggestion:
  	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
    noinstr is also needed to retain NMI-blocking by avoiding
    instrumented code that leads to an IRET which unblocks NMIs.
    A later patch set will deal with NMI VM-exits.
  - disallow TDX guest to use Intel PT
  	I think Tony will fix tdx_get_supported_xfam()
  - disallow PERFMON (TD attribute bit 63)
  - save/restore MSR IA32_UMWAIT_CONTROL or disallow guest
    CPUID(7,0).ECX.WAITPKG[5]
  - save/restore IA32_DEBUGCTL
  	VMX does:
  		vmx_vcpu_load() -> get_debugctlmsr()
  		vmx_vcpu_run() -> update_debugctlmsr()
  	TDX Module only preserves bits 1, 12 and 14

Key Details

Argument Passing: Similar to other SEAMCALLs, TDH.VP.ENTER passes
arguments through General Purpose Registers (GPRs). For the special case
of the TD guest invoking TDG.VP.VMCALL, nearly any GPR can be used,
as well as XMM0 to XMM15. Notably, RBP is not used, and Linux mandates the
TDX Module feature NO_RBP_MOD, which is enforced elsewhere. Additionally,
XMM registers are not required for the existing Guest Hypervisor
Communication Interface and are handled by existing KVM code should they be
modified by the guest.

Debug Register Handling: After TDH.VP.ENTER returns, registers DR0, DR1,
DR2, DR3, DR6, and DR7 are set to their architectural INIT values. Existing
KVM code already handles the restoration of host values as needed, refer
vcpu_enter_guest() which calls hw_breakpoint_restore().

MSR Restoration: Certain Model-Specific Registers (MSRs) need to be
restored post TDH.VP.ENTER. The Intel TDX ABI documentation provides a
detailed list in the msr_preservation.json file. Most MSRs do not require
restoration if the guest is not utilizing the corresponding feature. The
following features are currently assumed to be unsupported, and their MSRs
are not restored:
  PERFMON            (TD ATTRIBUTES[63])
  LBRs               (XFAM[15])
  User Interrupts    (XFAM[14])
  Intel PT           (XFAM[8])
The one feature that is supported:
  CET                (XFAM[11-12])	is restored via kvm_put_guest_fpu()

Other host MSR/Register Handling:

MSR IA32_XFD is already restored by KVM, refer to kvm_put_guest_fpu().
The TDX Module sets MSR IA32_XFD_ERR to its RESET value (0) which is
fine for the kernel.

MSR IA32_DEBUGCTL appears to have been overlooked.  According to
msr_preservation.json, the TDX Module preserves only bits 1, 12 and 14.
For VMX there is code to save and restore in vmx_vcpu_load() and
vmx_vcpu_run() respectively, but TDX does not use those functions.

MSR IA32_UARCH_MISC_CTL is not utilized by the kernel, so it is fine if
the TDX Module sets it to it's RESET value.

MSR IA32_KERNEL_GS_BASE is addressed in patch "KVM: TDX: vcpu_run:
save/restore host state (host kernel gs)".

MSRs IA32_XSS and XCRO are handled in patch "KVM: TDX: restore host xsave
state when exiting from the guest TD".

MSRs IA32_STAR, IA32_LSTAR, IA32_FMASK, and IA32_TSC_AUX are handled in
patch "KVM: TDX: restore user ret MSRs".

MSR IA32_TSX_CTRL is handled in patch "KVM: TDX: Add TSX_CTRL msr into
uret_msrs list".

MSR IA32_UMWAIT_CONTROL appears to have been overlooked.  The host value
needs to be restored if guest CPUID(7,0).ECX.WAITPKG[5] is 1, otherwise
that guest CPUID value needs to be disallowed.

Additional Notes

The patch "KVM: TDX: Implement TDX vcpu enter/exit path" highlights that
TDX does not support "PAUSE-loop exiting".  According to the TDX Module
Base arch. spec., hypercalls are expected to be used instead.  Note that
the Linux TDX guest supports existing hypercalls via TDG.VP.VMCALL.

Base

This series is based off of a kvm-coco-queue commit and some pre-req
series:
1. commit ee69eb746754 ("KVM: x86/mmu: Prevent aliased memslot GFNs") (in
   kvm-coco-queue).
2. v7 of "TDX host: metadata reading tweaks, bug fix and info dump" [1].
3. v1 of "KVM: VMX: Initialize TDX when loading KVM module" [2], with some
   new feedback from Sean.
4. v2 of “TDX vCPU/VM creation” [3]
5. v2 of "TDX KVM MMU part 2" [4]

It requires TDX module 1.5.06.00.0744[5], or later. This is due to removal
of the workarounds for the lack of the NO_RBP_MOD feature required by the
kernel. Now NO_RBP_MOD is enabled (in VM/vCPU creation patches), and this
particular version of the TDX module has a required NO_RBP_MOD related bug
fix.
A working edk2 commit is 95d8a1c ("UnitTestFrameworkPkg: Use TianoCore
mirror of subhook submodule").

Testing

The series has been tested as part of the development branch for the TDX
base series. The testing consisted of TDX kvm-unit-tests and booting a
Linux TD, and TDX enhanced KVM selftests.

The full KVM branch is here:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20

Matching QEMU:
https://github.com/intel-staging/qemu-tdx/commits/tdx-qemu-upstream-v6.1/

[0] https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/kvm/cover.1731318868.git.kai.huang@intel.com/#t
[2] https://lore.kernel.org/kvm/cover.1730120881.git.kai.huang@intel.com/
[3] https://lore.kernel.org/kvm/20241030190039.77971-1-rick.p.edgecombe@intel.com/
[4] https://lore.kernel.org/kvm/20241112073327.21979-1-yan.y.zhao@intel.com/
[5] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06

Chao Gao (1):
      KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
      wrmsr

Isaku Yamahata (4):
      KVM: TDX: Implement TDX vcpu enter/exit path
      KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
      KVM: TDX: restore host xsave state when exit from the guest TD
      KVM: TDX: restore user ret MSRs

Kai Huang (1):
      x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest

Yang Weijiang (1):
      KVM: TDX: Add TSX_CTRL msr into uret_msrs list

 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/include/asm/tdx.h      |   1 +
 arch/x86/kvm/vmx/main.c         |  45 ++++++++-
 arch/x86/kvm/vmx/tdx.c          | 212 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h          |  14 +++
 arch/x86/kvm/vmx/x86_ops.h      |   9 ++
 arch/x86/kvm/x86.c              |  24 ++++-
 arch/x86/virt/vmx/tdx/tdx.c     |   8 ++
 arch/x86/virt/vmx/tdx/tdx.h     |   1 +
 9 files changed, 306 insertions(+), 9 deletions(-)

Regards
Adrian

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
@ 2024-11-21 20:14 ` Adrian Hunter
  2024-11-22 11:10   ` Adrian Hunter
  2024-11-22 16:26   ` Dave Hansen
  2024-11-21 20:14 ` [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path Adrian Hunter
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-21 20:14 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, x86, yan.y.zhao, chao.gao,
	weijiang.yang

From: Kai Huang <kai.huang@intel.com>

Intel TDX protects guest VM's from malicious host and certain physical
attacks.  TDX introduces a new operation mode, Secure Arbitration Mode
(SEAM) to isolate and protect guest VM's.  A TDX guest VM runs in SEAM and,
unlike VMX, direct control and interaction with the guest by the host VMM
is not possible.  Instead, Intel TDX Module, which also runs in SEAM,
provides a SEAMCALL API.

The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER.
The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX
VMLAUNCH/VMRESUME instructions.  When a guest VM-exit requires host VMM
interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).

Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER.

TDH.VP.ENTER is different from other SEAMCALLS in several ways:
 - it may take some time to return as the guest executes
 - it uses more arguments
 - after it returns some host state may need to be restored

TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs).
For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR
can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux
mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere.
Additionally, XMM registers are not required for the existing Guest
Hypervisor Communication Interface and are handled by existing KVM code
should they be modified by the guest.

There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments.
Input #1 : Initial entry or following a previous async. TD Exit
Input #2 : Following a previous TDCALL(TDG.VP.VMCALL)
Output #1 : On Error (No TD Entry)
Output #2 : Async. Exits with a VMX Architectural Exit Reason
Output #3 : Async. Exits with a non-VMX TD Exit Status
Output #4 : Async. Exits with Cross-TD Exit Details
Output #5 : On TDCALL(TDG.VP.VMCALL)

Currently, to keep things simple, the wrapper function does not attempt
to support different formats, and just passes all the GPRs that could be
used.  The GPR values are held by KVM in the area set aside for guest
GPRs.  KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for
or process results of tdh_vp_enter().

Therefore changing tdh_vp_enter() to use more complex argument formats
would also alter the way KVM code interacts with tdh_vp_enter().

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/include/asm/tdx.h  | 1 +
 arch/x86/virt/vmx/tdx/tdx.c | 8 ++++++++
 arch/x86/virt/vmx/tdx/tdx.h | 1 +
 3 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index fdc81799171e..77477b905dca 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -123,6 +123,7 @@ int tdx_guest_keyid_alloc(void);
 void tdx_guest_keyid_free(unsigned int keyid);
 
 /* SEAMCALL wrappers for creating/destroying/running TDX guests */
+u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args);
 u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
 u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx);
 u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 04cb2f1d6deb..2a8997eb1ef1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1600,6 +1600,14 @@ static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
 	return ret;
 }
 
+u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
+{
+	args->rcx = tdvpr;
+
+	return __seamcall_saved_ret(TDH_VP_ENTER, args);
+}
+EXPORT_SYMBOL_GPL(tdh_vp_enter);
+
 u64 tdh_mng_addcx(u64 tdr, u64 tdcs)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 4919d00025c9..58d5754dcb4d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -17,6 +17,7 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_VP_ENTER			0
 #define TDH_MNG_ADDCX			1
 #define TDH_MEM_PAGE_ADD		2
 #define TDH_MEM_SEPT_ADD		3
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
  2024-11-21 20:14 ` [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest Adrian Hunter
@ 2024-11-21 20:14 ` Adrian Hunter
  2024-11-22  5:23   ` Xiaoyao Li
  2024-11-21 20:14 ` [PATCH 3/7] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) Adrian Hunter
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-21 20:14 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, x86, yan.y.zhao, chao.gao,
	weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

This patch implements running TDX vcpu.  Once vcpu runs on the logical
processor (LP), the TDX vcpu is associated with it.  When the TDX vcpu
moves to another LP, the TDX vcpu needs to flush its status on the LP.
When destroying TDX vcpu, it needs to complete flush, and flush cpu memory
cache.  Track which LP the TDX vcpu run and flush it as necessary.

Compared to VMX, do nothing on sched_in event as TDX doesn't support pause
loop.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v1:
- Make argument of tdx_vcpu_enter_exit() struct kvm_vcpu.
- Update for the wrapper functions for SEAMCALLs. (Sean)
- Remove noinstr (Sean)
- Add a missing comma, clarify sched_in part, and update changelog to
  match code by dropping the PMU related paragraph (Binbin)
  https://lore.kernel.org/lkml/c0029d4d-3dee-4f11-a929-d64d2651bfb3@linux.intel.com/
- Remove the union tdx_exit_reason. (Sean)
  https://lore.kernel.org/kvm/ZfSExlemFMKjBtZb@google.com/
- Remove the code of special handling of vcpu->kvm->vm_bugged (Rick)
  https://lore.kernel.org/kvm/20240318234010.GD1645738@ls.amr.corp.intel.com/
- For !tdx->initialized case, set tdx->vp_enter_ret to TDX_SW_ERROR to avoid
  collision with EXIT_REASON_EXCEPTION_NMI.

v19:
- Removed export_symbol_gpl(host_xcr0) to the patch that uses it

Changes v15 -> v16:
- use __seamcall_saved_ret()
- As struct tdx_module_args doesn't match with vcpu.arch.regs, copy regs
  before/after calling __seamcall_saved_ret().
---
 arch/x86/kvm/vmx/main.c    | 21 ++++++++++-
 arch/x86/kvm/vmx/tdx.c     | 76 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     |  2 +
 arch/x86/kvm/vmx/x86_ops.h |  5 +++
 4 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index bfed421e6fbb..44ec6005a448 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -129,6 +129,23 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vmx_vcpu_load(vcpu, cpu);
 }
 
+static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		/* Unconditionally continue to vcpu_run(). */
+		return 1;
+
+	return vmx_vcpu_pre_run(vcpu);
+}
+
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_run(vcpu, force_immediate_exit);
+
+	return vmx_vcpu_run(vcpu, force_immediate_exit);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu)) {
@@ -267,8 +284,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.flush_tlb_gva = vt_flush_tlb_gva,
 	.flush_tlb_guest = vt_flush_tlb_guest,
 
-	.vcpu_pre_run = vmx_vcpu_pre_run,
-	.vcpu_run = vmx_vcpu_run,
+	.vcpu_pre_run = vt_vcpu_pre_run,
+	.vcpu_run = vt_vcpu_run,
 	.handle_exit = vmx_handle_exit,
 	.skip_emulated_instruction = vmx_skip_emulated_instruction,
 	.update_emulated_instruction = vmx_update_emulated_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index dc6c5f40608e..5fa5b65b9588 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -10,6 +10,9 @@
 #include "mmu/spte.h"
 #include "common.h"
 
+#include <trace/events/kvm.h>
+#include "trace.h"
+
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
@@ -662,6 +665,79 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 }
 
 
+static void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct tdx_module_args args;
+
+	guest_state_enter_irqoff();
+
+	/*
+	 * TODO: optimization:
+	 * - Eliminate copy between args and vcpu->arch.regs.
+	 * - copyin/copyout registers only if (tdx->tdvmvall.regs_mask != 0)
+	 *   which means TDG.VP.VMCALL.
+	 */
+	args = (struct tdx_module_args) {
+		.rcx = tdx->tdvpr_pa,
+#define REG(reg, REG)	.reg = vcpu->arch.regs[VCPU_REGS_ ## REG]
+		REG(rdx, RDX),
+		REG(r8,  R8),
+		REG(r9,  R9),
+		REG(r10, R10),
+		REG(r11, R11),
+		REG(r12, R12),
+		REG(r13, R13),
+		REG(r14, R14),
+		REG(r15, R15),
+		REG(rbx, RBX),
+		REG(rdi, RDI),
+		REG(rsi, RSI),
+#undef REG
+	};
+
+	tdx->vp_enter_ret = tdh_vp_enter(tdx->tdvpr_pa, &args);
+
+#define REG(reg, REG)	vcpu->arch.regs[VCPU_REGS_ ## REG] = args.reg
+	REG(rcx, RCX);
+	REG(rdx, RDX);
+	REG(r8,  R8);
+	REG(r9,  R9);
+	REG(r10, R10);
+	REG(r11, R11);
+	REG(r12, R12);
+	REG(r13, R13);
+	REG(r14, R14);
+	REG(r15, R15);
+	REG(rbx, RBX);
+	REG(rdi, RDI);
+	REG(rsi, RSI);
+#undef REG
+
+	guest_state_exit_irqoff();
+}
+
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	/* TDX exit handle takes care of this error case. */
+	if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
+		/* Set to avoid collision with EXIT_REASON_EXCEPTION_NMI. */
+		tdx->vp_enter_ret = TDX_SW_ERROR;
+		return EXIT_FASTPATH_NONE;
+	}
+
+	trace_kvm_entry(vcpu, force_immediate_exit);
+
+	tdx_vcpu_enter_exit(vcpu);
+
+	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
+	trace_kvm_exit(vcpu, KVM_ISA_VMX);
+
+	return EXIT_FASTPATH_NONE;
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	u64 shared_bit = (pgd_level == 5) ? TDX_SHARED_BIT_PWL_5 :
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 899654519df6..ebee1049b08b 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -51,6 +51,8 @@ struct vcpu_tdx {
 
 	struct list_head cpu_list;
 
+	u64 vp_enter_ret;
+
 	enum vcpu_tdx_state state;
 };
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 06583b1afa4f..3d292a677b92 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -129,6 +129,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
@@ -156,6 +157,10 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
+static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+{
+	return EXIT_FASTPATH_NONE;
+}
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 3/7] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
  2024-11-21 20:14 ` [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest Adrian Hunter
  2024-11-21 20:14 ` [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path Adrian Hunter
@ 2024-11-21 20:14 ` Adrian Hunter
  2024-11-25 14:12   ` Nikolay Borisov
  2024-11-21 20:14 ` [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-21 20:14 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, x86, yan.y.zhao, chao.gao,
	weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

On entering/exiting TDX vcpu, preserved or clobbered CPU state is different
from the VMX case. Add TDX hooks to save/restore host/guest CPU state.
Save/restore kernel GS base MSR.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TD vcpu enter/exit v1:
 - Clarify comment (Binbin)
 - Use lower case preserved and add the for VMX in log (Tony)
 - Fix bisectability issue with includes (Kai)
---
 arch/x86/kvm/vmx/main.c    | 24 ++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 46 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     |  4 ++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 4 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 44ec6005a448..3a8ffc199be2 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -129,6 +129,26 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vmx_vcpu_load(vcpu, cpu);
 }
 
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_prepare_switch_to_guest(vcpu);
+		return;
+	}
+
+	vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_put(vcpu);
+		return;
+	}
+
+	vmx_vcpu_put(vcpu);
+}
+
 static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -250,9 +270,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_free = vt_vcpu_free,
 	.vcpu_reset = vt_vcpu_reset,
 
-	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+	.prepare_switch_to_guest = vt_prepare_switch_to_guest,
 	.vcpu_load = vt_vcpu_load,
-	.vcpu_put = vmx_vcpu_put,
+	.vcpu_put = vt_vcpu_put,
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
 	.get_feature_msr = vmx_get_feature_msr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5fa5b65b9588..6e4ea2d420bc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/cleanup.h>
 #include <linux/cpu.h>
+#include <linux/mmu_context.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
 #include "mmu.h"
@@ -9,6 +10,7 @@
 #include "vmx.h"
 #include "mmu/spte.h"
 #include "common.h"
+#include "posted_intr.h"
 
 #include <trace/events/kvm.h>
 #include "trace.h"
@@ -605,6 +607,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
 		vcpu->arch.xfd_no_write_intercept = true;
 
+	tdx->host_state_need_save = true;
+	tdx->host_state_need_restore = false;
+
 	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
 
 	return 0;
@@ -631,6 +636,45 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	local_irq_enable();
 }
 
+/*
+ * Compared to vmx_prepare_switch_to_guest(), there is not much to do
+ * as SEAMCALL/SEAMRET calls take care of most of save and restore.
+ */
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (!tdx->host_state_need_save)
+		return;
+
+	if (likely(is_64bit_mm(current->mm)))
+		tdx->msr_host_kernel_gs_base = current->thread.gsbase;
+	else
+		tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+	tdx->host_state_need_save = false;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	tdx->host_state_need_save = true;
+	if (!tdx->host_state_need_restore)
+		return;
+
+	++vcpu->stat.host_state_reload;
+
+	wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
+	tdx->host_state_need_restore = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	vmx_vcpu_pi_put(vcpu);
+	tdx_prepare_switch_to_host(vcpu);
+}
+
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -732,6 +776,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 
 	tdx_vcpu_enter_exit(vcpu);
 
+	tdx->host_state_need_restore = true;
+
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 	trace_kvm_exit(vcpu, KVM_ISA_VMX);
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ebee1049b08b..48cf0a1abfcc 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -54,6 +54,10 @@ struct vcpu_tdx {
 	u64 vp_enter_ret;
 
 	enum vcpu_tdx_state state;
+
+	bool host_state_need_save;
+	bool host_state_need_restore;
+	u64 msr_host_kernel_gs_base;
 };
 
 void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 3d292a677b92..5bd45a720007 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -130,6 +130,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
@@ -161,6 +163,8 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediat
 {
 	return EXIT_FASTPATH_NONE;
 }
+static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (2 preceding siblings ...)
  2024-11-21 20:14 ` [PATCH 3/7] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) Adrian Hunter
@ 2024-11-21 20:14 ` Adrian Hunter
  2024-11-22  5:49   ` Chao Gao
  2024-11-21 20:14 ` [PATCH 5/7] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr Adrian Hunter
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-21 20:14 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, x86, yan.y.zhao, chao.gao,
	weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

On exiting from the guest TD, xsave state is clobbered.  Restore xsave
state on TD exit.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v1:
- Remove noinstr on tdx_vcpu_enter_exit() (Sean)
- Switch to kvm_host struct for xcr0 and xss

v19:
- Add EXPORT_SYMBOL_GPL(host_xcr0)

v15 -> v16:
- Added CET flag mask
---
 arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6e4ea2d420bc..00fdd2932205 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,8 @@
 #include <linux/cleanup.h>
 #include <linux/cpu.h>
 #include <linux/mmu_context.h>
+
+#include <asm/fpu/xcr.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
 #include "mmu.h"
@@ -709,6 +711,24 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 }
 
 
+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	if (static_cpu_has(X86_FEATURE_XSAVE) &&
+	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
+		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
+	if (static_cpu_has(X86_FEATURE_XSAVES) &&
+	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
+	    kvm_host.xss != (kvm_tdx->xfam &
+			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
+			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
+		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
+	if (static_cpu_has(X86_FEATURE_PKU) &&
+	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+		write_pkru(vcpu->arch.host_pkru);
+}
+
 static void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -776,6 +796,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 
 	tdx_vcpu_enter_exit(vcpu);
 
+	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 5/7] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (3 preceding siblings ...)
  2024-11-21 20:14 ` [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
@ 2024-11-21 20:14 ` Adrian Hunter
  2024-11-21 20:14 ` [PATCH 6/7] KVM: TDX: restore user ret MSRs Adrian Hunter
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-21 20:14 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, x86, yan.y.zhao, chao.gao,
	weijiang.yang

From: Chao Gao <chao.gao@intel.com>

Several MSRs are constant and only used in userspace(ring 3).  But VMs may
have different values.  KVM uses kvm_set_user_return_msr() to switch to
guest's values and leverages user return notifier to restore them when the
kernel is to return to userspace.  To eliminate unnecessary wrmsr, KVM also
caches the value it wrote to an MSR last time.

TDX module unconditionally resets some of these MSRs to architectural INIT
state on TD exit.  It makes the cached values in kvm_user_return_msrs are
inconsistent with values in hardware.  This inconsistency needs to be
fixed.  Otherwise, it may mislead kvm_on_user_return() to skip restoring
some MSRs to the host's values.  kvm_set_user_return_msr() can help correct
this case, but it is not optimal as it always does a wrmsr.  So, introduce
a variation of kvm_set_user_return_msr() to update cached values and skip
that wrmsr.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TD vcpu enter/exit v1:
 - Rename functions and remove useless comment (Binbin)
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dfa89a5d15ef..e51a95aba824 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2303,6 +2303,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 int kvm_add_user_return_msr(u32 msr);
 int kvm_find_user_return_msr(u32 msr);
 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
+void kvm_user_return_msr_update_cache(unsigned int index, u64 val);
 
 static inline bool kvm_is_supported_user_return_msr(u32 msr)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 92de7ebf2cee..2b5b0ae3dd7e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -637,6 +637,15 @@ static void kvm_user_return_msr_cpu_online(void)
 	}
 }
 
+static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
+{
+	if (!msrs->registered) {
+		msrs->urn.on_user_return = kvm_on_user_return;
+		user_return_notifier_register(&msrs->urn);
+		msrs->registered = true;
+	}
+}
+
 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 {
 	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
@@ -650,15 +659,20 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 		return 1;
 
 	msrs->values[slot].curr = value;
-	if (!msrs->registered) {
-		msrs->urn.on_user_return = kvm_on_user_return;
-		user_return_notifier_register(&msrs->urn);
-		msrs->registered = true;
-	}
+	kvm_user_return_register_notifier(msrs);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
 
+void kvm_user_return_msr_update_cache(unsigned int slot, u64 value)
+{
+	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+	msrs->values[slot].curr = value;
+	kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(kvm_user_return_msr_update_cache);
+
 static void drop_user_return_notifiers(void)
 {
 	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 6/7] KVM: TDX: restore user ret MSRs
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (4 preceding siblings ...)
  2024-11-21 20:14 ` [PATCH 5/7] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr Adrian Hunter
@ 2024-11-21 20:14 ` Adrian Hunter
  2024-11-21 20:14 ` [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list Adrian Hunter
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-21 20:14 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, x86, yan.y.zhao, chao.gao,
	weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Several user ret MSRs are clobbered on TD exit.  Restore those values on
TD exit and before returning to ring 3.  Because TSX_CTRL requires special
treatment, this patch doesn't address it.

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TD vcpu enter/exit v1:
 - Rename tdx_user_return_update_cache() ->
     tdx_user_return_msr_update_cache() (extrapolated from Binbin)
 - Adjust to rename in previous patches (Binbin)
 - Simplify comment (Tony)
 - Move code change in tdx_hardware_setup() to __tdx_bringup().
---
 arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 00fdd2932205..4a33ca54c8ba 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,7 +2,6 @@
 #include <linux/cleanup.h>
 #include <linux/cpu.h>
 #include <linux/mmu_context.h>
-
 #include <asm/fpu/xcr.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
@@ -711,6 +710,28 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 }
 
 
+struct tdx_uret_msr {
+	u32 msr;
+	unsigned int slot;
+	u64 defval;
+};
+
+static struct tdx_uret_msr tdx_uret_msrs[] = {
+	{.msr = MSR_SYSCALL_MASK, .defval = 0x20200 },
+	{.msr = MSR_STAR,},
+	{.msr = MSR_LSTAR,},
+	{.msr = MSR_TSC_AUX,},
+};
+
+static void tdx_user_return_msr_update_cache(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
+		kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot,
+						 tdx_uret_msrs[i].defval);
+}
+
 static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -796,6 +817,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 
 	tdx_vcpu_enter_exit(vcpu);
 
+	tdx_user_return_msr_update_cache();
 	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
@@ -2233,6 +2255,24 @@ static int __init __tdx_bringup(void)
 	for_each_possible_cpu(i)
 		INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i));
 
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
+		/*
+		 * Check if MSRs (tdx_uret_msrs) can be saved/restored
+		 * before returning to user space.
+		 *
+		 * this_cpu_ptr(user_return_msrs)->registered isn't checked
+		 * because the registration is done at vcpu runtime by
+		 * tdx_user_return_msr_update_cache().
+		 */
+		tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
+		if (tdx_uret_msrs[i].slot == -1) {
+			/* If any MSR isn't supported, it is a KVM bug */
+			pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
+				tdx_uret_msrs[i].msr);
+			return -EIO;
+		}
+	}
+
 	/*
 	 * Enabling TDX requires enabling hardware virtualization first,
 	 * as making SEAMCALLs requires CPU being in post-VMXON state.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (5 preceding siblings ...)
  2024-11-21 20:14 ` [PATCH 6/7] KVM: TDX: restore user ret MSRs Adrian Hunter
@ 2024-11-21 20:14 ` Adrian Hunter
  2024-11-22  3:27   ` Chao Gao
  2024-11-25  1:25 ` [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Binbin Wu
  2024-12-10 18:23 ` Paolo Bonzini
  8 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-21 20:14 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, x86, yan.y.zhao, chao.gao,
	weijiang.yang

From: Yang Weijiang <weijiang.yang@intel.com>

TDX module resets the TSX_CTRL MSR to 0 at TD exit if TSX is enabled for
TD. Or it preserves the TSX_CTRL MSR if TSX is disabled for TD.  VMM can
rely on uret_msrs mechanism to defer the reload of host value until exiting
to user space.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v1:
 - Update from rename in earlier patches (Binbin)

v19:
- fix the type of tdx_uret_tsx_ctrl_slot. unguent int => int.
---
 arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.h |  8 ++++++++
 2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4a33ca54c8ba..2c7b6308da73 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -722,14 +722,21 @@ static struct tdx_uret_msr tdx_uret_msrs[] = {
 	{.msr = MSR_LSTAR,},
 	{.msr = MSR_TSC_AUX,},
 };
+static int tdx_uret_tsx_ctrl_slot;
 
-static void tdx_user_return_msr_update_cache(void)
+static void tdx_user_return_msr_update_cache(struct kvm_vcpu *vcpu)
 {
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
 		kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot,
 						 tdx_uret_msrs[i].defval);
+	/*
+	 * TSX_CTRL is reset to 0 if guest TSX is supported. Otherwise
+	 * preserved.
+	 */
+	if (to_kvm_tdx(vcpu->kvm)->tsx_supported && tdx_uret_tsx_ctrl_slot != -1)
+		kvm_user_return_msr_update_cache(tdx_uret_tsx_ctrl_slot, 0);
 }
 
 static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
@@ -817,7 +824,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 
 	tdx_vcpu_enter_exit(vcpu);
 
-	tdx_user_return_msr_update_cache();
+	tdx_user_return_msr_update_cache(vcpu);
 	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
@@ -1258,6 +1265,22 @@ static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
 	return 0;
 }
 
+static bool tdparams_tsx_supported(struct kvm_cpuid2 *cpuid)
+{
+	const struct kvm_cpuid_entry2 *entry;
+	u64 mask;
+	u32 ebx;
+
+	entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x7, 0);
+	if (entry)
+		ebx = entry->ebx;
+	else
+		ebx = 0;
+
+	mask = __feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM);
+	return ebx & mask;
+}
+
 static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
 			struct kvm_tdx_init_vm *init_vm)
 {
@@ -1299,6 +1322,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
 	MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
 	MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
 
+	to_kvm_tdx(kvm)->tsx_supported = tdparams_tsx_supported(cpuid);
 	return 0;
 }
 
@@ -2272,6 +2296,11 @@ static int __init __tdx_bringup(void)
 			return -EIO;
 		}
 	}
+	tdx_uret_tsx_ctrl_slot = kvm_find_user_return_msr(MSR_IA32_TSX_CTRL);
+	if (tdx_uret_tsx_ctrl_slot == -1 && boot_cpu_has(X86_FEATURE_MSR_TSX_CTRL)) {
+		pr_err("MSR_IA32_TSX_CTRL isn't included by kvm_find_user_return_msr\n");
+		return -EIO;
+	}
 
 	/*
 	 * Enabling TDX requires enabling hardware virtualization first,
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 48cf0a1abfcc..815ff6bdbc7e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -29,6 +29,14 @@ struct kvm_tdx {
 	u8 nr_tdcs_pages;
 	u8 nr_vcpu_tdcx_pages;
 
+	/*
+	 * Used on each TD-exit, see tdx_user_return_msr_update_cache().
+	 * TSX_CTRL value on TD exit
+	 * - set 0     if guest TSX enabled
+	 * - preserved if guest TSX disabled
+	 */
+	bool tsx_supported;
+
 	u64 tsc_offset;
 
 	enum kvm_tdx_state state;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-11-21 20:14 ` [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list Adrian Hunter
@ 2024-11-22  3:27   ` Chao Gao
  2024-11-27 14:00     ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Chao Gao @ 2024-11-22  3:27 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

>+static bool tdparams_tsx_supported(struct kvm_cpuid2 *cpuid)
>+{
>+	const struct kvm_cpuid_entry2 *entry;
>+	u64 mask;
>+	u32 ebx;
>+
>+	entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x7, 0);
>+	if (entry)
>+		ebx = entry->ebx;
>+	else
>+		ebx = 0;
>+
>+	mask = __feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM);
>+	return ebx & mask;
>+}
>+
> static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> 			struct kvm_tdx_init_vm *init_vm)
> {
>@@ -1299,6 +1322,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> 	MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
> 	MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
> 
>+	to_kvm_tdx(kvm)->tsx_supported = tdparams_tsx_supported(cpuid);
> 	return 0;
> }
> 
>@@ -2272,6 +2296,11 @@ static int __init __tdx_bringup(void)
> 			return -EIO;
> 		}
> 	}
>+	tdx_uret_tsx_ctrl_slot = kvm_find_user_return_msr(MSR_IA32_TSX_CTRL);
>+	if (tdx_uret_tsx_ctrl_slot == -1 && boot_cpu_has(X86_FEATURE_MSR_TSX_CTRL)) {
>+		pr_err("MSR_IA32_TSX_CTRL isn't included by kvm_find_user_return_msr\n");
>+		return -EIO;
>+	}
> 
> 	/*
> 	 * Enabling TDX requires enabling hardware virtualization first,
>diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>index 48cf0a1abfcc..815ff6bdbc7e 100644
>--- a/arch/x86/kvm/vmx/tdx.h
>+++ b/arch/x86/kvm/vmx/tdx.h
>@@ -29,6 +29,14 @@ struct kvm_tdx {
> 	u8 nr_tdcs_pages;
> 	u8 nr_vcpu_tdcx_pages;
> 
>+	/*
>+	 * Used on each TD-exit, see tdx_user_return_msr_update_cache().
>+	 * TSX_CTRL value on TD exit
>+	 * - set 0     if guest TSX enabled
>+	 * - preserved if guest TSX disabled
>+	 */
>+	bool tsx_supported;

Is it possible to drop this boolean and tdparams_tsx_supported()? I think we
can use the guest_can_use() framework instead.

>+
> 	u64 tsc_offset;
> 
> 	enum kvm_tdx_state state;
>-- 
>2.43.0
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path
  2024-11-21 20:14 ` [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path Adrian Hunter
@ 2024-11-22  5:23   ` Xiaoyao Li
  2024-11-22  5:56     ` Binbin Wu
  0 siblings, 1 reply; 82+ messages in thread
From: Xiaoyao Li @ 2024-11-22  5:23 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	x86, yan.y.zhao, chao.gao, weijiang.yang

On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> This patch implements running TDX vcpu.  Once vcpu runs on the logical
> processor (LP), the TDX vcpu is associated with it.  When the TDX vcpu
> moves to another LP, the TDX vcpu needs to flush its status on the LP.
> When destroying TDX vcpu, it needs to complete flush, and flush cpu memory
> cache.  Track which LP the TDX vcpu run and flush it as necessary.

The changelog needs update. It doesn't match the patch content.

> Compared to VMX, do nothing on sched_in event as TDX doesn't support pause
> loop.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
> TD vcpu enter/exit v1:
> - Make argument of tdx_vcpu_enter_exit() struct kvm_vcpu.
> - Update for the wrapper functions for SEAMCALLs. (Sean)
> - Remove noinstr (Sean)
> - Add a missing comma, clarify sched_in part, and update changelog to
>    match code by dropping the PMU related paragraph (Binbin)
>    https://lore.kernel.org/lkml/c0029d4d-3dee-4f11-a929-d64d2651bfb3@linux.intel.com/
> - Remove the union tdx_exit_reason. (Sean)
>    https://lore.kernel.org/kvm/ZfSExlemFMKjBtZb@google.com/
> - Remove the code of special handling of vcpu->kvm->vm_bugged (Rick)
>    https://lore.kernel.org/kvm/20240318234010.GD1645738@ls.amr.corp.intel.com/
> - For !tdx->initialized case, set tdx->vp_enter_ret to TDX_SW_ERROR to avoid
>    collision with EXIT_REASON_EXCEPTION_NMI.
> 
> v19:
> - Removed export_symbol_gpl(host_xcr0) to the patch that uses it
> 
> Changes v15 -> v16:
> - use __seamcall_saved_ret()
> - As struct tdx_module_args doesn't match with vcpu.arch.regs, copy regs
>    before/after calling __seamcall_saved_ret().
> ---
>   arch/x86/kvm/vmx/main.c    | 21 ++++++++++-
>   arch/x86/kvm/vmx/tdx.c     | 76 ++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/tdx.h     |  2 +
>   arch/x86/kvm/vmx/x86_ops.h |  5 +++
>   4 files changed, 102 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index bfed421e6fbb..44ec6005a448 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -129,6 +129,23 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	vmx_vcpu_load(vcpu, cpu);
>   }
>   
> +static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		/* Unconditionally continue to vcpu_run(). */
> +		return 1;
> +
> +	return vmx_vcpu_pre_run(vcpu);
> +}
> +
> +static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_run(vcpu, force_immediate_exit);
> +
> +	return vmx_vcpu_run(vcpu, force_immediate_exit);
> +}
> +
>   static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu)) {
> @@ -267,8 +284,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.flush_tlb_gva = vt_flush_tlb_gva,
>   	.flush_tlb_guest = vt_flush_tlb_guest,
>   
> -	.vcpu_pre_run = vmx_vcpu_pre_run,
> -	.vcpu_run = vmx_vcpu_run,
> +	.vcpu_pre_run = vt_vcpu_pre_run,
> +	.vcpu_run = vt_vcpu_run,
>   	.handle_exit = vmx_handle_exit,
>   	.skip_emulated_instruction = vmx_skip_emulated_instruction,
>   	.update_emulated_instruction = vmx_update_emulated_instruction,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index dc6c5f40608e..5fa5b65b9588 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -10,6 +10,9 @@
>   #include "mmu/spte.h"
>   #include "common.h"
>   
> +#include <trace/events/kvm.h>
> +#include "trace.h"
> +
>   #undef pr_fmt
>   #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>   
> @@ -662,6 +665,79 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>   }
>   
>   
> +static void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	struct tdx_module_args args;
> +
> +	guest_state_enter_irqoff();
> +
> +	/*
> +	 * TODO: optimization:
> +	 * - Eliminate copy between args and vcpu->arch.regs.
> +	 * - copyin/copyout registers only if (tdx->tdvmvall.regs_mask != 0)
> +	 *   which means TDG.VP.VMCALL.
> +	 */
> +	args = (struct tdx_module_args) {
> +		.rcx = tdx->tdvpr_pa,
> +#define REG(reg, REG)	.reg = vcpu->arch.regs[VCPU_REGS_ ## REG]
> +		REG(rdx, RDX),
> +		REG(r8,  R8),
> +		REG(r9,  R9),
> +		REG(r10, R10),
> +		REG(r11, R11),
> +		REG(r12, R12),
> +		REG(r13, R13),
> +		REG(r14, R14),
> +		REG(r15, R15),
> +		REG(rbx, RBX),
> +		REG(rdi, RDI),
> +		REG(rsi, RSI),
> +#undef REG
> +	};
> +
> +	tdx->vp_enter_ret = tdh_vp_enter(tdx->tdvpr_pa, &args);
> +
> +#define REG(reg, REG)	vcpu->arch.regs[VCPU_REGS_ ## REG] = args.reg
> +	REG(rcx, RCX);
> +	REG(rdx, RDX);
> +	REG(r8,  R8);
> +	REG(r9,  R9);
> +	REG(r10, R10);
> +	REG(r11, R11);
> +	REG(r12, R12);
> +	REG(r13, R13);
> +	REG(r14, R14);
> +	REG(r15, R15);
> +	REG(rbx, RBX);
> +	REG(rdi, RDI);
> +	REG(rsi, RSI);
> +#undef REG
> +
> +	guest_state_exit_irqoff();
> +}
> +
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	/* TDX exit handle takes care of this error case. */
> +	if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> +		/* Set to avoid collision with EXIT_REASON_EXCEPTION_NMI. */

It seems the check fits better in tdx_vcpu_pre_run().

And without the patch of how TDX handles Exit (i.e., how deal with 
vp_enter_ret), it's hard to review this comment.

> +		tdx->vp_enter_ret = TDX_SW_ERROR;
> +		return EXIT_FASTPATH_NONE;
> +	}
> +
> +	trace_kvm_entry(vcpu, force_immediate_exit);
> +
> +	tdx_vcpu_enter_exit(vcpu);
> +
> +	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> +	trace_kvm_exit(vcpu, KVM_ISA_VMX);
> +
> +	return EXIT_FASTPATH_NONE;
> +}
> +
>   void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>   {
>   	u64 shared_bit = (pgd_level == 5) ? TDX_SHARED_BIT_PWL_5 :
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 899654519df6..ebee1049b08b 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -51,6 +51,8 @@ struct vcpu_tdx {
>   
>   	struct list_head cpu_list;
>   
> +	u64 vp_enter_ret;
> +
>   	enum vcpu_tdx_state state;
>   };
>   
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 06583b1afa4f..3d292a677b92 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -129,6 +129,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
>   
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>   
> @@ -156,6 +157,10 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
>   static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>   static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
> +static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> +{
> +	return EXIT_FASTPATH_NONE;
> +}
>   
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>   


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-11-21 20:14 ` [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
@ 2024-11-22  5:49   ` Chao Gao
  2024-11-25 11:10     ` Adrian Hunter
  2024-11-25 11:34     ` Adrian Hunter
  0 siblings, 2 replies; 82+ messages in thread
From: Chao Gao @ 2024-11-22  5:49 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

>+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>+{
>+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>+
>+	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>+	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>+		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>+	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>+	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>+	    kvm_host.xss != (kvm_tdx->xfam &
>+			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>+			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))

Should we drop CET/PT from this series? I think they are worth a new
patch/series.

>+		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>+	if (static_cpu_has(X86_FEATURE_PKU) &&

How about using cpu_feature_enabled()? It is used in kvm_load_host_xsave_state()
It handles the case where CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not
enabled.

>+	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
>+		write_pkru(vcpu->arch.host_pkru);

If host_pkru happens to match the hardware value after TD-exits, the write can
be omitted, similar to what is done above for xss and xcr0.

>+}

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path
  2024-11-22  5:23   ` Xiaoyao Li
@ 2024-11-22  5:56     ` Binbin Wu
  2024-11-22 14:33       ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Binbin Wu @ 2024-11-22  5:56 UTC (permalink / raw)
  To: Xiaoyao Li, Adrian Hunter, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	dmatlack, isaku.yamahata, nik.borisov, linux-kernel, x86,
	yan.y.zhao, chao.gao, weijiang.yang




On 11/22/2024 1:23 PM, Xiaoyao Li wrote:
[...]
>> +
>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>> +{
>> +    struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>> +    /* TDX exit handle takes care of this error case. */
>> +    if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
>> +        /* Set to avoid collision with EXIT_REASON_EXCEPTION_NMI. */
>
> It seems the check fits better in tdx_vcpu_pre_run().

Indeed, it's cleaner to move the check to vcpu_pre_run.
Then no need to set the value to vp_enter_ret, and the comments are not
needed.

>
> And without the patch of how TDX handles Exit (i.e., how deal with vp_enter_ret), it's hard to review this comment.
>
>> +        tdx->vp_enter_ret = TDX_SW_ERROR;
>> +        return EXIT_FASTPATH_NONE;
>> +    }
>> +
>> +    trace_kvm_entry(vcpu, force_immediate_exit);
>> +
>> +    tdx_vcpu_enter_exit(vcpu);
>> +
>> +    vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
>> +
>> +    return EXIT_FASTPATH_NONE;
>> +}
>> +
[...]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-21 20:14 ` [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest Adrian Hunter
@ 2024-11-22 11:10   ` Adrian Hunter
  2024-11-22 16:33     ` Dave Hansen
  2024-11-22 16:26   ` Dave Hansen
  1 sibling, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-22 11:10 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 21/11/24 22:14, Adrian Hunter wrote:
> From: Kai Huang <kai.huang@intel.com>
> 
> Intel TDX protects guest VM's from malicious host and certain physical
> attacks.  TDX introduces a new operation mode, Secure Arbitration Mode
> (SEAM) to isolate and protect guest VM's.  A TDX guest VM runs in SEAM and,
> unlike VMX, direct control and interaction with the guest by the host VMM
> is not possible.  Instead, Intel TDX Module, which also runs in SEAM,
> provides a SEAMCALL API.
> 
> The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER.
> The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX
> VMLAUNCH/VMRESUME instructions.  When a guest VM-exit requires host VMM
> interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).
> 
> Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER.
> 
> TDH.VP.ENTER is different from other SEAMCALLS in several ways:
>  - it may take some time to return as the guest executes
>  - it uses more arguments
>  - after it returns some host state may need to be restored
> 
> TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs).
> For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR
> can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux
> mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere.
> Additionally, XMM registers are not required for the existing Guest
> Hypervisor Communication Interface and are handled by existing KVM code
> should they be modified by the guest.
> 
> There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments.
> Input #1 : Initial entry or following a previous async. TD Exit
> Input #2 : Following a previous TDCALL(TDG.VP.VMCALL)
> Output #1 : On Error (No TD Entry)
> Output #2 : Async. Exits with a VMX Architectural Exit Reason
> Output #3 : Async. Exits with a non-VMX TD Exit Status
> Output #4 : Async. Exits with Cross-TD Exit Details
> Output #5 : On TDCALL(TDG.VP.VMCALL)
> 
> Currently, to keep things simple, the wrapper function does not attempt
> to support different formats, and just passes all the GPRs that could be
> used.  The GPR values are held by KVM in the area set aside for guest
> GPRs.  KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for
> or process results of tdh_vp_enter().
> 
> Therefore changing tdh_vp_enter() to use more complex argument formats
> would also alter the way KVM code interacts with tdh_vp_enter().
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  arch/x86/include/asm/tdx.h  | 1 +
>  arch/x86/virt/vmx/tdx/tdx.c | 8 ++++++++
>  arch/x86/virt/vmx/tdx/tdx.h | 1 +
>  3 files changed, 10 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index fdc81799171e..77477b905dca 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -123,6 +123,7 @@ int tdx_guest_keyid_alloc(void);
>  void tdx_guest_keyid_free(unsigned int keyid);
>  
>  /* SEAMCALL wrappers for creating/destroying/running TDX guests */
> +u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args);
>  u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
>  u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx);
>  u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 04cb2f1d6deb..2a8997eb1ef1 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1600,6 +1600,14 @@ static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
>  	return ret;
>  }
>  
> +u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
> +{
> +	args->rcx = tdvpr;
> +
> +	return __seamcall_saved_ret(TDH_VP_ENTER, args);
> +}
> +EXPORT_SYMBOL_GPL(tdh_vp_enter);

One alternative could be to create a union to hold the arguments:

u64 tdh_vp_enter(u64 tdvpr, union tdh_vp_enter_args *vp_enter_args)
{
	struct tdx_module_args *args = (struct tdx_module_args *)vp_enter_args;

	args->rcx = tdvpr;

	return __seamcall_saved_ret(TDH_VP_ENTER, args);
}

The diff below shows what that would look like for KVM TDX, based on top
of:

	https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20

Define 'union tdh_vp_enter_args' to hold tdh_vp_enter() arguments
instead of using vcpu->arch.regs[].  For example, in tdexit_exit_qual()

	kvm_rcx_read(vcpu)

becomes:

	to_tdx(vcpu)->vp_enter_args.out.exit_qual

which has the advantage that it provides variable names for the different
arguments.

---
 arch/x86/include/asm/tdx.h  | 163 +++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.c      | 205 +++++++++++++++---------------------
 arch/x86/kvm/vmx/tdx.h      |   1 +
 arch/x86/virt/vmx/tdx/tdx.c |   4 +-
 4 files changed, 249 insertions(+), 124 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 01409a59224d..3568e6b36b77 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -123,8 +123,169 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
 int tdx_guest_keyid_alloc(void);
 void tdx_guest_keyid_free(unsigned int keyid);
 
+/* TDH.VP.ENTER Input Format #2 : Following a previous TDCALL(TDG.VP.VMCALL) */
+struct tdh_vp_enter_in {
+	u64	__vcpu_handle_and_flags; /* Don't use. tdh_vp_enter() will take care of it */
+	u64	unused[3];
+	u64	ret_code;
+	union {
+		u64 gettdvmcallinfo[4];
+		struct {
+			u64	failed_gpa;
+		} mapgpa;
+		struct {
+			u64	unused;
+			u64	eax;
+			u64	ebx;
+			u64	ecx;
+			u64	edx;
+		} cpuid;
+		/* Value read for IO, MMIO or RDMSR */
+		struct {
+			u64	value;
+		} read;
+	};
+};
+
+/*
+ * TDH.VP.ENTER Output Formats #2 and #3 combined:
+ *	#2 : Async TD exits with a VMX Architectural Exit Reason
+ *	#3 : Async TD exits with a non-VMX TD Exit Status
+ */
+struct tdh_vp_enter_out {
+	u64	exit_qual	: 32,	/* #2 only */
+		vm_idx		:  2,	/* #2 and #3 */
+		reserved_0	: 30;
+	u64	ext_exit_qual;		/* #2 only */
+	u64	gpa;			/* #2 only */
+	u64	interrupt_info	: 32,	/* #2 only */
+		reserved_1	: 32;
+	u64	unused[9];
+};
+
+/*
+ * KVM hypercall : Refer struct tdh_vp_enter_tdcall - fn is the non-zero
+ * hypercall number (nr), subfn is the first parameter (p1), and p2 to p3
+ * below are the remaining parameters.
+ */
+struct tdh_vp_enter_vmcall {
+	u64	p2;
+	u64	p3;
+	u64	p4;
+};
+
+/* TDVMCALL_GET_TD_VM_CALL_INFO */
+struct tdh_vp_enter_gettdvmcallinfo {
+	u64	leaf;
+};
+
+/* TDVMCALL_MAP_GPA */
+struct tdh_vp_enter_mapgpa {
+	u64	gpa;
+	u64	size;
+};
+
+/* TDVMCALL_GET_QUOTE */
+struct tdh_vp_enter_getquote {
+	u64	shared_gpa;
+	u64	size;
+};
+
+#define TDX_ERR_DATA_PART_1 5
+
+/* TDVMCALL_REPORT_FATAL_ERROR */
+struct tdh_vp_enter_reportfatalerror {
+	union {
+		u64	err_codes;
+		struct {
+			u64	err_code	: 32,
+				ext_err_code	: 31,
+				gpa_valid	:  1;
+		};
+	};
+	u64	err_data_gpa;
+	u64	err_data[TDX_ERR_DATA_PART_1];
+};
+
+/* EXIT_REASON_CPUID */
+struct tdh_vp_enter_cpuid {
+	u64	eax;
+	u64	ecx;
+};
+
+/* EXIT_REASON_EPT_VIOLATION */
+struct tdh_vp_enter_mmio {
+	u64	size;
+	u64	direction;
+	u64	mmio_addr;
+	u64	value;
+};
+
+/* EXIT_REASON_HLT */
+struct tdh_vp_enter_hlt {
+	u64	intr_blocked_flag;
+};
+
+/* EXIT_REASON_IO_INSTRUCTION */
+struct tdh_vp_enter_io {
+	u64	size;
+	u64	direction;
+	u64	port;
+	u64	value;
+};
+
+/* EXIT_REASON_MSR_READ */
+struct tdh_vp_enter_rd {
+	u64	msr;
+};
+
+/* EXIT_REASON_MSR_WRITE */
+struct  tdh_vp_enter_wr {
+	u64	msr;
+	u64	value;
+};
+
+#define TDX_ERR_DATA_PART_2 3
+
+/* TDH.VP.ENTER  Output Format #5 : On TDCALL(TDG.VP.VMCALL) */
+struct tdh_vp_enter_tdcall {
+	u64	reg_mask	: 32,
+		vm_idx		:  2,
+		reserved_0	: 30;
+	u64	data[TDX_ERR_DATA_PART_2];
+	u64	fn;	/* Non-zero for hypercalls, zero otherwise */
+	u64	subfn;
+	union {
+		struct tdh_vp_enter_vmcall 		vmcall;
+		struct tdh_vp_enter_gettdvmcallinfo	gettdvmcallinfo;
+		struct tdh_vp_enter_mapgpa		mapgpa;
+		struct tdh_vp_enter_getquote		getquote;
+		struct tdh_vp_enter_reportfatalerror	reportfatalerror;
+		struct tdh_vp_enter_cpuid		cpuid;
+		struct tdh_vp_enter_mmio		mmio;
+		struct tdh_vp_enter_hlt			hlt;
+		struct tdh_vp_enter_io			io;
+		struct tdh_vp_enter_rd			rd;
+		struct tdh_vp_enter_wr			wr;
+	};
+};
+
+/* Must be kept exactly in sync with struct tdx_module_args */
+union tdh_vp_enter_args {
+	/* Input Format #2 : Following a previous TDCALL(TDG.VP.VMCALL) */
+	struct tdh_vp_enter_in in;
+	/*
+	 * Output Formats #2 and #3 combined:
+	 *	#2 : Async TD exits with a VMX Architectural Exit Reason
+	 *	#3 : Async TD exits with a non-VMX TD Exit Status
+	 */
+	struct tdh_vp_enter_out out;
+	/* Output Format #5 : On TDCALL(TDG.VP.VMCALL) */
+	struct tdh_vp_enter_tdcall tdcall;
+};
+
 /* SEAMCALL wrappers for creating/destroying/running TDX guests */
-u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args);
+u64 tdh_vp_enter(u64 tdvpr, union tdh_vp_enter_args *tdh_vp_enter_args);
 u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
 u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx);
 u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f5fc1a782b5b..56af7b8c71ab 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -211,57 +211,41 @@ static bool tdx_check_exit_reason(struct kvm_vcpu *vcpu, u16 reason)
 
 static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
 {
-	return kvm_rcx_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_args.out.exit_qual;
 }
 
 static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
 {
-	return kvm_rdx_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_args.out.ext_exit_qual;
 }
 
 static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
 {
-	return kvm_r8_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_args.out.gpa;
 }
 
 static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
 {
-	return kvm_r9_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_args.out.interrupt_info;
 }
 
-#define BUILD_TDVMCALL_ACCESSORS(param, gpr)				\
-static __always_inline							\
-unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu)		\
-{									\
-	return kvm_##gpr##_read(vcpu);					\
-}									\
-static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
-						     unsigned long val)  \
-{									\
-	kvm_##gpr##_write(vcpu, val);					\
-}
-BUILD_TDVMCALL_ACCESSORS(a0, r12);
-BUILD_TDVMCALL_ACCESSORS(a1, r13);
-BUILD_TDVMCALL_ACCESSORS(a2, r14);
-BUILD_TDVMCALL_ACCESSORS(a3, r15);
-
 static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
 {
-	return kvm_r10_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_args.tdcall.fn;
 }
 static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
 {
-	return kvm_r11_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_args.tdcall.subfn;
 }
 static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
 						     long val)
 {
-	kvm_r10_write(vcpu, val);
+	to_tdx(vcpu)->vp_enter_args.in.ret_code = val;
 }
 static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
 						    unsigned long val)
 {
-	kvm_r11_write(vcpu, val);
+	to_tdx(vcpu)->vp_enter_args.in.read.value = val;
 }
 
 static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
@@ -745,7 +729,7 @@ bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu)
 	    tdvmcall_exit_type(vcpu) || tdvmcall_leaf(vcpu) != EXIT_REASON_HLT)
 	    return true;
 
-	return !tdvmcall_a0_read(vcpu);
+	return !to_tdx(vcpu)->vp_enter_args.tdcall.hlt.intr_blocked_flag;
 }
 
 bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
@@ -899,51 +883,10 @@ static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
 static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
-	struct tdx_module_args args;
 
 	guest_state_enter_irqoff();
 
-	/*
-	 * TODO: optimization:
-	 * - Eliminate copy between args and vcpu->arch.regs.
-	 * - copyin/copyout registers only if (tdx->tdvmvall.regs_mask != 0)
-	 *   which means TDG.VP.VMCALL.
-	 */
-	args = (struct tdx_module_args) {
-		.rcx = tdx->tdvpr_pa,
-#define REG(reg, REG)	.reg = vcpu->arch.regs[VCPU_REGS_ ## REG]
-		REG(rdx, RDX),
-		REG(r8,  R8),
-		REG(r9,  R9),
-		REG(r10, R10),
-		REG(r11, R11),
-		REG(r12, R12),
-		REG(r13, R13),
-		REG(r14, R14),
-		REG(r15, R15),
-		REG(rbx, RBX),
-		REG(rdi, RDI),
-		REG(rsi, RSI),
-#undef REG
-	};
-
-	tdx->vp_enter_ret = tdh_vp_enter(tdx->tdvpr_pa, &args);
-
-#define REG(reg, REG)	vcpu->arch.regs[VCPU_REGS_ ## REG] = args.reg
-	REG(rcx, RCX);
-	REG(rdx, RDX);
-	REG(r8,  R8);
-	REG(r9,  R9);
-	REG(r10, R10);
-	REG(r11, R11);
-	REG(r12, R12);
-	REG(r13, R13);
-	REG(r14, R14);
-	REG(r15, R15);
-	REG(rbx, RBX);
-	REG(rdi, RDI);
-	REG(rsi, RSI);
-#undef REG
+	tdx->vp_enter_ret = tdh_vp_enter(tdx->tdvpr_pa, &tdx->vp_enter_args);
 
 	if (tdx_check_exit_reason(vcpu, EXIT_REASON_EXCEPTION_NMI) &&
 	    is_nmi(tdexit_intr_info(vcpu)))
@@ -1083,8 +1026,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 
 static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int r;
 
+	kvm_r10_write(vcpu, tdx->vp_enter_args.tdcall.fn);
+	kvm_r11_write(vcpu, tdx->vp_enter_args.tdcall.subfn);
+	kvm_r12_write(vcpu, tdx->vp_enter_args.tdcall.vmcall.p2);
+	kvm_r13_write(vcpu, tdx->vp_enter_args.tdcall.vmcall.p3);
+	kvm_r14_write(vcpu, tdx->vp_enter_args.tdcall.vmcall.p4);
+
 	/*
 	 * ABI for KVM tdvmcall argument:
 	 * In Guest-Hypervisor Communication Interface(GHCI) specification,
@@ -1092,13 +1042,12 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
 	 * vendor-specific.  KVM uses this for KVM hypercall.  NOTE: KVM
 	 * hypercall number starts from one.  Zero isn't used for KVM hypercall
 	 * number.
-	 *
-	 * R10: KVM hypercall number
-	 * arguments: R11, R12, R13, R14.
 	 */
 	r = __kvm_emulate_hypercall(vcpu, r10, r11, r12, r13, r14, true, 0,
 				    R10, complete_hypercall_exit);
 
+	tdvmcall_set_return_code(vcpu, kvm_r10_read(vcpu));
+
 	return r > 0;
 }
 
@@ -1116,7 +1065,7 @@ static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
 
 	if(vcpu->run->hypercall.ret) {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
-		kvm_r11_write(vcpu, tdx->map_gpa_next);
+		tdx->vp_enter_args.in.mapgpa.failed_gpa = tdx->map_gpa_next;
 		return 1;
 	}
 
@@ -1137,7 +1086,7 @@ static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
 	if (pi_has_pending_interrupt(vcpu) ||
 	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
-		kvm_r11_write(vcpu, tdx->map_gpa_next);
+		tdx->vp_enter_args.in.mapgpa.failed_gpa = tdx->map_gpa_next;
 		return 1;
 	}
 
@@ -1169,8 +1118,8 @@ static void __tdx_map_gpa(struct vcpu_tdx * tdx)
 static int tdx_map_gpa(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx * tdx = to_tdx(vcpu);
-	u64 gpa = tdvmcall_a0_read(vcpu);
-	u64 size = tdvmcall_a1_read(vcpu);
+	u64 gpa  = tdx->vp_enter_args.tdcall.mapgpa.gpa;
+	u64 size = tdx->vp_enter_args.tdcall.mapgpa.size;
 	u64 ret;
 
 	/*
@@ -1206,14 +1155,19 @@ static int tdx_map_gpa(struct kvm_vcpu *vcpu)
 
 error:
 	tdvmcall_set_return_code(vcpu, ret);
-	kvm_r11_write(vcpu, gpa);
+	tdx->vp_enter_args.in.mapgpa.failed_gpa = gpa;
 	return 1;
 }
 
 static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 {
-	u64 reg_mask = kvm_rcx_read(vcpu);
-	u64* opt_regs;
+	union tdh_vp_enter_args *args = &to_tdx(vcpu)->vp_enter_args;
+	__u64 *data = &vcpu->run->system_event.data[0];
+	u64 reg_mask = args->tdcall.reg_mask;
+	const int mask[] = {14, 15, 3, 7, 6};
+	int cnt = 0;
+
+	BUILD_BUG_ON(ARRAY_SIZE(mask) != TDX_ERR_DATA_PART_1);
 
 	/*
 	 * Skip sanity checks and let userspace decide what to do if sanity
@@ -1221,32 +1175,35 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 	 */
 	vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
 	vcpu->run->system_event.type = KVM_SYSTEM_EVENT_TDX_FATAL;
-	vcpu->run->system_event.ndata = 10;
 	/* Error codes. */
-	vcpu->run->system_event.data[0] = tdvmcall_a0_read(vcpu);
+	data[cnt++] = args->tdcall.reportfatalerror.err_codes;
 	/* GPA of additional information page. */
-	vcpu->run->system_event.data[1] = tdvmcall_a1_read(vcpu);
+	data[cnt++] = args->tdcall.reportfatalerror.err_data_gpa;
+
 	/* Information passed via registers (up to 64 bytes). */
-	opt_regs = &vcpu->run->system_event.data[2];
+	for (int i = 0; i < TDX_ERR_DATA_PART_1; i++) {
+		if (reg_mask & BIT_ULL(mask[i]))
+			data[cnt++] = args->tdcall.reportfatalerror.err_data[i];
+		else
+			data[cnt++] = 0;
+	}
 
-#define COPY_REG(REG, MASK)						\
-	do {								\
-		if (reg_mask & MASK)					\
-			*opt_regs = kvm_ ## REG ## _read(vcpu);		\
-		else							\
-			*opt_regs = 0;					\
-		opt_regs++;						\
-	} while (0)
+	if (reg_mask & BIT_ULL(8))
+		data[cnt++] = args->tdcall.data[1];
+	else
+		data[cnt++] = 0;
 
-	/* The order is defined in GHCI. */
-	COPY_REG(r14, BIT_ULL(14));
-	COPY_REG(r15, BIT_ULL(15));
-	COPY_REG(rbx, BIT_ULL(3));
-	COPY_REG(rdi, BIT_ULL(7));
-	COPY_REG(rsi, BIT_ULL(6));
-	COPY_REG(r8, BIT_ULL(8));
-	COPY_REG(r9, BIT_ULL(9));
-	COPY_REG(rdx, BIT_ULL(2));
+	if (reg_mask & BIT_ULL(9))
+		data[cnt++] = args->tdcall.data[2];
+	else
+		data[cnt++] = 0;
+
+	if (reg_mask & BIT_ULL(2))
+		data[cnt++] = args->tdcall.data[0];
+	else
+		data[cnt++] = 0;
+
+	vcpu->run->system_event.ndata = cnt;
 
 	/*
 	 * Set the status code according to GHCI spec, although the vCPU may
@@ -1260,18 +1217,18 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 
 static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	u32 eax, ebx, ecx, edx;
 
-	/* EAX and ECX for cpuid is stored in R12 and R13. */
-	eax = tdvmcall_a0_read(vcpu);
-	ecx = tdvmcall_a1_read(vcpu);
+	eax = tdx->vp_enter_args.tdcall.cpuid.eax;
+	ecx = tdx->vp_enter_args.tdcall.cpuid.ecx;
 
 	kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);
 
-	tdvmcall_a0_write(vcpu, eax);
-	tdvmcall_a1_write(vcpu, ebx);
-	tdvmcall_a2_write(vcpu, ecx);
-	tdvmcall_a3_write(vcpu, edx);
+	tdx->vp_enter_args.in.cpuid.eax = eax;
+	tdx->vp_enter_args.in.cpuid.ebx = ebx;
+	tdx->vp_enter_args.in.cpuid.ecx = ecx;
+	tdx->vp_enter_args.in.cpuid.edx = edx;
 
 	tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
 
@@ -1312,6 +1269,7 @@ static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
 static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 {
 	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	unsigned long val = 0;
 	unsigned int port;
 	int size, ret;
@@ -1319,9 +1277,9 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 
 	++vcpu->stat.io_exits;
 
-	size = tdvmcall_a0_read(vcpu);
-	write = tdvmcall_a1_read(vcpu);
-	port = tdvmcall_a2_read(vcpu);
+	size  = tdx->vp_enter_args.tdcall.io.size;
+	write = tdx->vp_enter_args.tdcall.io.direction;
+	port  = tdx->vp_enter_args.tdcall.io.port;
 
 	if (size != 1 && size != 2 && size != 4) {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
@@ -1329,7 +1287,7 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 	}
 
 	if (write) {
-		val = tdvmcall_a3_read(vcpu);
+		val = tdx->vp_enter_args.tdcall.io.value;
 		ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
 	} else {
 		ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
@@ -1397,14 +1355,15 @@ static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
 
 static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int size, write, r;
 	unsigned long val;
 	gpa_t gpa;
 
-	size = tdvmcall_a0_read(vcpu);
-	write = tdvmcall_a1_read(vcpu);
-	gpa = tdvmcall_a2_read(vcpu);
-	val = write ? tdvmcall_a3_read(vcpu) : 0;
+	size  = tdx->vp_enter_args.tdcall.mmio.size;
+	write = tdx->vp_enter_args.tdcall.mmio.direction;
+	gpa   = tdx->vp_enter_args.tdcall.mmio.mmio_addr;
+	val = write ? tdx->vp_enter_args.tdcall.mmio.value : 0;
 
 	if (size != 1 && size != 2 && size != 4 && size != 8)
 		goto error;
@@ -1456,7 +1415,7 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
 
 static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
 {
-	u32 index = tdvmcall_a0_read(vcpu);
+	u32 index = to_tdx(vcpu)->vp_enter_args.tdcall.rd.msr;
 	u64 data;
 
 	if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ) ||
@@ -1474,8 +1433,8 @@ static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
 
 static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
 {
-	u32 index = tdvmcall_a0_read(vcpu);
-	u64 data = tdvmcall_a1_read(vcpu);
+	u32 index = to_tdx(vcpu)->vp_enter_args.tdcall.wr.msr;
+	u64 data  = to_tdx(vcpu)->vp_enter_args.tdcall.wr.value;
 
 	if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE) ||
 	    kvm_set_msr(vcpu, index, data)) {
@@ -1491,14 +1450,16 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
 
 static int tdx_get_td_vm_call_info(struct kvm_vcpu *vcpu)
 {
-	if (tdvmcall_a0_read(vcpu))
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (tdx->vp_enter_args.tdcall.gettdvmcallinfo.leaf) {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
-	else {
+	} else {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
-		kvm_r11_write(vcpu, 0);
-		tdvmcall_a0_write(vcpu, 0);
-		tdvmcall_a1_write(vcpu, 0);
-		tdvmcall_a2_write(vcpu, 0);
+		tdx->vp_enter_args.in.gettdvmcallinfo[0] = 0;
+		tdx->vp_enter_args.in.gettdvmcallinfo[1] = 0;
+		tdx->vp_enter_args.in.gettdvmcallinfo[2] = 0;
+		tdx->vp_enter_args.in.gettdvmcallinfo[3] = 0;
 	}
 	return 1;
 }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index c9daf71d358a..a0d33b048b7e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -71,6 +71,7 @@ struct vcpu_tdx {
 	struct list_head cpu_list;
 
 	u64 vp_enter_ret;
+	union tdh_vp_enter_args vp_enter_args;
 
 	enum vcpu_tdx_state state;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 16e0b598c4ec..d5c06c5eeaec 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1600,8 +1600,10 @@ static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
 	return ret;
 }
 
-noinstr u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
+noinstr u64 tdh_vp_enter(u64 tdvpr, union tdh_vp_enter_args *vp_enter_args)
 {
+	struct tdx_module_args *args = (struct tdx_module_args *)vp_enter_args;
+
 	args->rcx = tdvpr;
 
 	return __seamcall_saved_ret(TDH_VP_ENTER, args);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path
  2024-11-22  5:56     ` Binbin Wu
@ 2024-11-22 14:33       ` Adrian Hunter
  2024-11-28  5:56         ` Yan Zhao
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-22 14:33 UTC (permalink / raw)
  To: Binbin Wu, Xiaoyao Li, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	dmatlack, isaku.yamahata, nik.borisov, linux-kernel, x86,
	yan.y.zhao, chao.gao, weijiang.yang

On 22/11/24 07:56, Binbin Wu wrote:
> 
> 
> 
> On 11/22/2024 1:23 PM, Xiaoyao Li wrote:
> [...]
>>> +
>>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>> +{
>>> +    struct vcpu_tdx *tdx = to_tdx(vcpu);
>>> +
>>> +    /* TDX exit handle takes care of this error case. */
>>> +    if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
>>> +        /* Set to avoid collision with EXIT_REASON_EXCEPTION_NMI. */
>>
>> It seems the check fits better in tdx_vcpu_pre_run().
> 
> Indeed, it's cleaner to move the check to vcpu_pre_run.
> Then no need to set the value to vp_enter_ret, and the comments are not
> needed.

And we can take out the same check in tdx_handle_exit()
because it won't get there if ->vcpu_pre_run() fails.

> 
>>
>> And without the patch of how TDX handles Exit (i.e., how deal with vp_enter_ret), it's hard to review this comment.
>>
>>> +        tdx->vp_enter_ret = TDX_SW_ERROR;
>>> +        return EXIT_FASTPATH_NONE;
>>> +    }
>>> +
>>> +    trace_kvm_entry(vcpu, force_immediate_exit);
>>> +
>>> +    tdx_vcpu_enter_exit(vcpu);
>>> +
>>> +    vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
>>> +
>>> +    return EXIT_FASTPATH_NONE;
>>> +}
>>> +
> [...]


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-21 20:14 ` [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest Adrian Hunter
  2024-11-22 11:10   ` Adrian Hunter
@ 2024-11-22 16:26   ` Dave Hansen
  2024-11-22 17:29     ` Edgecombe, Rick P
  1 sibling, 1 reply; 82+ messages in thread
From: Dave Hansen @ 2024-11-22 16:26 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 11/21/24 12:14, Adrian Hunter wrote:
> +u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
> +{
> +	args->rcx = tdvpr;
> +
> +	return __seamcall_saved_ret(TDH_VP_ENTER, args);
> +}
> +EXPORT_SYMBOL_GPL(tdh_vp_enter);

I made a similar comment on another series, but it stands here too: the
typing of this wrappers really needs a closer look. Passing u64's around
everywhere means zero type safety.

Type safety is the reason that we have types like pte_t and pgprot_t in
mm code even though they're really just longs (most of the time).

I'd suggest keeping the tdx_td_page type as long as possible, probably
until (for example) the ->rcx assignment, like this:

	args->rcx = td_page.pa;

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-22 11:10   ` Adrian Hunter
@ 2024-11-22 16:33     ` Dave Hansen
  2024-11-25 13:40       ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Dave Hansen @ 2024-11-22 16:33 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 11/22/24 03:10, Adrian Hunter wrote:
> +struct tdh_vp_enter_tdcall {
> +	u64	reg_mask	: 32,
> +		vm_idx		:  2,
> +		reserved_0	: 30;
> +	u64	data[TDX_ERR_DATA_PART_2];
> +	u64	fn;	/* Non-zero for hypercalls, zero otherwise */
> +	u64	subfn;
> +	union {
> +		struct tdh_vp_enter_vmcall 		vmcall;
> +		struct tdh_vp_enter_gettdvmcallinfo	gettdvmcallinfo;
> +		struct tdh_vp_enter_mapgpa		mapgpa;
> +		struct tdh_vp_enter_getquote		getquote;
> +		struct tdh_vp_enter_reportfatalerror	reportfatalerror;
> +		struct tdh_vp_enter_cpuid		cpuid;
> +		struct tdh_vp_enter_mmio		mmio;
> +		struct tdh_vp_enter_hlt			hlt;
> +		struct tdh_vp_enter_io			io;
> +		struct tdh_vp_enter_rd			rd;
> +		struct tdh_vp_enter_wr			wr;
> +	};
> +};

Let's say someone declares this:

struct tdh_vp_enter_mmio {
	u64	size;
	u64	mmio_addr;
	u64	direction;
	u64	value;
};

How long is that going to take you to debug?


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-22 16:26   ` Dave Hansen
@ 2024-11-22 17:29     ` Edgecombe, Rick P
  2024-11-25 13:43       ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Edgecombe, Rick P @ 2024-11-22 17:29 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, Hansen, Dave,
	Hunter, Adrian, seanjc@google.com, dave.hansen@linux.intel.com
  Cc: Gao, Chao, Yang, Weijiang, x86@kernel.org, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, Chatre, Reinette,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	dmatlack@google.com, Zhao, Yan Y, Yamahata, Isaku,
	nik.borisov@suse.com

On Fri, 2024-11-22 at 08:26 -0800, Dave Hansen wrote:
> On 11/21/24 12:14, Adrian Hunter wrote:
> > +u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
> > +{
> > +	args->rcx = tdvpr;
> > +
> > +	return __seamcall_saved_ret(TDH_VP_ENTER, args);
> > +}
> > +EXPORT_SYMBOL_GPL(tdh_vp_enter);
> 
> I made a similar comment on another series, but it stands here too: the
> typing of this wrappers really needs a closer look. Passing u64's around
> everywhere means zero type safety.
> 
> Type safety is the reason that we have types like pte_t and pgprot_t in
> mm code even though they're really just longs (most of the time).
> 
> I'd suggest keeping the tdx_td_page type as long as possible, probably
> until (for example) the ->rcx assignment, like this:
> 
> 	args->rcx = td_page.pa;

Any thoughts on the approach here to the type questions?

https://lore.kernel.org/kvm/20241115202028.1585487-1-rick.p.edgecombe@intel.com/



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (6 preceding siblings ...)
  2024-11-21 20:14 ` [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list Adrian Hunter
@ 2024-11-25  1:25 ` Binbin Wu
  2024-11-25 15:19   ` Sean Christopherson
  2024-12-10 18:23 ` Paolo Bonzini
  8 siblings, 1 reply; 82+ messages in thread
From: Binbin Wu @ 2024-11-25  1:25 UTC (permalink / raw)
  To: seanjc, Adrian Hunter, pbonzini
  Cc: dave.hansen, kvm, rick.p.edgecombe, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang




On 11/22/2024 4:14 AM, Adrian Hunter wrote:
[...]
>    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
>      and guest_state_exit_irqoff() which comments say should be
>      called from non-instrumentable code but noinst was removed
>      at Sean's suggestion:
>    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
>      noinstr is also needed to retain NMI-blocking by avoiding
>      instrumented code that leads to an IRET which unblocks NMIs.
>      A later patch set will deal with NMI VM-exits.
>
In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
"The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
that applies to TDX.  Either that, or there are some massive bugs lurking due to
missing code."

I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
noinstr section to avoid the unblock of NMIs due to instrumentation-induced
fault.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-11-22  5:49   ` Chao Gao
@ 2024-11-25 11:10     ` Adrian Hunter
  2024-11-26  2:20       ` Chao Gao
  2024-12-17 16:09       ` Sean Christopherson
  2024-11-25 11:34     ` Adrian Hunter
  1 sibling, 2 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-25 11:10 UTC (permalink / raw)
  To: Chao Gao
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 22/11/24 07:49, Chao Gao wrote:
>> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +
>> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>> +	    kvm_host.xss != (kvm_tdx->xfam &
>> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
> 
> Should we drop CET/PT from this series? I think they are worth a new
> patch/series.

This is not really about CET/PT

What is happening here is that we are calculating the current
MSR_IA32_XSS value based on the TDX Module spec which says the
TDX Module sets MSR_IA32_XSS to the XSS bits from XFAM.  The
TDX Module does that literally, from TDX Module source code:

	#define XCR0_SUPERVISOR_BIT_MASK            0x0001FD00
and
	ia32_wrmsr(IA32_XSS_MSR_ADDR, xfam & XCR0_SUPERVISOR_BIT_MASK);

For KVM, rather than:

			kvm_tdx->xfam &
			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)

it would be more direct to define the bits and enforce them
via tdx_get_supported_xfam() e.g.

/* 
 * Before returning from TDH.VP.ENTER, the TDX Module assigns:
 *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
 *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
 */
#define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
#define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
#define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)

static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
{
	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;

	/* Ensure features are in the masks */
	val &= TDX_XFAM_MASK;

	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
		return 0;

	val &= td_conf->xfam_fixed0;

	return val;
}

and then:

	if (static_cpu_has(X86_FEATURE_XSAVE) &&
	    kvm_host.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK))
		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
	if (static_cpu_has(X86_FEATURE_XSAVES) &&
	    kvm_host.xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK))
		wrmsrl(MSR_IA32_XSS, kvm_host.xss);

> 
>> +		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>> +	if (static_cpu_has(X86_FEATURE_PKU) &&
> 
> How about using cpu_feature_enabled()? It is used in kvm_load_host_xsave_state()
> It handles the case where CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not
> enabled.
> 
>> +	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
>> +		write_pkru(vcpu->arch.host_pkru);
> 
> If host_pkru happens to match the hardware value after TD-exits, the write can
> be omitted, similar to what is done above for xss and xcr0.

True.  It might be better to make restoring PKRU a separate
patch so that the commit message can explain why it needs to
be done here.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-11-22  5:49   ` Chao Gao
  2024-11-25 11:10     ` Adrian Hunter
@ 2024-11-25 11:34     ` Adrian Hunter
  1 sibling, 0 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-25 11:34 UTC (permalink / raw)
  To: Chao Gao
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 22/11/24 07:49, Chao Gao wrote:
>> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +
>> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>> +	    kvm_host.xss != (kvm_tdx->xfam &
>> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
> 
> Should we drop CET/PT from this series? I think they are worth a new
> patch/series.
> 
>> +		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>> +	if (static_cpu_has(X86_FEATURE_PKU) &&
> 
> How about using cpu_feature_enabled()? It is used in kvm_load_host_xsave_state()
> It handles the case where CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not
> enabled.

Seems reasonable

> 
>> +	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
>> +		write_pkru(vcpu->arch.host_pkru);
> 
> If host_pkru happens to match the hardware value after TD-exits, the write can
> be omitted, similar to what is done above for xss and xcr0.
> 
>> +}


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-22 16:33     ` Dave Hansen
@ 2024-11-25 13:40       ` Adrian Hunter
  2024-11-28 11:13         ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-25 13:40 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 22/11/24 18:33, Dave Hansen wrote:
> On 11/22/24 03:10, Adrian Hunter wrote:
>> +struct tdh_vp_enter_tdcall {
>> +	u64	reg_mask	: 32,
>> +		vm_idx		:  2,
>> +		reserved_0	: 30;
>> +	u64	data[TDX_ERR_DATA_PART_2];
>> +	u64	fn;	/* Non-zero for hypercalls, zero otherwise */
>> +	u64	subfn;
>> +	union {
>> +		struct tdh_vp_enter_vmcall 		vmcall;
>> +		struct tdh_vp_enter_gettdvmcallinfo	gettdvmcallinfo;
>> +		struct tdh_vp_enter_mapgpa		mapgpa;
>> +		struct tdh_vp_enter_getquote		getquote;
>> +		struct tdh_vp_enter_reportfatalerror	reportfatalerror;
>> +		struct tdh_vp_enter_cpuid		cpuid;
>> +		struct tdh_vp_enter_mmio		mmio;
>> +		struct tdh_vp_enter_hlt			hlt;
>> +		struct tdh_vp_enter_io			io;
>> +		struct tdh_vp_enter_rd			rd;
>> +		struct tdh_vp_enter_wr			wr;
>> +	};
>> +};
> 
> Let's say someone declares this:
> 
> struct tdh_vp_enter_mmio {
> 	u64	size;
> 	u64	mmio_addr;
> 	u64	direction;
> 	u64	value;
> };
> 
> How long is that going to take you to debug?

When adding a new hardware definition, it would be sensible
to check the hardware definition first before checking anything
else.

However, to stop existing members being accidentally moved,
could add:

#define CHECK_OFFSETS_EQ(reg, member) \
	BUILD_BUG_ON(offsetof(struct tdx_module_args, reg) != offsetof(union tdh_vp_enter_args, member));

	CHECK_OFFSETS_EQ(r12, tdcall.mmio.size);
	CHECK_OFFSETS_EQ(r13, tdcall.mmio.direction);
	CHECK_OFFSETS_EQ(r14, tdcall.mmio.mmio_addr);
	CHECK_OFFSETS_EQ(r15, tdcall.mmio.value);


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-22 17:29     ` Edgecombe, Rick P
@ 2024-11-25 13:43       ` Adrian Hunter
  0 siblings, 0 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-25 13:43 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org, pbonzini@redhat.com,
	Hansen, Dave, seanjc@google.com, dave.hansen@linux.intel.com
  Cc: Gao, Chao, Yang, Weijiang, x86@kernel.org, Huang, Kai,
	binbin.wu@linux.intel.com, Li, Xiaoyao, Chatre, Reinette,
	linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com,
	dmatlack@google.com, Zhao, Yan Y, Yamahata, Isaku,
	nik.borisov@suse.com

On 22/11/24 19:29, Edgecombe, Rick P wrote:
> On Fri, 2024-11-22 at 08:26 -0800, Dave Hansen wrote:
>> On 11/21/24 12:14, Adrian Hunter wrote:
>>> +u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
>>> +{
>>> +   args->rcx = tdvpr;
>>> +
>>> +   return __seamcall_saved_ret(TDH_VP_ENTER, args);
>>> +}
>>> +EXPORT_SYMBOL_GPL(tdh_vp_enter);
>>
>> I made a similar comment on another series, but it stands here too: the
>> typing of this wrappers really needs a closer look. Passing u64's around
>> everywhere means zero type safety.
>>
>> Type safety is the reason that we have types like pte_t and pgprot_t in
>> mm code even though they're really just longs (most of the time).
>>
>> I'd suggest keeping the tdx_td_page type as long as possible, probably
>> until (for example) the ->rcx assignment, like this:
>>
>>       args->rcx = td_page.pa;
> 
> Any thoughts on the approach here to the type questions?
> 
> https://lore.kernel.org/kvm/20241115202028.1585487-1-rick.p.edgecombe@intel.com/

For tdh_vp_enter() we will just use the same approach for
tdvpr, whatever that ends up being.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 3/7] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  2024-11-21 20:14 ` [PATCH 3/7] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) Adrian Hunter
@ 2024-11-25 14:12   ` Nikolay Borisov
  2024-11-26 16:15     ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Nikolay Borisov @ 2024-11-25 14:12 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, linux-kernel,
	x86, yan.y.zhao, chao.gao, weijiang.yang



On 21.11.24 г. 22:14 ч., Adrian Hunter wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> On entering/exiting TDX vcpu, preserved or clobbered CPU state is different
> from the VMX case. Add TDX hooks to save/restore host/guest CPU state.
> Save/restore kernel GS base MSR.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> TD vcpu enter/exit v1:
>   - Clarify comment (Binbin)
>   - Use lower case preserved and add the for VMX in log (Tony)
>   - Fix bisectability issue with includes (Kai)
> ---
>   arch/x86/kvm/vmx/main.c    | 24 ++++++++++++++++++--
>   arch/x86/kvm/vmx/tdx.c     | 46 ++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/tdx.h     |  4 ++++
>   arch/x86/kvm/vmx/x86_ops.h |  4 ++++
>   4 files changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 44ec6005a448..3a8ffc199be2 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -129,6 +129,26 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	vmx_vcpu_load(vcpu, cpu);
>   }
>   
> +static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		tdx_prepare_switch_to_guest(vcpu);
> +		return;
> +	}
> +
> +	vmx_prepare_switch_to_guest(vcpu);
> +}
> +
> +static void vt_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		tdx_vcpu_put(vcpu);
> +		return;
> +	}
> +
> +	vmx_vcpu_put(vcpu);
> +}
> +
>   static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -250,9 +270,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.vcpu_free = vt_vcpu_free,
>   	.vcpu_reset = vt_vcpu_reset,
>   
> -	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
> +	.prepare_switch_to_guest = vt_prepare_switch_to_guest,
>   	.vcpu_load = vt_vcpu_load,
> -	.vcpu_put = vmx_vcpu_put,
> +	.vcpu_put = vt_vcpu_put,
>   
>   	.update_exception_bitmap = vmx_update_exception_bitmap,
>   	.get_feature_msr = vmx_get_feature_msr,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 5fa5b65b9588..6e4ea2d420bc 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1,6 +1,7 @@
>   // SPDX-License-Identifier: GPL-2.0
>   #include <linux/cleanup.h>
>   #include <linux/cpu.h>
> +#include <linux/mmu_context.h>
>   #include <asm/tdx.h>
>   #include "capabilities.h"
>   #include "mmu.h"
> @@ -9,6 +10,7 @@
>   #include "vmx.h"
>   #include "mmu/spte.h"
>   #include "common.h"
> +#include "posted_intr.h"
>   
>   #include <trace/events/kvm.h>
>   #include "trace.h"
> @@ -605,6 +607,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>   	if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
>   		vcpu->arch.xfd_no_write_intercept = true;
>   
> +	tdx->host_state_need_save = true;
> +	tdx->host_state_need_restore = false;

nit: Rather than have 2 separate values which actually work in tandem, 
why not define a u8 or even u32 and have a mask of the valid flags.

So you can have something like:

#define SAVE_HOST BIT(0)
#define RESTORE_HOST BIT(1)

tdx->state_flags = SAVE_HOST

I don't know what are the plans for the future but there might be cases 
where you can have more complex flags composed of more simple ones.

>   	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
>   
>   	return 0;
> @@ -631,6 +636,45 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	local_irq_enable();
>   }
>   
> +/*
> + * Compared to vmx_prepare_switch_to_guest(), there is not much to do
> + * as SEAMCALL/SEAMRET calls take care of most of save and restore.
> + */
> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	if (!tdx->host_state_need_save)
if (!(tdx->state_flags & SAVE_HOST))
> +		return;
> +
> +	if (likely(is_64bit_mm(current->mm)))
> +		tdx->msr_host_kernel_gs_base = current->thread.gsbase;
> +	else
> +		tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> +
> +	tdx->host_state_need_save = false;

tdx->state &= ~SAVE_HOST
> +}
> +
> +static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	tdx->host_state_need_save = true;
> +	if (!tdx->host_state_need_restore)
if (!(tdx->state_flags & RESTORE_HOST)

> +		return;
> +
> +	++vcpu->stat.host_state_reload;
> +
> +	wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
> +	tdx->host_state_need_restore = false;
> +}
> +
> +void tdx_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> +	vmx_vcpu_pi_put(vcpu);
> +	tdx_prepare_switch_to_host(vcpu);
> +}
> +
>   void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -732,6 +776,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>   
>   	tdx_vcpu_enter_exit(vcpu);
>   
> +	tdx->host_state_need_restore = true;

tdx->state_flags |= RESTORE_HOST

> +
>   	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>   	trace_kvm_exit(vcpu, KVM_ISA_VMX);
>   
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index ebee1049b08b..48cf0a1abfcc 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -54,6 +54,10 @@ struct vcpu_tdx {
>   	u64 vp_enter_ret;
>   
>   	enum vcpu_tdx_state state;
> +
> +	bool host_state_need_save;
> +	bool host_state_need_restore;

this would save having a discrete member for those boolean checks.

> +	u64 msr_host_kernel_gs_base;
>   };
>   
>   void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 3d292a677b92..5bd45a720007 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -130,6 +130,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>   fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_put(struct kvm_vcpu *vcpu);
>   
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>   
> @@ -161,6 +163,8 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediat
>   {
>   	return EXIT_FASTPATH_NONE;
>   }
> +static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
>   
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>   


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-25  1:25 ` [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Binbin Wu
@ 2024-11-25 15:19   ` Sean Christopherson
  2024-11-25 19:50     ` Huang, Kai
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2024-11-25 15:19 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Adrian Hunter, pbonzini, dave.hansen, kvm, rick.p.edgecombe,
	kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	chao.gao, weijiang.yang

On Mon, Nov 25, 2024, Binbin Wu wrote:
> On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> [...]
> >    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
> >      and guest_state_exit_irqoff() which comments say should be
> >      called from non-instrumentable code but noinst was removed
> >      at Sean's suggestion:
> >    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
> >      noinstr is also needed to retain NMI-blocking by avoiding
> >      instrumented code that leads to an IRET which unblocks NMIs.
> >      A later patch set will deal with NMI VM-exits.
> > 
> In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
> "The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
> that applies to TDX.  Either that, or there are some massive bugs lurking due to
> missing code."
> 
> I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
> TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
> noinstr section to avoid the unblock of NMIs due to instrumentation-induced
> fault.

With TDX, SEAMCALL is mechnically a VM-Exit.  KVM is the "guest" running in VMX
root mode, and the TDX-Module is the "host", running in SEAM root mode.

And for TDH.VP.ENTER, if a hardware NMI arrives with the TDX guest is active,
the initial NMI VM-Exit, which consumes the NMI and blocks further NMIs, goes
from SEAM non-root to SEAM root.  The SEAMRET from SEAM root to VMX root (KVM)
is effectively a VM-Enter, and does NOT block NMIs in VMX root (at least, AFAIK).

So trying to handle the NMI "exit" in a noinstr section is pointless because NMIs
are never blocked.

TDX is also different because KVM isn't responsible for context switching guest
state.  Specifically, CR2 is managed by the TDX Module, and so there is no window
where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
with a host value, e.g. due to take a #PF due instrumentation triggering something.

All that said, I did forget that code that runs between guest_state_enter_irqoff()
and guest_state_exit_irqoff() can't be instrumeneted.  And at least as of patch 2
in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
as noinstr.  Just please make sure nothing else is added in the noinstr section
unless it absolutely needs to be there.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-25 15:19   ` Sean Christopherson
@ 2024-11-25 19:50     ` Huang, Kai
  2024-11-25 22:51       ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Huang, Kai @ 2024-11-25 19:50 UTC (permalink / raw)
  To: seanjc@google.com, binbin.wu@linux.intel.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Chatre, Reinette, Yang, Weijiang,
	pbonzini@redhat.com, Yamahata, Isaku, dmatlack@google.com,
	nik.borisov@suse.com, tony.lindgren@linux.intel.com, Gao, Chao,
	Edgecombe, Rick P, x86@kernel.org

On Mon, 2024-11-25 at 07:19 -0800, Sean Christopherson wrote:
> On Mon, Nov 25, 2024, Binbin Wu wrote:
> > On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> > [...]
> > >    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
> > >      and guest_state_exit_irqoff() which comments say should be
> > >      called from non-instrumentable code but noinst was removed
> > >      at Sean's suggestion:
> > >    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
> > >      noinstr is also needed to retain NMI-blocking by avoiding
> > >      instrumented code that leads to an IRET which unblocks NMIs.
> > >      A later patch set will deal with NMI VM-exits.
> > > 
> > In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
> > "The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> > like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
> > that applies to TDX.  Either that, or there are some massive bugs lurking due to
> > missing code."
> > 
> > I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
> > TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
> > noinstr section to avoid the unblock of NMIs due to instrumentation-induced
> > fault.
> 
> With TDX, SEAMCALL is mechnically a VM-Exit.  KVM is the "guest" running in VMX
> root mode, and the TDX-Module is the "host", running in SEAM root mode.
> 
> And for TDH.VP.ENTER, if a hardware NMI arrives with the TDX guest is active,
> the initial NMI VM-Exit, which consumes the NMI and blocks further NMIs, goes
> from SEAM non-root to SEAM root.  The SEAMRET from SEAM root to VMX root (KVM)
> is effectively a VM-Enter, and does NOT block NMIs in VMX root (at least, AFAIK).
> 
> So trying to handle the NMI "exit" in a noinstr section is pointless because NMIs
> are never blocked.

I thought NMI remains being blocked after SEAMRET?

The TDX CPU architecture extension spec says:

"
On transition to SEAM VMX root operation, the processor can inhibit NMI and SMI.
While inhibited, if these events occur, then they are tailored to stay pending
and be delivered when the inhibit state is removed. NMI and external interrupts
can be uninhibited in SEAM VMX-root operation. In SEAM VMX-root operation,
MSR_INTR_PENDING can be read to help determine status of any pending events.

On transition to SEAM VMX non-root operation using a VM entry, NMI and INTR
inhibit states are, by design, updated based on the configuration of the TD VMCS
used to perform the VM entry.

...

On transition to legacy VMX root operation using SEAMRET, the NMI and SMI
inhibit state can be restored to the inhibit state at the time of the previous
SEAMCALL and any pending NMI/SMI would be delivered if not inhibited.
"

Here NMI is inhibited in SEAM VMX root, but is never inhibited in VMX root.  

And the last paragraph does say "any pending NMI would be delivered if not
inhibited".  

But I thought this applies to the case when "NMI happens in SEAM VMX root", but
not "NMI happens in SEAM VMX non-root"?  I thought the NMI is already
"delivered" when CPU is in "SEAM VMX non-root", but I guess I was wrong here..

> 
> TDX is also different because KVM isn't responsible for context switching guest
> state.  Specifically, CR2 is managed by the TDX Module, and so there is no window
> where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
> with a host value, e.g. due to take a #PF due instrumentation triggering something.
> 
> All that said, I did forget that code that runs between guest_state_enter_irqoff()
> and guest_state_exit_irqoff() can't be instrumeneted.  And at least as of patch 2
> in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
> as noinstr.  Just please make sure nothing else is added in the noinstr section
> unless it absolutely needs to be there.

If NMI is not a concern, is below also an option?

	guest_state_enter_irqoff();

	instructmentation_begin();
	tdh_vp_enter();
	instructmentation_end();

	guest_state_exit_irqoff();


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-25 19:50     ` Huang, Kai
@ 2024-11-25 22:51       ` Sean Christopherson
  2024-11-26  1:43         ` Huang, Kai
  2024-11-26  1:44         ` Binbin Wu
  0 siblings, 2 replies; 82+ messages in thread
From: Sean Christopherson @ 2024-11-25 22:51 UTC (permalink / raw)
  To: Kai Huang
  Cc: binbin.wu@linux.intel.com, kvm@vger.kernel.org, Xiaoyao Li,
	Yan Y Zhao, dave.hansen@linux.intel.com, Adrian Hunter,
	linux-kernel@vger.kernel.org, Reinette Chatre, Weijiang Yang,
	pbonzini@redhat.com, Isaku Yamahata, dmatlack@google.com,
	nik.borisov@suse.com, tony.lindgren@linux.intel.com, Chao Gao,
	Rick P Edgecombe, x86@kernel.org

On Mon, Nov 25, 2024, Kai Huang wrote:
> On Mon, 2024-11-25 at 07:19 -0800, Sean Christopherson wrote:
> > On Mon, Nov 25, 2024, Binbin Wu wrote:
> > > On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> > > [...]
> > > >    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
> > > >      and guest_state_exit_irqoff() which comments say should be
> > > >      called from non-instrumentable code but noinst was removed
> > > >      at Sean's suggestion:
> > > >    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
> > > >      noinstr is also needed to retain NMI-blocking by avoiding
> > > >      instrumented code that leads to an IRET which unblocks NMIs.
> > > >      A later patch set will deal with NMI VM-exits.
> > > > 
> > > In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
> > > "The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> > > like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
> > > that applies to TDX.  Either that, or there are some massive bugs lurking due to
> > > missing code."
> > > 
> > > I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
> > > TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
> > > noinstr section to avoid the unblock of NMIs due to instrumentation-induced
> > > fault.
> > 
> > With TDX, SEAMCALL is mechnically a VM-Exit.  KVM is the "guest" running in VMX
> > root mode, and the TDX-Module is the "host", running in SEAM root mode.
> > 
> > And for TDH.VP.ENTER, if a hardware NMI arrives with the TDX guest is active,
> > the initial NMI VM-Exit, which consumes the NMI and blocks further NMIs, goes
> > from SEAM non-root to SEAM root.  The SEAMRET from SEAM root to VMX root (KVM)
> > is effectively a VM-Enter, and does NOT block NMIs in VMX root (at least, AFAIK).
> > 
> > So trying to handle the NMI "exit" in a noinstr section is pointless because NMIs
> > are never blocked.
> 
> I thought NMI remains being blocked after SEAMRET?

No, because NMIs weren't blocked at SEAMCALL.

> The TDX CPU architecture extension spec says:
> 
> "
> On transition to SEAM VMX root operation, the processor can inhibit NMI and SMI.
> While inhibited, if these events occur, then they are tailored to stay pending
> and be delivered when the inhibit state is removed. NMI and external interrupts
> can be uninhibited in SEAM VMX-root operation. In SEAM VMX-root operation,
> MSR_INTR_PENDING can be read to help determine status of any pending events.
> 
> On transition to SEAM VMX non-root operation using a VM entry, NMI and INTR
> inhibit states are, by design, updated based on the configuration of the TD VMCS
> used to perform the VM entry.
> 
> ...
> 
> On transition to legacy VMX root operation using SEAMRET, the NMI and SMI
> inhibit state can be restored to the inhibit state at the time of the previous
> SEAMCALL and any pending NMI/SMI would be delivered if not inhibited.
> "
> 
> Here NMI is inhibited in SEAM VMX root, but is never inhibited in VMX root.  

Yep.

> And the last paragraph does say "any pending NMI would be delivered if not
> inhibited".  

That's referring to the scenario where an NMI becomes pending while the CPU is in
SEAM, i.e. has NMIs blocked.

> But I thought this applies to the case when "NMI happens in SEAM VMX root", but
> not "NMI happens in SEAM VMX non-root"?  I thought the NMI is already
> "delivered" when CPU is in "SEAM VMX non-root", but I guess I was wrong here..

When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
TDX-Module that propagates that blocking to SEAMCALL VMCS.

Hmm, actually, this means that TDX has a causality inversion, which may become
visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
and handled before NMI X.

So the TDX-Module needs something like this:

diff --git a/src/td_transitions/td_exit.c b/src/td_transitions/td_exit.c
index eecfb2e..b5c17c3 100644
--- a/src/td_transitions/td_exit.c
+++ b/src/td_transitions/td_exit.c
@@ -527,6 +527,11 @@ void td_vmexit_to_vmm(uint8_t vcpu_state, uint8_t last_td_exit, uint64_t scrub_m
         load_xmms_by_mask(tdvps_ptr, xmm_select);
     }
 
+    if (<is NMI VM-Exit => SEAMRET)
+    {
+        set_guest_inter_blocking_by_nmi();
+    }
+
     // 7.   Run the common SEAMRET routine.
     tdx_vmm_post_dispatching();


and then KVM should indeed handle NMI exits prior to leaving the noinstr section.
 
> > TDX is also different because KVM isn't responsible for context switching guest
> > state.  Specifically, CR2 is managed by the TDX Module, and so there is no window
> > where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
> > with a host value, e.g. due to take a #PF due instrumentation triggering something.
> > 
> > All that said, I did forget that code that runs between guest_state_enter_irqoff()
> > and guest_state_exit_irqoff() can't be instrumeneted.  And at least as of patch 2
> > in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
> > as noinstr.  Just please make sure nothing else is added in the noinstr section
> > unless it absolutely needs to be there.
> 
> If NMI is not a concern, is below also an option?

No, because instrumentation needs to be prohibited for the entire time between
guest_state_enter_irqoff() and guest_state_exit_irqoff().

> 	guest_state_enter_irqoff();
> 
> 	instructmentation_begin();
> 	tdh_vp_enter();
> 	instructmentation_end();
> 
> 	guest_state_exit_irqoff();
> 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-25 22:51       ` Sean Christopherson
@ 2024-11-26  1:43         ` Huang, Kai
  2024-11-26  1:44         ` Binbin Wu
  1 sibling, 0 replies; 82+ messages in thread
From: Huang, Kai @ 2024-11-26  1:43 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: dmatlack@google.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Zhao, Yan Y, dave.hansen@linux.intel.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, pbonzini@redhat.com, Chatre, Reinette,
	Yamahata, Isaku, nik.borisov@suse.com,
	tony.lindgren@linux.intel.com, Gao, Chao, Edgecombe, Rick P,
	x86@kernel.org

On Mon, 2024-11-25 at 14:51 -0800, Sean Christopherson wrote:
> On Mon, Nov 25, 2024, Kai Huang wrote:
> > On Mon, 2024-11-25 at 07:19 -0800, Sean Christopherson wrote:
> > > On Mon, Nov 25, 2024, Binbin Wu wrote:
> > > > On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> > > > [...]
> > > > >    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
> > > > >      and guest_state_exit_irqoff() which comments say should be
> > > > >      called from non-instrumentable code but noinst was removed
> > > > >      at Sean's suggestion:
> > > > >    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
> > > > >      noinstr is also needed to retain NMI-blocking by avoiding
> > > > >      instrumented code that leads to an IRET which unblocks NMIs.
> > > > >      A later patch set will deal with NMI VM-exits.
> > > > > 
> > > > In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
> > > > "The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> > > > like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
> > > > that applies to TDX.  Either that, or there are some massive bugs lurking due to
> > > > missing code."
> > > > 
> > > > I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
> > > > TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
> > > > noinstr section to avoid the unblock of NMIs due to instrumentation-induced
> > > > fault.
> > > 
> > > With TDX, SEAMCALL is mechnically a VM-Exit.  KVM is the "guest" running in VMX
> > > root mode, and the TDX-Module is the "host", running in SEAM root mode.
> > > 
> > > And for TDH.VP.ENTER, if a hardware NMI arrives with the TDX guest is active,
> > > the initial NMI VM-Exit, which consumes the NMI and blocks further NMIs, goes
> > > from SEAM non-root to SEAM root.  The SEAMRET from SEAM root to VMX root (KVM)
> > > is effectively a VM-Enter, and does NOT block NMIs in VMX root (at least, AFAIK).
> > > 
> > > So trying to handle the NMI "exit" in a noinstr section is pointless because NMIs
> > > are never blocked.
> > 
> > I thought NMI remains being blocked after SEAMRET?
> 
> No, because NMIs weren't blocked at SEAMCALL.
> 
> > The TDX CPU architecture extension spec says:
> > 
> > "
> > On transition to SEAM VMX root operation, the processor can inhibit NMI and SMI.
> > While inhibited, if these events occur, then they are tailored to stay pending
> > and be delivered when the inhibit state is removed. NMI and external interrupts
> > can be uninhibited in SEAM VMX-root operation. In SEAM VMX-root operation,
> > MSR_INTR_PENDING can be read to help determine status of any pending events.
> > 
> > On transition to SEAM VMX non-root operation using a VM entry, NMI and INTR
> > inhibit states are, by design, updated based on the configuration of the TD VMCS
> > used to perform the VM entry.
> > 
> > ...
> > 
> > On transition to legacy VMX root operation using SEAMRET, the NMI and SMI
> > inhibit state can be restored to the inhibit state at the time of the previous
> > SEAMCALL and any pending NMI/SMI would be delivered if not inhibited.
> > "
> > 
> > Here NMI is inhibited in SEAM VMX root, but is never inhibited in VMX root.  
> 
> Yep.
> 
> > And the last paragraph does say "any pending NMI would be delivered if not
> > inhibited".  
> 
> That's referring to the scenario where an NMI becomes pending while the CPU is in
> SEAM, i.e. has NMIs blocked.
> 
> > But I thought this applies to the case when "NMI happens in SEAM VMX root", but
> > not "NMI happens in SEAM VMX non-root"?  I thought the NMI is already
> > "delivered" when CPU is in "SEAM VMX non-root", but I guess I was wrong here..
> 
> When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> TDX-Module that propagates that blocking to SEAMCALL VMCS.

Oh, I didn't read the module code, but was trying to looking for clue from the
TDX specs.  It was a surprise to me that VMX case and TDX case have different
behaviour in terms of "NMI blocking when exiting to _host_ VMM".

I was thinking SEAMRET (or hardware in general) should have done something to
make sure of it.

> 
> Hmm, actually, this means that TDX has a causality inversion, which may become
> visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> and handled before NMI X.

Sorry, NMI X was acked by CPU firstly before NMI X+1, why is NMI X+1 delivered
before NMI X?

> 
> So the TDX-Module needs something like this:
> 
> diff --git a/src/td_transitions/td_exit.c b/src/td_transitions/td_exit.c
> index eecfb2e..b5c17c3 100644
> --- a/src/td_transitions/td_exit.c
> +++ b/src/td_transitions/td_exit.c
> @@ -527,6 +527,11 @@ void td_vmexit_to_vmm(uint8_t vcpu_state, uint8_t last_td_exit, uint64_t scrub_m
>          load_xmms_by_mask(tdvps_ptr, xmm_select);
>      }
>  
> +    if (<is NMI VM-Exit => SEAMRET)
> +    {
> +        set_guest_inter_blocking_by_nmi();
> +    }
> +
>      // 7.   Run the common SEAMRET routine.
>      tdx_vmm_post_dispatching();
> 
> 
> and then KVM should indeed handle NMI exits prior to leaving the noinstr section.

Yeah to me it should be done unconditionally as it gives the same behaviour to
the normal VMX VM-Exit case, that NMI is left blocked after exiting to the host
VMM.  The NMI Exit reason is passed to host VMM anyway.

If NMI is handled immediately after SEAMRET, KVM won't have any chance to do
additional things before handing NMI like below for VMX:

  kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
  // call NMI handling routine
  kvm_after_interrupt(vcpu);

I suppose this should be a concern?



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-25 22:51       ` Sean Christopherson
  2024-11-26  1:43         ` Huang, Kai
@ 2024-11-26  1:44         ` Binbin Wu
  2024-11-26  3:52           ` Huang, Kai
  1 sibling, 1 reply; 82+ messages in thread
From: Binbin Wu @ 2024-11-26  1:44 UTC (permalink / raw)
  To: Sean Christopherson, Kai Huang
  Cc: kvm@vger.kernel.org, Xiaoyao Li, Yan Y Zhao,
	dave.hansen@linux.intel.com, Adrian Hunter,
	linux-kernel@vger.kernel.org, Reinette Chatre, Weijiang Yang,
	pbonzini@redhat.com, Isaku Yamahata, dmatlack@google.com,
	nik.borisov@suse.com, tony.lindgren@linux.intel.com, Chao Gao,
	Rick P Edgecombe, x86@kernel.org




On 11/26/2024 6:51 AM, Sean Christopherson wrote:

[...]
> When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> TDX-Module that propagates that blocking to SEAMCALL VMCS.
I see, thanks for the explanation!

>
> Hmm, actually, this means that TDX has a causality inversion, which may become
> visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> and handled before NMI X.

This example can also cause an issue without FRED.
1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
2. NMI X+1 becomes pending while SEAM root is active.
3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
4. NMI X+1 is delivered and handled before NMI X.
    (NMI handler could handle all NMI source events, including the source
     triggered NMI X)
5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
In step 5, because the source event caused NMI X has been handled, and NMI X
will not be detected as a second half of back-to-back NMIs, according to
Linux NMI handler, it will be considered as an unknown NMI.

Actually, the issue could happen if NMI X+1 occurs after exiting to SEAM root
mode and before KVM handling the VM-Exit caused by NMI X.


>
> So the TDX-Module needs something like this:
>
> diff --git a/src/td_transitions/td_exit.c b/src/td_transitions/td_exit.c
> index eecfb2e..b5c17c3 100644
> --- a/src/td_transitions/td_exit.c
> +++ b/src/td_transitions/td_exit.c
> @@ -527,6 +527,11 @@ void td_vmexit_to_vmm(uint8_t vcpu_state, uint8_t last_td_exit, uint64_t scrub_m
>           load_xmms_by_mask(tdvps_ptr, xmm_select);
>       }
>   
> +    if (<is NMI VM-Exit => SEAMRET)
> +    {
> +        set_guest_inter_blocking_by_nmi();
> +    }
> +
>       // 7.   Run the common SEAMRET routine.
>       tdx_vmm_post_dispatching();
>
>
> and then KVM should indeed handle NMI exits prior to leaving the noinstr section.
>   
>>> TDX is also different because KVM isn't responsible for context switching guest
>>> state.  Specifically, CR2 is managed by the TDX Module, and so there is no window
>>> where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
>>> with a host value, e.g. due to take a #PF due instrumentation triggering something.
>>>
>>> All that said, I did forget that code that runs between guest_state_enter_irqoff()
>>> and guest_state_exit_irqoff() can't be instrumeneted.  And at least as of patch 2
>>> in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
>>> as noinstr.  Just please make sure nothing else is added in the noinstr section
>>> unless it absolutely needs to be there.
>> If NMI is not a concern, is below also an option?
> No, because instrumentation needs to be prohibited for the entire time between
> guest_state_enter_irqoff() and guest_state_exit_irqoff().
>
>> 	guest_state_enter_irqoff();
>>
>> 	instructmentation_begin();
>> 	tdh_vp_enter();
>> 	instructmentation_end();
>>
>> 	guest_state_exit_irqoff();
>>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-11-25 11:10     ` Adrian Hunter
@ 2024-11-26  2:20       ` Chao Gao
  2024-11-28  6:50         ` Adrian Hunter
  2024-12-17 16:09       ` Sean Christopherson
  1 sibling, 1 reply; 82+ messages in thread
From: Chao Gao @ 2024-11-26  2:20 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On Mon, Nov 25, 2024 at 01:10:37PM +0200, Adrian Hunter wrote:
>On 22/11/24 07:49, Chao Gao wrote:
>>> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>>> +{
>>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>>> +
>>> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>>> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>>> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>>> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>>> +	    kvm_host.xss != (kvm_tdx->xfam &
>>> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>>> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
>> 
>> Should we drop CET/PT from this series? I think they are worth a new
>> patch/series.
>
>This is not really about CET/PT
>
>What is happening here is that we are calculating the current
>MSR_IA32_XSS value based on the TDX Module spec which says the
>TDX Module sets MSR_IA32_XSS to the XSS bits from XFAM.  The
>TDX Module does that literally, from TDX Module source code:
>
>	#define XCR0_SUPERVISOR_BIT_MASK            0x0001FD00
>and
>	ia32_wrmsr(IA32_XSS_MSR_ADDR, xfam & XCR0_SUPERVISOR_BIT_MASK);
>
>For KVM, rather than:
>
>			kvm_tdx->xfam &
>			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
>
>it would be more direct to define the bits and enforce them
>via tdx_get_supported_xfam() e.g.
>
>/* 
> * Before returning from TDH.VP.ENTER, the TDX Module assigns:
> *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
> *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
> */
>#define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
>#define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
>#define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
>
>static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
>{
>	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
>
>	/* Ensure features are in the masks */
>	val &= TDX_XFAM_MASK;

Before exposing a feature to TD VMs, both the TDX module and KVM must support
it. In other words, kvm_tdx->xfam & kvm_caps.supported_xss should yield the
same result as kvm_tdx->xfam & TDX_XFAM_XSS_MASK. So, to me, the current
approach and your new proposal are functionally identical.

I prefer checking against kvm_caps.supported_xss because we don't need to
update TDX_XFAM_XSS/XCR0_MASK when new user/supervisor xstate bits are added.
Note kvm_caps.supported_xss/xcr0 need to be updated for normal VMs anyway.

>
>	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
>		return 0;
>
>	val &= td_conf->xfam_fixed0;
>
>	return val;
>}
>
>and then:
>
>	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>	    kvm_host.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK))
>		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>	    kvm_host.xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK))
>		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-26  1:44         ` Binbin Wu
@ 2024-11-26  3:52           ` Huang, Kai
  2024-11-26  5:29             ` Binbin Wu
  0 siblings, 1 reply; 82+ messages in thread
From: Huang, Kai @ 2024-11-26  3:52 UTC (permalink / raw)
  To: seanjc@google.com, binbin.wu@linux.intel.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, Hunter, Adrian, dmatlack@google.com,
	Yang, Weijiang, Chatre, Reinette, pbonzini@redhat.com,
	Yamahata, Isaku, tony.lindgren@linux.intel.com,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org, Gao, Chao,
	Edgecombe, Rick P, x86@kernel.org

On Tue, 2024-11-26 at 09:44 +0800, Binbin Wu wrote:
> 
> 
> On 11/26/2024 6:51 AM, Sean Christopherson wrote:
> 
> [...]
> > When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> > performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> > TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> > load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> > TDX-Module that propagates that blocking to SEAMCALL VMCS.
> I see, thanks for the explanation!
> 
> > 
> > Hmm, actually, this means that TDX has a causality inversion, which may become
> > visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> > and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> > TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> > and handled before NMI X.
> 
> This example can also cause an issue without FRED.
> 1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
> 2. NMI X+1 becomes pending while SEAM root is active.
> 3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
> 4. NMI X+1 is delivered and handled before NMI X.
>     (NMI handler could handle all NMI source events, including the source
>      triggered NMI X)
> 5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
> In step 5, because the source event caused NMI X has been handled, and NMI X
> will not be detected as a second half of back-to-back NMIs, according to
> Linux NMI handler, it will be considered as an unknown NMI.

I don't think KVM should call exc_nmi() anymore if NMI is unblocked upon
SEAMRET.

> 
> Actually, the issue could happen if NMI X+1 occurs after exiting to SEAM root
> mode and before KVM handling the VM-Exit caused by NMI X.
> 

If we can make sure NMI is still blocked upon SEAMRET then everything follows
the current VMX flow IIUC.  We should make that happen IMHO.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-26  3:52           ` Huang, Kai
@ 2024-11-26  5:29             ` Binbin Wu
  2024-11-26  5:37               ` Huang, Kai
  2024-11-26 21:41               ` Sean Christopherson
  0 siblings, 2 replies; 82+ messages in thread
From: Binbin Wu @ 2024-11-26  5:29 UTC (permalink / raw)
  To: Huang, Kai, seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, Hunter, Adrian, dmatlack@google.com,
	Yang, Weijiang, Chatre, Reinette, pbonzini@redhat.com,
	Yamahata, Isaku, tony.lindgren@linux.intel.com,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org, Gao, Chao,
	Edgecombe, Rick P, x86@kernel.org, Li, Xin3




On 11/26/2024 11:52 AM, Huang, Kai wrote:
> On Tue, 2024-11-26 at 09:44 +0800, Binbin Wu wrote:
>>
>> On 11/26/2024 6:51 AM, Sean Christopherson wrote:
>>
>> [...]
>>> When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
>>> performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
>>> TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
>>> load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
>>> TDX-Module that propagates that blocking to SEAMCALL VMCS.
>> I see, thanks for the explanation!
>>
>>> Hmm, actually, this means that TDX has a causality inversion, which may become
>>> visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
>>> and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
>>> TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
>>> and handled before NMI X.
>> This example can also cause an issue without FRED.
>> 1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
>> 2. NMI X+1 becomes pending while SEAM root is active.
>> 3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
>> 4. NMI X+1 is delivered and handled before NMI X.
>>      (NMI handler could handle all NMI source events, including the source
>>       triggered NMI X)
>> 5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
>> In step 5, because the source event caused NMI X has been handled, and NMI X
>> will not be detected as a second half of back-to-back NMIs, according to
>> Linux NMI handler, it will be considered as an unknown NMI.
> I don't think KVM should call exc_nmi() anymore if NMI is unblocked upon
> SEAMRET.

IIUC, KVM has to, because the NMI triggered the VM-Exit can't trigger the
NMI handler to be invoked automatically even if NMI is unblocked upon SEAMRET.

>
>> Actually, the issue could happen if NMI X+1 occurs after exiting to SEAM root
>> mode and before KVM handling the VM-Exit caused by NMI X.
>>
> If we can make sure NMI is still blocked upon SEAMRET then everything follows
> the current VMX flow IIUC.  We should make that happen IMHO.
>
>
Agree.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-26  5:29             ` Binbin Wu
@ 2024-11-26  5:37               ` Huang, Kai
  2024-11-26 21:41               ` Sean Christopherson
  1 sibling, 0 replies; 82+ messages in thread
From: Huang, Kai @ 2024-11-26  5:37 UTC (permalink / raw)
  To: seanjc@google.com, binbin.wu@linux.intel.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, Hunter, Adrian, Li, Xin3,
	linux-kernel@vger.kernel.org, Yang, Weijiang, Chatre, Reinette,
	dmatlack@google.com, pbonzini@redhat.com, Yamahata, Isaku,
	nik.borisov@suse.com, tony.lindgren@linux.intel.com,
	Edgecombe, Rick P, Gao, Chao, x86@kernel.org

On Tue, 2024-11-26 at 13:29 +0800, Binbin Wu wrote:
> 
> 
> On 11/26/2024 11:52 AM, Huang, Kai wrote:
> > On Tue, 2024-11-26 at 09:44 +0800, Binbin Wu wrote:
> > > 
> > > On 11/26/2024 6:51 AM, Sean Christopherson wrote:
> > > 
> > > [...]
> > > > When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> > > > performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> > > > TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> > > > load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> > > > TDX-Module that propagates that blocking to SEAMCALL VMCS.
> > > I see, thanks for the explanation!
> > > 
> > > > Hmm, actually, this means that TDX has a causality inversion, which may become
> > > > visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> > > > and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> > > > TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> > > > and handled before NMI X.
> > > This example can also cause an issue without FRED.
> > > 1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
> > > 2. NMI X+1 becomes pending while SEAM root is active.
> > > 3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
> > > 4. NMI X+1 is delivered and handled before NMI X.
> > >      (NMI handler could handle all NMI source events, including the source
> > >       triggered NMI X)
> > > 5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
> > > In step 5, because the source event caused NMI X has been handled, and NMI X
> > > will not be detected as a second half of back-to-back NMIs, according to
> > > Linux NMI handler, it will be considered as an unknown NMI.
> > I don't think KVM should call exc_nmi() anymore if NMI is unblocked upon
> > SEAMRET.
> 
> IIUC, KVM has to, because the NMI triggered the VM-Exit can't trigger the
> NMI handler to be invoked automatically even if NMI is unblocked upon SEAMRET.

Ah I missed this.  You mean unblocking NMI won't invoke NMI handler via IDT
descriptor 2.  Then I see why NMI X+1 is handled before NMI X.  Thanks.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 3/7] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  2024-11-25 14:12   ` Nikolay Borisov
@ 2024-11-26 16:15     ` Adrian Hunter
  0 siblings, 0 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-26 16:15 UTC (permalink / raw)
  To: Nikolay Borisov, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, linux-kernel,
	x86, yan.y.zhao, chao.gao, weijiang.yang

On 25/11/24 16:12, Nikolay Borisov wrote:
> 
> 
> On 21.11.24 г. 22:14 ч., Adrian Hunter wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> On entering/exiting TDX vcpu, preserved or clobbered CPU state is different
>> from the VMX case. Add TDX hooks to save/restore host/guest CPU state.
>> Save/restore kernel GS base MSR.
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>> TD vcpu enter/exit v1:
>>   - Clarify comment (Binbin)
>>   - Use lower case preserved and add the for VMX in log (Tony)
>>   - Fix bisectability issue with includes (Kai)
>> ---
>>   arch/x86/kvm/vmx/main.c    | 24 ++++++++++++++++++--
>>   arch/x86/kvm/vmx/tdx.c     | 46 ++++++++++++++++++++++++++++++++++++++
>>   arch/x86/kvm/vmx/tdx.h     |  4 ++++
>>   arch/x86/kvm/vmx/x86_ops.h |  4 ++++
>>   4 files changed, 76 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>> index 44ec6005a448..3a8ffc199be2 100644
>> --- a/arch/x86/kvm/vmx/main.c
>> +++ b/arch/x86/kvm/vmx/main.c
>> @@ -129,6 +129,26 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>       vmx_vcpu_load(vcpu, cpu);
>>   }
>>   +static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>> +{
>> +    if (is_td_vcpu(vcpu)) {
>> +        tdx_prepare_switch_to_guest(vcpu);
>> +        return;
>> +    }
>> +
>> +    vmx_prepare_switch_to_guest(vcpu);
>> +}
>> +
>> +static void vt_vcpu_put(struct kvm_vcpu *vcpu)
>> +{
>> +    if (is_td_vcpu(vcpu)) {
>> +        tdx_vcpu_put(vcpu);
>> +        return;
>> +    }
>> +
>> +    vmx_vcpu_put(vcpu);
>> +}
>> +
>>   static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
>>   {
>>       if (is_td_vcpu(vcpu))
>> @@ -250,9 +270,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>>       .vcpu_free = vt_vcpu_free,
>>       .vcpu_reset = vt_vcpu_reset,
>>   -    .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
>> +    .prepare_switch_to_guest = vt_prepare_switch_to_guest,
>>       .vcpu_load = vt_vcpu_load,
>> -    .vcpu_put = vmx_vcpu_put,
>> +    .vcpu_put = vt_vcpu_put,
>>         .update_exception_bitmap = vmx_update_exception_bitmap,
>>       .get_feature_msr = vmx_get_feature_msr,
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 5fa5b65b9588..6e4ea2d420bc 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -1,6 +1,7 @@
>>   // SPDX-License-Identifier: GPL-2.0
>>   #include <linux/cleanup.h>
>>   #include <linux/cpu.h>
>> +#include <linux/mmu_context.h>
>>   #include <asm/tdx.h>
>>   #include "capabilities.h"
>>   #include "mmu.h"
>> @@ -9,6 +10,7 @@
>>   #include "vmx.h"
>>   #include "mmu/spte.h"
>>   #include "common.h"
>> +#include "posted_intr.h"
>>     #include <trace/events/kvm.h>
>>   #include "trace.h"
>> @@ -605,6 +607,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>       if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
>>           vcpu->arch.xfd_no_write_intercept = true;
>>   +    tdx->host_state_need_save = true;
>> +    tdx->host_state_need_restore = false;
> 
> nit: Rather than have 2 separate values which actually work in tandem, why not define a u8 or even u32 and have a mask of the valid flags.
> 
> So you can have something like:
> 
> #define SAVE_HOST BIT(0)
> #define RESTORE_HOST BIT(1)
> 
> tdx->state_flags = SAVE_HOST
> 
> I don't know what are the plans for the future but there might be cases where you can have more complex flags composed of more simple ones.
> 

There are really only 3 possibilities:

	initial state (or after tdx_prepare_switch_to_host())
		tdx->host_state_need_save = true;
		tdx->host_state_need_restore = false;
	After save (i.e. after tdx_prepare_switch_to_guest())
		tdx->host_state_need_save = false
		tdx->host_state_need_restore = false;
	After enter/exit (i.e. after tdx_vcpu_enter_exit())
		tdx->host_state_need_save = false
		tdx->host_state_need_restore = true;

I can't think of good names, perhaps:

enum tdx_prepare_switch_state {
	TDX_PREP_UNSAVED,
	TDX_PREP_SAVED,
	TDX_PREP_UNRESTORED,
};

>>       tdx->state = VCPU_TD_STATE_UNINITIALIZED;
>>         return 0;
>> @@ -631,6 +636,45 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>       local_irq_enable();
>>   }
>>   +/*
>> + * Compared to vmx_prepare_switch_to_guest(), there is not much to do
>> + * as SEAMCALL/SEAMRET calls take care of most of save and restore.
>> + */
>> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>> +{
>> +    struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>> +    if (!tdx->host_state_need_save)
> if (!(tdx->state_flags & SAVE_HOST))

	if (tdx->prep_switch_state != TDX_PREP_UNSAVED)

>> +        return;
>> +
>> +    if (likely(is_64bit_mm(current->mm)))
>> +        tdx->msr_host_kernel_gs_base = current->thread.gsbase;
>> +    else
>> +        tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
>> +
>> +    tdx->host_state_need_save = false;
> 
> tdx->state &= ~SAVE_HOST

	tdx->prep_switch_state = TDX_PREP_SAVED;

>> +}
>> +
>> +static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
>> +{
>> +    struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>> +    tdx->host_state_need_save = true;
>> +    if (!tdx->host_state_need_restore)
> if (!(tdx->state_flags & RESTORE_HOST)

	if (tdx->prep_switch_state != TDX_PREP_UNRESTORED)

> 
>> +        return;
>> +
>> +    ++vcpu->stat.host_state_reload;
>> +
>> +    wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
>> +    tdx->host_state_need_restore = false;

	tdx->prep_switch_state = TDX_PREP_UNSAVED;

>> +}
>> +
>> +void tdx_vcpu_put(struct kvm_vcpu *vcpu)
>> +{
>> +    vmx_vcpu_pi_put(vcpu);
>> +    tdx_prepare_switch_to_host(vcpu);
>> +}
>> +
>>   void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>>   {
>>       struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> @@ -732,6 +776,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>         tdx_vcpu_enter_exit(vcpu);
>>   +    tdx->host_state_need_restore = true;
> 
> tdx->state_flags |= RESTORE_HOST

	tdx->prep_switch_state = TDX_PREP_UNRESTORED;

> 
>> +
>>       vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>>       trace_kvm_exit(vcpu, KVM_ISA_VMX);
>>   diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>> index ebee1049b08b..48cf0a1abfcc 100644
>> --- a/arch/x86/kvm/vmx/tdx.h
>> +++ b/arch/x86/kvm/vmx/tdx.h
>> @@ -54,6 +54,10 @@ struct vcpu_tdx {
>>       u64 vp_enter_ret;
>>         enum vcpu_tdx_state state;
>> +
>> +    bool host_state_need_save;
>> +    bool host_state_need_restore;
> 
> this would save having a discrete member for those boolean checks.
> 
>> +    u64 msr_host_kernel_gs_base;
>>   };
>>     void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
>> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
>> index 3d292a677b92..5bd45a720007 100644
>> --- a/arch/x86/kvm/vmx/x86_ops.h
>> +++ b/arch/x86/kvm/vmx/x86_ops.h
>> @@ -130,6 +130,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>>   void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>>   void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>>   fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
>> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
>> +void tdx_vcpu_put(struct kvm_vcpu *vcpu);
>>     int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>>   @@ -161,6 +163,8 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediat
>>   {
>>       return EXIT_FASTPATH_NONE;
>>   }
>> +static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
>> +static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
>>     static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>>   
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-26  5:29             ` Binbin Wu
  2024-11-26  5:37               ` Huang, Kai
@ 2024-11-26 21:41               ` Sean Christopherson
  1 sibling, 0 replies; 82+ messages in thread
From: Sean Christopherson @ 2024-11-26 21:41 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Kai Huang, kvm@vger.kernel.org, Xiaoyao Li, Yan Y Zhao,
	dave.hansen@linux.intel.com, Adrian Hunter, dmatlack@google.com,
	Weijiang Yang, Reinette Chatre, pbonzini@redhat.com,
	Isaku Yamahata, tony.lindgren@linux.intel.com,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org, Chao Gao,
	Rick P Edgecombe, x86@kernel.org, Xin3 Li

On Tue, Nov 26, 2024, Binbin Wu wrote:
> On 11/26/2024 11:52 AM, Huang, Kai wrote:
> > On Tue, 2024-11-26 at 09:44 +0800, Binbin Wu wrote:
> > > 
> > > On 11/26/2024 6:51 AM, Sean Christopherson wrote:
> > > 
> > > [...]
> > > > When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> > > > performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> > > > TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> > > > load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> > > > TDX-Module that propagates that blocking to SEAMCALL VMCS.
> > > I see, thanks for the explanation!
> > > 
> > > > Hmm, actually, this means that TDX has a causality inversion, which may become
> > > > visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> > > > and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> > > > TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> > > > and handled before NMI X.
> > > This example can also cause an issue without FRED.
> > > 1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
> > > 2. NMI X+1 becomes pending while SEAM root is active.
> > > 3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
> > > 4. NMI X+1 is delivered and handled before NMI X.
> > >      (NMI handler could handle all NMI source events, including the source
> > >       triggered NMI X)
> > > 5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
> > > In step 5, because the source event caused NMI X has been handled, and NMI X
> > > will not be detected as a second half of back-to-back NMIs, according to
> > > Linux NMI handler, it will be considered as an unknown NMI.
> > I don't think KVM should call exc_nmi() anymore if NMI is unblocked upon
> > SEAMRET.
> 
> IIUC, KVM has to, because the NMI triggered the VM-Exit can't trigger the
> NMI handler to be invoked automatically even if NMI is unblocked upon SEAMRET.

Yep.  The NMI is consumed by the VM-Exit, for all intents and purposes.  KVM must
manually invoke the NMI handler.

Which is how the ordering gets messed up: NMI X+1 arrives before KVM has a chance
to manually invoke the handler for NMI X.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-11-22  3:27   ` Chao Gao
@ 2024-11-27 14:00     ` Sean Christopherson
  2024-11-29 11:39       ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2024-11-27 14:00 UTC (permalink / raw)
  To: Chao Gao
  Cc: Adrian Hunter, pbonzini, kvm, dave.hansen, rick.p.edgecombe,
	kai.huang, reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	dmatlack, isaku.yamahata, nik.borisov, linux-kernel, x86,
	yan.y.zhao, weijiang.yang

On Fri, Nov 22, 2024, Chao Gao wrote:
> >+static bool tdparams_tsx_supported(struct kvm_cpuid2 *cpuid)
> >+{
> >+	const struct kvm_cpuid_entry2 *entry;
> >+	u64 mask;
> >+	u32 ebx;
> >+
> >+	entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x7, 0);
> >+	if (entry)
> >+		ebx = entry->ebx;
> >+	else
> >+		ebx = 0;
> >+
> >+	mask = __feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM);
> >+	return ebx & mask;
> >+}
> >+
> > static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> > 			struct kvm_tdx_init_vm *init_vm)
> > {
> >@@ -1299,6 +1322,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> > 	MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
> > 	MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
> > 
> >+	to_kvm_tdx(kvm)->tsx_supported = tdparams_tsx_supported(cpuid);
> > 	return 0;
> > }
> > 
> >@@ -2272,6 +2296,11 @@ static int __init __tdx_bringup(void)
> > 			return -EIO;
> > 		}
> > 	}
> >+	tdx_uret_tsx_ctrl_slot = kvm_find_user_return_msr(MSR_IA32_TSX_CTRL);
> >+	if (tdx_uret_tsx_ctrl_slot == -1 && boot_cpu_has(X86_FEATURE_MSR_TSX_CTRL)) {
> >+		pr_err("MSR_IA32_TSX_CTRL isn't included by kvm_find_user_return_msr\n");
> >+		return -EIO;
> >+	}
> > 
> > 	/*
> > 	 * Enabling TDX requires enabling hardware virtualization first,
> >diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> >index 48cf0a1abfcc..815ff6bdbc7e 100644
> >--- a/arch/x86/kvm/vmx/tdx.h
> >+++ b/arch/x86/kvm/vmx/tdx.h
> >@@ -29,6 +29,14 @@ struct kvm_tdx {
> > 	u8 nr_tdcs_pages;
> > 	u8 nr_vcpu_tdcx_pages;
> > 
> >+	/*
> >+	 * Used on each TD-exit, see tdx_user_return_msr_update_cache().
> >+	 * TSX_CTRL value on TD exit
> >+	 * - set 0     if guest TSX enabled
> >+	 * - preserved if guest TSX disabled
> >+	 */
> >+	bool tsx_supported;
> 
> Is it possible to drop this boolean and tdparams_tsx_supported()? I think we
> can use the guest_can_use() framework instead.

Yeah, though that optimized handling will soon come for free[*], and I plan on
landing that sooner than TDX, so don't fret too much over this.

[*] https://lore.kernel.org/all/20240517173926.965351-1-seanjc@google.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path
  2024-11-22 14:33       ` Adrian Hunter
@ 2024-11-28  5:56         ` Yan Zhao
  2024-11-28  6:26           ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Yan Zhao @ 2024-11-28  5:56 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Binbin Wu, Xiaoyao Li, pbonzini, seanjc, kvm, dave.hansen,
	rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	dmatlack, isaku.yamahata, nik.borisov, linux-kernel, x86,
	chao.gao, weijiang.yang

On Fri, Nov 22, 2024 at 04:33:27PM +0200, Adrian Hunter wrote:
> On 22/11/24 07:56, Binbin Wu wrote:
> > 
> > 
> > 
> > On 11/22/2024 1:23 PM, Xiaoyao Li wrote:
> > [...]
> >>> +
> >>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> >>> +{
> >>> +    struct vcpu_tdx *tdx = to_tdx(vcpu);
> >>> +
> >>> +    /* TDX exit handle takes care of this error case. */
> >>> +    if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> >>> +        /* Set to avoid collision with EXIT_REASON_EXCEPTION_NMI. */
> >>
> >> It seems the check fits better in tdx_vcpu_pre_run().
> > 
> > Indeed, it's cleaner to move the check to vcpu_pre_run.
> > Then no need to set the value to vp_enter_ret, and the comments are not
> > needed.
> 
> And we can take out the same check in tdx_handle_exit()
> because it won't get there if ->vcpu_pre_run() fails.

And also check for TD_STATE_RUNNABLE in tdx_vcpu_pre_run()?

> > 
> >>
> >> And without the patch of how TDX handles Exit (i.e., how deal with vp_enter_ret), it's hard to review this comment.
> >>
> >>> +        tdx->vp_enter_ret = TDX_SW_ERROR;
> >>> +        return EXIT_FASTPATH_NONE;
> >>> +    }
> >>> +
> >>> +    trace_kvm_entry(vcpu, force_immediate_exit);
> >>> +
> >>> +    tdx_vcpu_enter_exit(vcpu);
> >>> +
> >>> +    vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> >>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
> >>> +
> >>> +    return EXIT_FASTPATH_NONE;
> >>> +}
> >>> +
> > [...]
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path
  2024-11-28  5:56         ` Yan Zhao
@ 2024-11-28  6:26           ` Adrian Hunter
  0 siblings, 0 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-11-28  6:26 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Binbin Wu, Xiaoyao Li, pbonzini, seanjc, kvm, dave.hansen,
	rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	dmatlack, isaku.yamahata, nik.borisov, linux-kernel, x86,
	chao.gao, weijiang.yang

On 28/11/24 07:56, Yan Zhao wrote:
> On Fri, Nov 22, 2024 at 04:33:27PM +0200, Adrian Hunter wrote:
>> On 22/11/24 07:56, Binbin Wu wrote:
>>>
>>>
>>>
>>> On 11/22/2024 1:23 PM, Xiaoyao Li wrote:
>>> [...]
>>>>> +
>>>>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>>>> +{
>>>>> +    struct vcpu_tdx *tdx = to_tdx(vcpu);
>>>>> +
>>>>> +    /* TDX exit handle takes care of this error case. */
>>>>> +    if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
>>>>> +        /* Set to avoid collision with EXIT_REASON_EXCEPTION_NMI. */
>>>>
>>>> It seems the check fits better in tdx_vcpu_pre_run().
>>>
>>> Indeed, it's cleaner to move the check to vcpu_pre_run.
>>> Then no need to set the value to vp_enter_ret, and the comments are not
>>> needed.
>>
>> And we can take out the same check in tdx_handle_exit()
>> because it won't get there if ->vcpu_pre_run() fails.
> 
> And also check for TD_STATE_RUNNABLE in tdx_vcpu_pre_run()?

Yes, let's also do that.

> 
>>>
>>>>
>>>> And without the patch of how TDX handles Exit (i.e., how deal with vp_enter_ret), it's hard to review this comment.
>>>>
>>>>> +        tdx->vp_enter_ret = TDX_SW_ERROR;
>>>>> +        return EXIT_FASTPATH_NONE;
>>>>> +    }
>>>>> +
>>>>> +    trace_kvm_entry(vcpu, force_immediate_exit);
>>>>> +
>>>>> +    tdx_vcpu_enter_exit(vcpu);
>>>>> +
>>>>> +    vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>>>>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
>>>>> +
>>>>> +    return EXIT_FASTPATH_NONE;
>>>>> +}
>>>>> +
>>> [...]
>>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-11-26  2:20       ` Chao Gao
@ 2024-11-28  6:50         ` Adrian Hunter
  2024-12-02  2:52           ` Chao Gao
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-28  6:50 UTC (permalink / raw)
  To: Chao Gao
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 26/11/24 04:20, Chao Gao wrote:
> On Mon, Nov 25, 2024 at 01:10:37PM +0200, Adrian Hunter wrote:
>> On 22/11/24 07:49, Chao Gao wrote:
>>>> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>>>> +
>>>> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>>>> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
>>>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>>>> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>>>> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>>>> +	    kvm_host.xss != (kvm_tdx->xfam &
>>>> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>>>> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
>>>
>>> Should we drop CET/PT from this series? I think they are worth a new
>>> patch/series.
>>
>> This is not really about CET/PT
>>
>> What is happening here is that we are calculating the current
>> MSR_IA32_XSS value based on the TDX Module spec which says the
>> TDX Module sets MSR_IA32_XSS to the XSS bits from XFAM.  The
>> TDX Module does that literally, from TDX Module source code:
>>
>> 	#define XCR0_SUPERVISOR_BIT_MASK            0x0001FD00
>> and
>> 	ia32_wrmsr(IA32_XSS_MSR_ADDR, xfam & XCR0_SUPERVISOR_BIT_MASK);
>>
>> For KVM, rather than:
>>
>> 			kvm_tdx->xfam &
>> 			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
>> 			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
>>
>> it would be more direct to define the bits and enforce them
>> via tdx_get_supported_xfam() e.g.
>>
>> /* 
>> * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>> *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
>> *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
>> */
>> #define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
>> #define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
>> #define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
>>
>> static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
>> {
>> 	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
>>
>> 	/* Ensure features are in the masks */
>> 	val &= TDX_XFAM_MASK;
> 
> Before exposing a feature to TD VMs, both the TDX module and KVM must support
> it. In other words, kvm_tdx->xfam & kvm_caps.supported_xss should yield the
> same result as kvm_tdx->xfam & TDX_XFAM_XSS_MASK. So, to me, the current
> approach and your new proposal are functionally identical.
> 
> I prefer checking against kvm_caps.supported_xss because we don't need to
> update TDX_XFAM_XSS/XCR0_MASK when new user/supervisor xstate bits are added.

Arguably, making the addition of new XFAM bits more visible
is a good thing.

> Note kvm_caps.supported_xss/xcr0 need to be updated for normal VMs anyway.

The way the code is at the moment, that seems too fragile.
At the moment there are direct changes to XFAM bits in
tdx_get_supported_xfam() that are not reflected in supported_xss
and so have to be added to tdx_restore_host_xsave_state() as well.
That is, with the current code, changes to tdx_get_supported_xfam()
can break tdx_restore_host_xsave_state().

The new approach:
	reflects what the TDX Module does
	reflects what the TDX Module base spec says
	makes it harder to break tdx_restore_host_xsave_state()

> 
>>
>> 	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
>> 		return 0;
>>
>> 	val &= td_conf->xfam_fixed0;
>>
>> 	return val;
>> }
>>
>> and then:
>>
>> 	if (static_cpu_has(X86_FEATURE_XSAVE) &&
>> 	    kvm_host.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK))
>> 		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>> 	if (static_cpu_has(X86_FEATURE_XSAVES) &&
>> 	    kvm_host.xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK))
>> 		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
>>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-25 13:40       ` Adrian Hunter
@ 2024-11-28 11:13         ` Adrian Hunter
  2024-12-04 15:58           ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-28 11:13 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 25/11/24 15:40, Adrian Hunter wrote:
> On 22/11/24 18:33, Dave Hansen wrote:
>> On 11/22/24 03:10, Adrian Hunter wrote:
>>> +struct tdh_vp_enter_tdcall {
>>> +	u64	reg_mask	: 32,
>>> +		vm_idx		:  2,
>>> +		reserved_0	: 30;
>>> +	u64	data[TDX_ERR_DATA_PART_2];
>>> +	u64	fn;	/* Non-zero for hypercalls, zero otherwise */
>>> +	u64	subfn;
>>> +	union {
>>> +		struct tdh_vp_enter_vmcall 		vmcall;
>>> +		struct tdh_vp_enter_gettdvmcallinfo	gettdvmcallinfo;
>>> +		struct tdh_vp_enter_mapgpa		mapgpa;
>>> +		struct tdh_vp_enter_getquote		getquote;
>>> +		struct tdh_vp_enter_reportfatalerror	reportfatalerror;
>>> +		struct tdh_vp_enter_cpuid		cpuid;
>>> +		struct tdh_vp_enter_mmio		mmio;
>>> +		struct tdh_vp_enter_hlt			hlt;
>>> +		struct tdh_vp_enter_io			io;
>>> +		struct tdh_vp_enter_rd			rd;
>>> +		struct tdh_vp_enter_wr			wr;
>>> +	};
>>> +};
>>
>> Let's say someone declares this:
>>
>> struct tdh_vp_enter_mmio {
>> 	u64	size;
>> 	u64	mmio_addr;
>> 	u64	direction;
>> 	u64	value;
>> };
>>
>> How long is that going to take you to debug?
> 
> When adding a new hardware definition, it would be sensible
> to check the hardware definition first before checking anything
> else.
> 
> However, to stop existing members being accidentally moved,
> could add:
> 
> #define CHECK_OFFSETS_EQ(reg, member) \
> 	BUILD_BUG_ON(offsetof(struct tdx_module_args, reg) != offsetof(union tdh_vp_enter_args, member));
> 
> 	CHECK_OFFSETS_EQ(r12, tdcall.mmio.size);
> 	CHECK_OFFSETS_EQ(r13, tdcall.mmio.direction);
> 	CHECK_OFFSETS_EQ(r14, tdcall.mmio.mmio_addr);
> 	CHECK_OFFSETS_EQ(r15, tdcall.mmio.value);
> 

Note, struct tdh_vp_enter_tdcall is an output format.  The tdcall
arguments come directly from the guest with no validation by the
TDX Module.  It could be rubbish, or even malicious rubbish.  The
exit handlers validate the values before using them.

WRT the TDCALL input format (response by the host VMM), 'ret_code'
and 'failed_gpa' could use types other than 'u64', but the other
members are really 'u64'.

/* TDH.VP.ENTER Input Format #2 : Following a previous TDCALL(TDG.VP.VMCALL) */
struct tdh_vp_enter_in {
	u64	__vcpu_handle_and_flags; /* Don't use. tdh_vp_enter() will take care of it */
	u64	unused[3];
	u64	ret_code;
	union {
		u64 gettdvmcallinfo[4];
		struct {
			u64	failed_gpa;
		} mapgpa;
		struct {
			u64	unused;
			u64	eax;
			u64	ebx;
			u64	ecx;
			u64	edx;
		} cpuid;
		/* Value read for IO, MMIO or RDMSR */
		struct {
			u64	value;
		} read;
	};
};

Another different alternative could be to use an opaque structure,
not visible to KVM, and then all accesses to it become helper
functions like:

struct tdx_args;

int tdx_args_get_mmio(struct tdx_args *args,
		      enum tdx_access_size *size,
		      enum tdx_access_dir *direction,
		      gpa_t *addr,
		      u64 *value);

void tdx_args_set_failed_gpa(struct tdx_args *args, gpa_t gpa);
void tdx_args_set_ret_code(struct tdx_args *args, enum tdx_ret_code ret_code);
etc

For the 'get' functions, that would tend to imply the helpers
would do some validation.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-11-27 14:00     ` Sean Christopherson
@ 2024-11-29 11:39       ` Adrian Hunter
  2024-12-02 19:07         ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-11-29 11:39 UTC (permalink / raw)
  To: Sean Christopherson, Chao Gao
  Cc: pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 27/11/24 16:00, Sean Christopherson wrote:
> On Fri, Nov 22, 2024, Chao Gao wrote:
>>> +static bool tdparams_tsx_supported(struct kvm_cpuid2 *cpuid)
>>> +{
>>> +	const struct kvm_cpuid_entry2 *entry;
>>> +	u64 mask;
>>> +	u32 ebx;
>>> +
>>> +	entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x7, 0);
>>> +	if (entry)
>>> +		ebx = entry->ebx;
>>> +	else
>>> +		ebx = 0;
>>> +
>>> +	mask = __feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM);
>>> +	return ebx & mask;
>>> +}
>>> +
>>> static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
>>> 			struct kvm_tdx_init_vm *init_vm)
>>> {
>>> @@ -1299,6 +1322,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
>>> 	MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
>>> 	MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
>>>
>>> +	to_kvm_tdx(kvm)->tsx_supported = tdparams_tsx_supported(cpuid);
>>> 	return 0;
>>> }
>>>
>>> @@ -2272,6 +2296,11 @@ static int __init __tdx_bringup(void)
>>> 			return -EIO;
>>> 		}
>>> 	}
>>> +	tdx_uret_tsx_ctrl_slot = kvm_find_user_return_msr(MSR_IA32_TSX_CTRL);
>>> +	if (tdx_uret_tsx_ctrl_slot == -1 && boot_cpu_has(X86_FEATURE_MSR_TSX_CTRL)) {
>>> +		pr_err("MSR_IA32_TSX_CTRL isn't included by kvm_find_user_return_msr\n");
>>> +		return -EIO;
>>> +	}
>>>
>>> 	/*
>>> 	 * Enabling TDX requires enabling hardware virtualization first,
>>> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>>> index 48cf0a1abfcc..815ff6bdbc7e 100644
>>> --- a/arch/x86/kvm/vmx/tdx.h
>>> +++ b/arch/x86/kvm/vmx/tdx.h
>>> @@ -29,6 +29,14 @@ struct kvm_tdx {
>>> 	u8 nr_tdcs_pages;
>>> 	u8 nr_vcpu_tdcx_pages;
>>>
>>> +	/*
>>> +	 * Used on each TD-exit, see tdx_user_return_msr_update_cache().
>>> +	 * TSX_CTRL value on TD exit
>>> +	 * - set 0     if guest TSX enabled
>>> +	 * - preserved if guest TSX disabled
>>> +	 */
>>> +	bool tsx_supported;
>>
>> Is it possible to drop this boolean and tdparams_tsx_supported()? I think we
>> can use the guest_can_use() framework instead.
> 
> Yeah, though that optimized handling will soon come for free[*], and I plan on
> landing that sooner than TDX, so don't fret too much over this.
> 
> [*] https://lore.kernel.org/all/20240517173926.965351-1-seanjc@google.com

guest_can_use() is per-vcpu whereas we are currently using the
CPUID from TD_PARAMS (as per spec) before there are any VCPU's.
It is a bit of a disconnect so let's keep tsx_supported for now.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-11-28  6:50         ` Adrian Hunter
@ 2024-12-02  2:52           ` Chao Gao
  2024-12-02  6:36             ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Chao Gao @ 2024-12-02  2:52 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

>>> /* 
>>> * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>>> *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
>>> *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)

TILECFG state (bit 17) and TILEDATA state (bit 18) are also user state. Are they
cleared unconditionally?

>>> */
>>> #define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
>>> #define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
>>> #define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-12-02  2:52           ` Chao Gao
@ 2024-12-02  6:36             ` Adrian Hunter
  0 siblings, 0 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-12-02  6:36 UTC (permalink / raw)
  To: Chao Gao
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 2/12/24 04:52, Chao Gao wrote:
>>>> /* 
>>>> * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>>>> *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
>>>> *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
> 
> TILECFG state (bit 17) and TILEDATA state (bit 18) are also user state. Are they
> cleared unconditionally?

Bit 17 and 18 should also be in TDX_XFAM_XCR0_MASK
TDX Module does define them, from TDX Module sources:

	#define XCR0_USER_BIT_MASK                  0x000602FF

Thanks for spotting that!

> 
>>>> */
>>>> #define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
>>>> #define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
>>>> #define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-11-29 11:39       ` Adrian Hunter
@ 2024-12-02 19:07         ` Sean Christopherson
  2024-12-02 19:24           ` Edgecombe, Rick P
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2024-12-02 19:07 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On Fri, Nov 29, 2024, Adrian Hunter wrote:
> On 27/11/24 16:00, Sean Christopherson wrote:
> > On Fri, Nov 22, 2024, Chao Gao wrote:
> >>> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> >>> index 48cf0a1abfcc..815ff6bdbc7e 100644
> >>> --- a/arch/x86/kvm/vmx/tdx.h
> >>> +++ b/arch/x86/kvm/vmx/tdx.h
> >>> @@ -29,6 +29,14 @@ struct kvm_tdx {
> >>> 	u8 nr_tdcs_pages;
> >>> 	u8 nr_vcpu_tdcx_pages;
> >>>
> >>> +	/*
> >>> +	 * Used on each TD-exit, see tdx_user_return_msr_update_cache().
> >>> +	 * TSX_CTRL value on TD exit
> >>> +	 * - set 0     if guest TSX enabled
> >>> +	 * - preserved if guest TSX disabled
> >>> +	 */
> >>> +	bool tsx_supported;
> >>
> >> Is it possible to drop this boolean and tdparams_tsx_supported()? I think we
> >> can use the guest_can_use() framework instead.
> > 
> > Yeah, though that optimized handling will soon come for free[*], and I plan on
> > landing that sooner than TDX, so don't fret too much over this.
> > 
> > [*] https://lore.kernel.org/all/20240517173926.965351-1-seanjc@google.com
> 
> guest_can_use() is per-vcpu whereas we are currently using the
> CPUID from TD_PARAMS (as per spec) before there are any VCPU's.
> It is a bit of a disconnect so let's keep tsx_supported for now.

No, as was agreed upon[*], KVM needs to ensure consistency between what KVM sees
as guest CPUID and what is actually enabled/exposed to the guest.  If there are
no vCPUs, then there's zero reason to snapshot the value in kvm_tdx.  And if there
are vCPUs, then their CPUID info needs to be consistent with respect to TDPARAMS.

 - Don't hardcode fixed/required CPUID values in KVM, use available metadata
   from TDX Module to reject "bad" guest CPUID (or let the TDX module reject?).
   I.e. don't let a guest silently run with a CPUID that diverges from what
   userspace provided.

[*] https://lore.kernel.org/all/20240405165844.1018872-1-seanjc@google.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-02 19:07         ` Sean Christopherson
@ 2024-12-02 19:24           ` Edgecombe, Rick P
  2024-12-03  0:34             ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Edgecombe, Rick P @ 2024-12-02 19:24 UTC (permalink / raw)
  To: Hunter, Adrian, seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Huang, Kai, Zhao, Yan Y,
	dave.hansen@linux.intel.com, dmatlack@google.com, Yang, Weijiang,
	Chatre, Reinette, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Yamahata, Isaku, nik.borisov@suse.com,
	linux-kernel@vger.kernel.org, Gao, Chao,
	tony.lindgren@linux.intel.com, x86@kernel.org

On Mon, 2024-12-02 at 11:07 -0800, Sean Christopherson wrote:
> > guest_can_use() is per-vcpu whereas we are currently using the
> > CPUID from TD_PARAMS (as per spec) before there are any VCPU's.
> > It is a bit of a disconnect so let's keep tsx_supported for now.
> 
> No, as was agreed upon[*], KVM needs to ensure consistency between what KVM
> sees
> as guest CPUID and what is actually enabled/exposed to the guest.  If there
> are
> no vCPUs, then there's zero reason to snapshot the value in kvm_tdx.  And if
> there
> are vCPUs, then their CPUID info needs to be consistent with respect to
> TDPARAMS.

Small point - the last conversation[0] we had on this was to let *userspace*
ensure consistency between KVM's CPUID (i.e. KVM_SET_CPUID2) and the TDX
Module's view. So the configuration goes:
1. Userspace configures per-VM CPU features
2. Userspace gets TDX Module's final per-vCPU version of CPUID configuration via
KVM API
3. Userspace calls KVM_SET_CPUID2 with the merge of TDX Module's version, and
userspace's desired values for KVM "owned" CPUID leads (pv features, etc)

But KVM's knowledge of CPUID bits still remains per-vcpu for TDX in any case.

> 
>  - Don't hardcode fixed/required CPUID values in KVM, use available metadata
>    from TDX Module to reject "bad" guest CPUID (or let the TDX module
> reject?).
>    I.e. don't let a guest silently run with a CPUID that diverges from what
>    userspace provided.

The latest QEMU patches have this fixed bit data hardcoded in QEMU. Then the
long term solution is to make the TDX module return this data. Xiaoyao will post
a proposal on how the TDX module should expose this soon.

> 
> [*] https://lore.kernel.org/all/20240405165844.1018872-1-seanjc@google.com


[0]https://lore.kernel.org/kvm/CABgObfaobJ=G18JO9Jx6-K2mhZ2saVyLY-tHOgab1cJupOe-0Q@mail.gmail.com/


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-02 19:24           ` Edgecombe, Rick P
@ 2024-12-03  0:34             ` Sean Christopherson
  2024-12-03 17:34               ` Edgecombe, Rick P
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2024-12-03  0:34 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Adrian Hunter, kvm@vger.kernel.org, Xiaoyao Li, Kai Huang,
	Yan Y Zhao, dave.hansen@linux.intel.com, dmatlack@google.com,
	Weijiang Yang, Reinette Chatre, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Isaku Yamahata, nik.borisov@suse.com,
	linux-kernel@vger.kernel.org, Chao Gao,
	tony.lindgren@linux.intel.com, x86@kernel.org

On Mon, Dec 02, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-12-02 at 11:07 -0800, Sean Christopherson wrote:
> > > guest_can_use() is per-vcpu whereas we are currently using the
> > > CPUID from TD_PARAMS (as per spec) before there are any VCPU's.
> > > It is a bit of a disconnect so let's keep tsx_supported for now.
> > 
> > No, as was agreed upon[*], KVM needs to ensure consistency between what KVM
> > sees

Rick, fix your MUA to not wrap already :-)

> > as guest CPUID and what is actually enabled/exposed to the guest.  If there
> > are no vCPUs, then there's zero reason to snapshot the value in kvm_tdx. 
> > And if there are vCPUs, then their CPUID info needs to be consistent with
> > respect to TDPARAMS.
> 
> Small point - the last conversation[0] we had on this was to let *userspace*
> ensure consistency between KVM's CPUID (i.e. KVM_SET_CPUID2) and the TDX
> Module's view.

I'm all for that, right up until KVM needs to protect itself again userspace and
flawed TDX architecture.  A relevant comment I made in that thread:

 : If the upgrade breaks a setup because it confuses _KVM_, then I'll care

As it applies here, letting vCPU CPUID and actual guest functionality diverge for
features that KVM cares about _will_ cause problems.

This will be less ugly to handle once kvm_vcpu_arch.cpu_caps is a thing.  KVM
can simply force set/clear bits to match the actual guest functionality that's
hardcoded by the TDX Module or defined by TDPARAMS.

> So the configuration goes:
> 1. Userspace configures per-VM CPU features
> 2. Userspace gets TDX Module's final per-vCPU version of CPUID configuration via
> KVM API
> 3. Userspace calls KVM_SET_CPUID2 with the merge of TDX Module's version, and
> userspace's desired values for KVM "owned" CPUID leads (pv features, etc)
> 
> But KVM's knowledge of CPUID bits still remains per-vcpu for TDX in any case.
> 
> > 
> >  - Don't hardcode fixed/required CPUID values in KVM, use available metadata
> >    from TDX Module to reject "bad" guest CPUID (or let the TDX module reject?).
> >    I.e. don't let a guest silently run with a CPUID that diverges from what
> >    userspace provided.
> 
> The latest QEMU patches have this fixed bit data hardcoded in QEMU. Then the
> long term solution is to make the TDX module return this data. Xiaoyao will post
> a proposal on how the TDX module should expose this soon.

Punting the "merge" to userspace is fine, but KVM still needs to ensure it doesn't
have holes where userspace can attack the kernel by lying about what features the
guest has access to.  And that means forcing bits in kvm_vcpu_arch.cpu_caps;
anything else is just asking for problems.

> > [*] https://lore.kernel.org/all/20240405165844.1018872-1-seanjc@google.com
> 
> 
> [0]https://lore.kernel.org/kvm/CABgObfaobJ=G18JO9Jx6-K2mhZ2saVyLY-tHOgab1cJupOe-0Q@mail.gmail.com/

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-03  0:34             ` Sean Christopherson
@ 2024-12-03 17:34               ` Edgecombe, Rick P
  2024-12-03 19:17                 ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Edgecombe, Rick P @ 2024-12-03 17:34 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Huang, Kai, Zhao, Yan Y,
	dave.hansen@linux.intel.com, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, Gao, Chao, x86@kernel.org

On Mon, 2024-12-02 at 16:34 -0800, Sean Christopherson wrote:
> > Small point - the last conversation[0] we had on this was to let *userspace*
> > ensure consistency between KVM's CPUID (i.e. KVM_SET_CPUID2) and the TDX
> > Module's view.
> 
> I'm all for that, right up until KVM needs to protect itself again userspace
> and
> flawed TDX architecture.  A relevant comment I made in that thread:
> 
>  : If the upgrade breaks a setup because it confuses _KVM_, then I'll care
> 
> As it applies here, letting vCPU CPUID and actual guest functionality diverge
> for
> features that KVM cares about _will_ cause problems.

Right, just wanted to make sure we don't need to re-open the major design.

> 
> This will be less ugly to handle once kvm_vcpu_arch.cpu_caps is a thing.  KVM
> can simply force set/clear bits to match the actual guest functionality that's
> hardcoded by the TDX Module or defined by TDPARAMS.
> 
> > So the configuration goes:
> > 1. Userspace configures per-VM CPU features
> > 2. Userspace gets TDX Module's final per-vCPU version of CPUID configuration
> > via
> > KVM API
> > 3. Userspace calls KVM_SET_CPUID2 with the merge of TDX Module's version,
> > and
> > userspace's desired values for KVM "owned" CPUID leads (pv features, etc)
> > 
> > But KVM's knowledge of CPUID bits still remains per-vcpu for TDX in any
> > case.
> > 
> > > 
> > >  - Don't hardcode fixed/required CPUID values in KVM, use available
> > > metadata
> > >    from TDX Module to reject "bad" guest CPUID (or let the TDX module
> > > reject?).
> > >    I.e. don't let a guest silently run with a CPUID that diverges from
> > > what
> > >    userspace provided.
> > 
> > The latest QEMU patches have this fixed bit data hardcoded in QEMU. Then the
> > long term solution is to make the TDX module return this data. Xiaoyao will
> > post
> > a proposal on how the TDX module should expose this soon.
> 
> Punting the "merge" to userspace is fine, but KVM still needs to ensure it
> doesn't
> have holes where userspace can attack the kernel by lying about what features
> the
> guest has access to.  And that means forcing bits in kvm_vcpu_arch.cpu_caps;
> anything else is just asking for problems.

Ok, then for now let's just address them on a case-by-case basis for logic that
protects KVM. I'll add to look at using kvm_vcpu_arch.cpu_caps to our future-
things todo list.

I think Adrian is going post a proposal for how to handle this case better.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-03 17:34               ` Edgecombe, Rick P
@ 2024-12-03 19:17                 ` Adrian Hunter
  2024-12-04  1:25                   ` Chao Gao
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-03 19:17 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Huang, Kai, Zhao, Yan Y,
	dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
	Yang, Weijiang, binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, Gao, Chao, x86@kernel.org

On 3/12/24 19:34, Edgecombe, Rick P wrote:
> On Mon, 2024-12-02 at 16:34 -0800, Sean Christopherson wrote:
>>> Small point - the last conversation[0] we had on this was to let *userspace*
>>> ensure consistency between KVM's CPUID (i.e. KVM_SET_CPUID2) and the TDX
>>> Module's view.
>>
>> I'm all for that, right up until KVM needs to protect itself again userspace
>> and
>> flawed TDX architecture.  A relevant comment I made in that thread:
>>
>>  : If the upgrade breaks a setup because it confuses _KVM_, then I'll care
>>
>> As it applies here, letting vCPU CPUID and actual guest functionality diverge
>> for
>> features that KVM cares about _will_ cause problems.
> 
> Right, just wanted to make sure we don't need to re-open the major design.
> 
>>
>> This will be less ugly to handle once kvm_vcpu_arch.cpu_caps is a thing.  KVM
>> can simply force set/clear bits to match the actual guest functionality that's
>> hardcoded by the TDX Module or defined by TDPARAMS.
>>
>>> So the configuration goes:
>>> 1. Userspace configures per-VM CPU features
>>> 2. Userspace gets TDX Module's final per-vCPU version of CPUID configuration
>>> via
>>> KVM API
>>> 3. Userspace calls KVM_SET_CPUID2 with the merge of TDX Module's version,
>>> and
>>> userspace's desired values for KVM "owned" CPUID leads (pv features, etc)
>>>
>>> But KVM's knowledge of CPUID bits still remains per-vcpu for TDX in any
>>> case.
>>>
>>>>
>>>>  - Don't hardcode fixed/required CPUID values in KVM, use available
>>>> metadata
>>>>    from TDX Module to reject "bad" guest CPUID (or let the TDX module
>>>> reject?).
>>>>    I.e. don't let a guest silently run with a CPUID that diverges from
>>>> what
>>>>    userspace provided.
>>>
>>> The latest QEMU patches have this fixed bit data hardcoded in QEMU. Then the
>>> long term solution is to make the TDX module return this data. Xiaoyao will
>>> post
>>> a proposal on how the TDX module should expose this soon.
>>
>> Punting the "merge" to userspace is fine, but KVM still needs to ensure it
>> doesn't
>> have holes where userspace can attack the kernel by lying about what features
>> the
>> guest has access to.  And that means forcing bits in kvm_vcpu_arch.cpu_caps;
>> anything else is just asking for problems.
> 
> Ok, then for now let's just address them on a case-by-case basis for logic that
> protects KVM. I'll add to look at using kvm_vcpu_arch.cpu_caps to our future-
> things todo list.
> 
> I think Adrian is going post a proposal for how to handle this case better.

Perhaps just do without TSX support to start with e.g. drop
this "KVM: TDX: Add TSX_CTRL msr into uret_msrs list" patch
and instead add the following:


From: Adrian Hunter <adrian.hunter@intel.com>
Date: Tue, 3 Dec 2024 08:20:03 +0200
Subject: [PATCH] KVM: TDX: Disable support for TSX and WAITPKG

Support for restoring IA32_TSX_CTRL MSR and IA32_UMWAIT_CONTROL MSR is not
yet implemented, so disable support for TSX and WAITPKG for now.  Clear the
associated CPUID bits returned by KVM_TDX_CAPABILITIES, and return an error
if those bits are set in KVM_TDX_INIT_VM.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 69bb3136076d..947f78dc3429 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -105,6 +105,44 @@ static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits)
 	return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16;
 }
 
+#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
+
+static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
+{
+	return entry->function == 7 && entry->index == 0 &&
+	       (entry->ebx & TDX_FEATURE_TSX);
+}
+
+static void clear_tsx(struct kvm_cpuid_entry2 *entry)
+{
+	entry->ebx &= ~TDX_FEATURE_TSX;
+}
+
+static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
+{
+	return entry->function == 7 && entry->index == 0 &&
+	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
+}
+
+static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
+{
+	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
+}
+
+static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
+{
+	if (has_tsx(entry))
+		clear_tsx(entry);
+
+	if (has_waitpkg(entry))
+		clear_waitpkg(entry);
+}
+
+static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
+{
+	return has_tsx(entry) || has_waitpkg(entry);
+}
+
 #define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
 
 static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
@@ -124,6 +162,8 @@ static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char i
 	/* Work around missing support on old TDX modules */
 	if (entry->function == 0x80000008)
 		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
+
+	tdx_clear_unsupported_cpuid(entry);
 }
 
 static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
@@ -1235,6 +1275,9 @@ static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
 		if (!entry)
 			continue;
 
+		if (tdx_unsupported_cpuid(entry))
+			return -EINVAL;
+
 		copy_cnt++;
 
 		value = &td_params->cpuid_values[i];
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-03 19:17                 ` Adrian Hunter
@ 2024-12-04  1:25                   ` Chao Gao
  2024-12-04  6:18                     ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Chao Gao @ 2024-12-04  1:25 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

>+#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>+
>+static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>+{
>+	return entry->function == 7 && entry->index == 0 &&
>+	       (entry->ebx & TDX_FEATURE_TSX);
>+}
>+
>+static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>+{
>+	entry->ebx &= ~TDX_FEATURE_TSX;
>+}
>+
>+static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>+{
>+	return entry->function == 7 && entry->index == 0 &&
>+	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>+}
>+
>+static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>+{
>+	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>+}
>+
>+static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>+{
>+	if (has_tsx(entry))
>+		clear_tsx(entry);
>+
>+	if (has_waitpkg(entry))
>+		clear_waitpkg(entry);
>+}
>+
>+static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>+{
>+	return has_tsx(entry) || has_waitpkg(entry);
>+}

No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
ensures that unconfigurable bits are not set by userspace.

>+
> #define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
> 
> static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
>@@ -124,6 +162,8 @@ static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char i
> 	/* Work around missing support on old TDX modules */
> 	if (entry->function == 0x80000008)
> 		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
>+
>+	tdx_clear_unsupported_cpuid(entry);
> }
> 
> static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
>@@ -1235,6 +1275,9 @@ static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
> 		if (!entry)
> 			continue;
> 
>+		if (tdx_unsupported_cpuid(entry))
>+			return -EINVAL;
>+
> 		copy_cnt++;
> 
> 		value = &td_params->cpuid_values[i];
>-- 
>2.43.0
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04  1:25                   ` Chao Gao
@ 2024-12-04  6:18                     ` Adrian Hunter
  2024-12-04  6:37                       ` Chao Gao
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-04  6:18 UTC (permalink / raw)
  To: Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 4/12/24 03:25, Chao Gao wrote:
>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>> +
>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>> +{
>> +	return entry->function == 7 && entry->index == 0 &&
>> +	       (entry->ebx & TDX_FEATURE_TSX);
>> +}
>> +
>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>> +{
>> +	entry->ebx &= ~TDX_FEATURE_TSX;
>> +}
>> +
>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>> +{
>> +	return entry->function == 7 && entry->index == 0 &&
>> +	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>> +}
>> +
>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>> +{
>> +	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>> +}
>> +
>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>> +{
>> +	if (has_tsx(entry))
>> +		clear_tsx(entry);
>> +
>> +	if (has_waitpkg(entry))
>> +		clear_waitpkg(entry);
>> +}
>> +
>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>> +{
>> +	return has_tsx(entry) || has_waitpkg(entry);
>> +}
> 
> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
> ensures that unconfigurable bits are not set by userspace.

Aren't they configurable?

> 
>> +
>> #define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
>>
>> static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
>> @@ -124,6 +162,8 @@ static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char i
>> 	/* Work around missing support on old TDX modules */
>> 	if (entry->function == 0x80000008)
>> 		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
>> +
>> +	tdx_clear_unsupported_cpuid(entry);
>> }
>>
>> static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
>> @@ -1235,6 +1275,9 @@ static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
>> 		if (!entry)
>> 			continue;
>>
>> +		if (tdx_unsupported_cpuid(entry))
>> +			return -EINVAL;
>> +
>> 		copy_cnt++;
>>
>> 		value = &td_params->cpuid_values[i];
>> -- 
>> 2.43.0
>>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04  6:18                     ` Adrian Hunter
@ 2024-12-04  6:37                       ` Chao Gao
  2024-12-04  6:57                         ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Chao Gao @ 2024-12-04  6:37 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>On 4/12/24 03:25, Chao Gao wrote:
>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>> +
>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>> +{
>>> +	return entry->function == 7 && entry->index == 0 &&
>>> +	       (entry->ebx & TDX_FEATURE_TSX);
>>> +}
>>> +
>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>> +{
>>> +	entry->ebx &= ~TDX_FEATURE_TSX;
>>> +}
>>> +
>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>> +{
>>> +	return entry->function == 7 && entry->index == 0 &&
>>> +	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>> +}
>>> +
>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>> +{
>>> +	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>> +}
>>> +
>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>> +{
>>> +	if (has_tsx(entry))
>>> +		clear_tsx(entry);
>>> +
>>> +	if (has_waitpkg(entry))
>>> +		clear_waitpkg(entry);
>>> +}
>>> +
>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>> +{
>>> +	return has_tsx(entry) || has_waitpkg(entry);
>>> +}
>> 
>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>> ensures that unconfigurable bits are not set by userspace.
>
>Aren't they configurable?

They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
so they are not configurable from a userspace perspective. Did I miss anything?
KVM should check user inputs against its adjusted configurable bitmap, right?

>
>> 
>>> +
>>> #define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
>>>
>>> static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
>>> @@ -124,6 +162,8 @@ static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char i
>>> 	/* Work around missing support on old TDX modules */
>>> 	if (entry->function == 0x80000008)
>>> 		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
>>> +
>>> +	tdx_clear_unsupported_cpuid(entry);
>>> }
>>>
>>> static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
>>> @@ -1235,6 +1275,9 @@ static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
>>> 		if (!entry)
>>> 			continue;
>>>
>>> +		if (tdx_unsupported_cpuid(entry))
>>> +			return -EINVAL;
>>> +
>>> 		copy_cnt++;
>>>
>>> 		value = &td_params->cpuid_values[i];
>>> -- 
>>> 2.43.0
>>>
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04  6:37                       ` Chao Gao
@ 2024-12-04  6:57                         ` Adrian Hunter
  2024-12-04 11:13                           ` Chao Gao
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-04  6:57 UTC (permalink / raw)
  To: Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 4/12/24 08:37, Chao Gao wrote:
> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>> On 4/12/24 03:25, Chao Gao wrote:
>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>> +
>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>> +{
>>>> +	return entry->function == 7 && entry->index == 0 &&
>>>> +	       (entry->ebx & TDX_FEATURE_TSX);
>>>> +}
>>>> +
>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>> +{
>>>> +	entry->ebx &= ~TDX_FEATURE_TSX;
>>>> +}
>>>> +
>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>> +{
>>>> +	return entry->function == 7 && entry->index == 0 &&
>>>> +	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>> +}
>>>> +
>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>> +{
>>>> +	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>> +}
>>>> +
>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>> +{
>>>> +	if (has_tsx(entry))
>>>> +		clear_tsx(entry);
>>>> +
>>>> +	if (has_waitpkg(entry))
>>>> +		clear_waitpkg(entry);
>>>> +}
>>>> +
>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>> +{
>>>> +	return has_tsx(entry) || has_waitpkg(entry);
>>>> +}
>>>
>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>> ensures that unconfigurable bits are not set by userspace.
>>
>> Aren't they configurable?
> 
> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
> so they are not configurable from a userspace perspective. Did I miss anything?
> KVM should check user inputs against its adjusted configurable bitmap, right?

Maybe I misunderstand but we rely on the TDX module to reject
invalid configuration.  We don't check exactly what is configurable
for the TDX Module.

TSX and WAITPKG are not invalid for the TDX Module, but KVM
must either support them by restoring their MSRs, or disallow
them.  This patch disallows them for now.

> 
>>
>>>
>>>> +
>>>> #define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
>>>>
>>>> static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
>>>> @@ -124,6 +162,8 @@ static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char i
>>>> 	/* Work around missing support on old TDX modules */
>>>> 	if (entry->function == 0x80000008)
>>>> 		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
>>>> +
>>>> +	tdx_clear_unsupported_cpuid(entry);
>>>> }
>>>>
>>>> static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
>>>> @@ -1235,6 +1275,9 @@ static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
>>>> 		if (!entry)
>>>> 			continue;
>>>>
>>>> +		if (tdx_unsupported_cpuid(entry))
>>>> +			return -EINVAL;
>>>> +
>>>> 		copy_cnt++;
>>>>
>>>> 		value = &td_params->cpuid_values[i];
>>>> -- 
>>>> 2.43.0
>>>>
>>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04  6:57                         ` Adrian Hunter
@ 2024-12-04 11:13                           ` Chao Gao
  2024-12-04 11:55                             ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Chao Gao @ 2024-12-04 11:13 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>On 4/12/24 08:37, Chao Gao wrote:
>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>> +
>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>> +{
>>>>> +	return entry->function == 7 && entry->index == 0 &&
>>>>> +	       (entry->ebx & TDX_FEATURE_TSX);
>>>>> +}
>>>>> +
>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>> +{
>>>>> +	entry->ebx &= ~TDX_FEATURE_TSX;
>>>>> +}
>>>>> +
>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>> +{
>>>>> +	return entry->function == 7 && entry->index == 0 &&
>>>>> +	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>> +}
>>>>> +
>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>> +{
>>>>> +	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>> +}
>>>>> +
>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>> +{
>>>>> +	if (has_tsx(entry))
>>>>> +		clear_tsx(entry);
>>>>> +
>>>>> +	if (has_waitpkg(entry))
>>>>> +		clear_waitpkg(entry);
>>>>> +}
>>>>> +
>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>> +{
>>>>> +	return has_tsx(entry) || has_waitpkg(entry);
>>>>> +}
>>>>
>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>> ensures that unconfigurable bits are not set by userspace.
>>>
>>> Aren't they configurable?
>> 
>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>> so they are not configurable from a userspace perspective. Did I miss anything?
>> KVM should check user inputs against its adjusted configurable bitmap, right?
>
>Maybe I misunderstand but we rely on the TDX module to reject
>invalid configuration.  We don't check exactly what is configurable
>for the TDX Module.

Ok, this is what I missed. I thought KVM validated user input and masked
out all unsupported features. sorry for this.

>
>TSX and WAITPKG are not invalid for the TDX Module, but KVM
>must either support them by restoring their MSRs, or disallow
>them.  This patch disallows them for now.

Yes. I agree. what if a new feature (supported by a future TDX module) also
needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
this is not a good design. Current KVM should work with future TDX modules.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04 11:13                           ` Chao Gao
@ 2024-12-04 11:55                             ` Adrian Hunter
  2024-12-04 15:33                               ` Xiaoyao Li
  2024-12-04 23:40                               ` Edgecombe, Rick P
  0 siblings, 2 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-12-04 11:55 UTC (permalink / raw)
  To: Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 4/12/24 13:13, Chao Gao wrote:
> On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>> On 4/12/24 08:37, Chao Gao wrote:
>>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>>> +
>>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>>> +{
>>>>>> +	return entry->function == 7 && entry->index == 0 &&
>>>>>> +	       (entry->ebx & TDX_FEATURE_TSX);
>>>>>> +}
>>>>>> +
>>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>>> +{
>>>>>> +	entry->ebx &= ~TDX_FEATURE_TSX;
>>>>>> +}
>>>>>> +
>>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>>> +{
>>>>>> +	return entry->function == 7 && entry->index == 0 &&
>>>>>> +	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>>> +}
>>>>>> +
>>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>>> +{
>>>>>> +	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>>> +}
>>>>>> +
>>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>>> +{
>>>>>> +	if (has_tsx(entry))
>>>>>> +		clear_tsx(entry);
>>>>>> +
>>>>>> +	if (has_waitpkg(entry))
>>>>>> +		clear_waitpkg(entry);
>>>>>> +}
>>>>>> +
>>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>>> +{
>>>>>> +	return has_tsx(entry) || has_waitpkg(entry);
>>>>>> +}
>>>>>
>>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>>> ensures that unconfigurable bits are not set by userspace.
>>>>
>>>> Aren't they configurable?
>>>
>>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>>> so they are not configurable from a userspace perspective. Did I miss anything?
>>> KVM should check user inputs against its adjusted configurable bitmap, right?
>>
>> Maybe I misunderstand but we rely on the TDX module to reject
>> invalid configuration.  We don't check exactly what is configurable
>> for the TDX Module.
> 
> Ok, this is what I missed. I thought KVM validated user input and masked
> out all unsupported features. sorry for this.
> 
>>
>> TSX and WAITPKG are not invalid for the TDX Module, but KVM
>> must either support them by restoring their MSRs, or disallow
>> them.  This patch disallows them for now.
> 
> Yes. I agree. what if a new feature (supported by a future TDX module) also
> needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
> only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
> this is not a good design. Current KVM should work with future TDX modules.

With respect to CPUID, I gather this kind of thing has been
discussed, such as here:

	https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/

and Rick and Xiaoyao are working on something.

In general, I would expect a new TDX Module would advertise support for
new features, but KVM would have to opt in to use them.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04 11:55                             ` Adrian Hunter
@ 2024-12-04 15:33                               ` Xiaoyao Li
  2024-12-04 23:51                                 ` Edgecombe, Rick P
  2024-12-05 17:31                                 ` Adrian Hunter
  2024-12-04 23:40                               ` Edgecombe, Rick P
  1 sibling, 2 replies; 82+ messages in thread
From: Xiaoyao Li @ 2024-12-04 15:33 UTC (permalink / raw)
  To: Adrian Hunter, Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 12/4/2024 7:55 PM, Adrian Hunter wrote:
> On 4/12/24 13:13, Chao Gao wrote:
>> On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>>> On 4/12/24 08:37, Chao Gao wrote:
>>>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>>>> +
>>>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>>>> +{
>>>>>>> +	return entry->function == 7 && entry->index == 0 &&
>>>>>>> +	       (entry->ebx & TDX_FEATURE_TSX);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>>>> +{
>>>>>>> +	entry->ebx &= ~TDX_FEATURE_TSX;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>>>> +{
>>>>>>> +	return entry->function == 7 && entry->index == 0 &&
>>>>>>> +	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>>>> +{
>>>>>>> +	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>>>> +{
>>>>>>> +	if (has_tsx(entry))
>>>>>>> +		clear_tsx(entry);
>>>>>>> +
>>>>>>> +	if (has_waitpkg(entry))
>>>>>>> +		clear_waitpkg(entry);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>>>> +{
>>>>>>> +	return has_tsx(entry) || has_waitpkg(entry);
>>>>>>> +}
>>>>>>
>>>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>>>> ensures that unconfigurable bits are not set by userspace.
>>>>>
>>>>> Aren't they configurable?
>>>>
>>>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>>>> so they are not configurable from a userspace perspective. Did I miss anything?
>>>> KVM should check user inputs against its adjusted configurable bitmap, right?
>>>
>>> Maybe I misunderstand but we rely on the TDX module to reject
>>> invalid configuration.  We don't check exactly what is configurable
>>> for the TDX Module.
>>
>> Ok, this is what I missed. I thought KVM validated user input and masked
>> out all unsupported features. sorry for this.
>>
>>>
>>> TSX and WAITPKG are not invalid for the TDX Module, but KVM
>>> must either support them by restoring their MSRs, or disallow
>>> them.  This patch disallows them for now.
>>
>> Yes. I agree. what if a new feature (supported by a future TDX module) also
>> needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
>> only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
>> this is not a good design. Current KVM should work with future TDX modules.
> 
> With respect to CPUID, I gather this kind of thing has been
> discussed, such as here:
> 
> 	https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
> 
> and Rick and Xiaoyao are working on something.
> 
> In general, I would expect a new TDX Module would advertise support for
> new features, but KVM would have to opt in to use them.
> 

There were discussion[1] on whether KVM to gatekeep the 
configurable/supported CPUIDs for TDX. I stand by Sean that KVM needs to 
do so.

Regarding KVM opt in the new feature, KVM gatekeeps the CPUID bit that 
can be set by userspace is exactly the behavior of opt-in. i.e., for a 
given KVM, it only allows a CPUID set {S} to be configured by userspace, 
if new TDX module supports new feature X, it needs KVM to opt-in X by 
adding X to {S} so that X is allowed to be configured by userspace.

Besides, I find current interface between KVM and userspace lacks the 
ability to tell userspace what bits are not supported by KVM. 
KVM_TDX_CAPABILITIES.cpuid doesn't work because it represents the 
configurable CPUIDs, not supported CPUIDs (I think we might rename it to 
configurable_cpuid to better reflect its meaning). So userspace has to 
hardcode that TSX and WAITPKG is not support itself.

[1] https://lore.kernel.org/all/ZuM12EFbOXmpHHVQ@google.com/


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-11-28 11:13         ` Adrian Hunter
@ 2024-12-04 15:58           ` Adrian Hunter
  2024-12-11 18:43             ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-04 15:58 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 28/11/24 13:13, Adrian Hunter wrote:
> On 25/11/24 15:40, Adrian Hunter wrote:
>> On 22/11/24 18:33, Dave Hansen wrote:
>>> On 11/22/24 03:10, Adrian Hunter wrote:
>>>> +struct tdh_vp_enter_tdcall {
>>>> +	u64	reg_mask	: 32,
>>>> +		vm_idx		:  2,
>>>> +		reserved_0	: 30;
>>>> +	u64	data[TDX_ERR_DATA_PART_2];
>>>> +	u64	fn;	/* Non-zero for hypercalls, zero otherwise */
>>>> +	u64	subfn;
>>>> +	union {
>>>> +		struct tdh_vp_enter_vmcall 		vmcall;
>>>> +		struct tdh_vp_enter_gettdvmcallinfo	gettdvmcallinfo;
>>>> +		struct tdh_vp_enter_mapgpa		mapgpa;
>>>> +		struct tdh_vp_enter_getquote		getquote;
>>>> +		struct tdh_vp_enter_reportfatalerror	reportfatalerror;
>>>> +		struct tdh_vp_enter_cpuid		cpuid;
>>>> +		struct tdh_vp_enter_mmio		mmio;
>>>> +		struct tdh_vp_enter_hlt			hlt;
>>>> +		struct tdh_vp_enter_io			io;
>>>> +		struct tdh_vp_enter_rd			rd;
>>>> +		struct tdh_vp_enter_wr			wr;
>>>> +	};
>>>> +};
>>>
>>> Let's say someone declares this:
>>>
>>> struct tdh_vp_enter_mmio {
>>> 	u64	size;
>>> 	u64	mmio_addr;
>>> 	u64	direction;
>>> 	u64	value;
>>> };
>>>
>>> How long is that going to take you to debug?
>>
>> When adding a new hardware definition, it would be sensible
>> to check the hardware definition first before checking anything
>> else.
>>
>> However, to stop existing members being accidentally moved,
>> could add:
>>
>> #define CHECK_OFFSETS_EQ(reg, member) \
>> 	BUILD_BUG_ON(offsetof(struct tdx_module_args, reg) != offsetof(union tdh_vp_enter_args, member));
>>
>> 	CHECK_OFFSETS_EQ(r12, tdcall.mmio.size);
>> 	CHECK_OFFSETS_EQ(r13, tdcall.mmio.direction);
>> 	CHECK_OFFSETS_EQ(r14, tdcall.mmio.mmio_addr);
>> 	CHECK_OFFSETS_EQ(r15, tdcall.mmio.value);
>>
> 
> Note, struct tdh_vp_enter_tdcall is an output format.  The tdcall
> arguments come directly from the guest with no validation by the
> TDX Module.  It could be rubbish, or even malicious rubbish.  The
> exit handlers validate the values before using them.
> 
> WRT the TDCALL input format (response by the host VMM), 'ret_code'
> and 'failed_gpa' could use types other than 'u64', but the other
> members are really 'u64'.
> 
> /* TDH.VP.ENTER Input Format #2 : Following a previous TDCALL(TDG.VP.VMCALL) */
> struct tdh_vp_enter_in {
> 	u64	__vcpu_handle_and_flags; /* Don't use. tdh_vp_enter() will take care of it */
> 	u64	unused[3];
> 	u64	ret_code;
> 	union {
> 		u64 gettdvmcallinfo[4];
> 		struct {
> 			u64	failed_gpa;
> 		} mapgpa;
> 		struct {
> 			u64	unused;
> 			u64	eax;
> 			u64	ebx;
> 			u64	ecx;
> 			u64	edx;
> 		} cpuid;
> 		/* Value read for IO, MMIO or RDMSR */
> 		struct {
> 			u64	value;
> 		} read;
> 	};
> };
> 
> Another different alternative could be to use an opaque structure,
> not visible to KVM, and then all accesses to it become helper
> functions like:
> 
> struct tdx_args;
> 
> int tdx_args_get_mmio(struct tdx_args *args,
> 		      enum tdx_access_size *size,
> 		      enum tdx_access_dir *direction,
> 		      gpa_t *addr,
> 		      u64 *value);
> 
> void tdx_args_set_failed_gpa(struct tdx_args *args, gpa_t gpa);
> void tdx_args_set_ret_code(struct tdx_args *args, enum tdx_ret_code ret_code);
> etc
> 
> For the 'get' functions, that would tend to imply the helpers
> would do some validation.
> 

IIRC Dave said something like, if the wrapper doesn't add any
value, then it is just as well not to have it at all.

So that option would be to drop patch "x86/virt/tdx: Add SEAMCALL
wrapper to enter/exit TDX guest" with tdh_vp_enter() and instead
just call __seamcall_saved_ret() directly, noting that:

 - __seamcall_saved_ret() is only used for TDH.VP.ENTER
 - KVM seems likely to be the only code that would ever
 need to use TDH.VP.ENTER


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04 11:55                             ` Adrian Hunter
  2024-12-04 15:33                               ` Xiaoyao Li
@ 2024-12-04 23:40                               ` Edgecombe, Rick P
  1 sibling, 0 replies; 82+ messages in thread
From: Edgecombe, Rick P @ 2024-12-04 23:40 UTC (permalink / raw)
  To: Hunter, Adrian, Gao, Chao
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Huang, Kai, Zhao, Yan Y,
	dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
	Yang, Weijiang, Chatre, Reinette, pbonzini@redhat.com,
	seanjc@google.com, Yamahata, Isaku, nik.borisov@suse.com,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	tony.lindgren@linux.intel.com, x86@kernel.org

On Wed, 2024-12-04 at 13:55 +0200, Adrian Hunter wrote:
> > > > 
> > > > They are cleared from the configurable bitmap by
> > > > tdx_clear_unsupported_cpuid(),
> > > > so they are not configurable from a userspace perspective. Did I miss
> > > > anything?
> > > > KVM should check user inputs against its adjusted configurable bitmap,
> > > > right?
> > > 
> > > Maybe I misunderstand but we rely on the TDX module to reject
> > > invalid configuration.  We don't check exactly what is configurable
> > > for the TDX Module.
> > 
> > Ok, this is what I missed. I thought KVM validated user input and masked
> > out all unsupported features. sorry for this.

This used to be how it behaved, but IIRC Paolo had suggested to simplify it by
letting the TDX module do the rejection. But that was under the assumption there
wasn't any TDX supported CPUID configuration that was harmful to KVM.

We also used to filter which CPUID features that weren't supported by KVM, but
this was also dropped to make things simpler.

> > 
> > > 
> > > TSX and WAITPKG are not invalid for the TDX Module, but KVM
> > > must either support them by restoring their MSRs, or disallow
> > > them.  This patch disallows them for now.
> > 
> > Yes. I agree. what if a new feature (supported by a future TDX module) also
> > needs KVM to restore some MSRs? current KVM will allow it to be exposed
> > (since
> > only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
> > this is not a good design. Current KVM should work with future TDX modules.
> 
> With respect to CPUID, I gather this kind of thing has been
> discussed, such as here:
> 
> 	https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
> 
> and Rick and Xiaoyao are working on something.

This is around fixed 0/1 bits, and just to help userspace understand what
configurations are possible. It isn't intended to filter any bits on the KVM
side.

> 
> In general, I would expect a new TDX Module would advertise support for
> new features, but KVM would have to opt in to use them.

This is true for attributes/xfam, but isn't for CPUID leafs. We used to filter
them in various ways but it was messy. The suggestion was to simplify it. The
current approach is to treat any TDX module changes that break the host like
that as TDX module bugs. See this coverletter for more info on the history:

	https://lore.kernel.org/kvm/20240812224820.34826-1-rick.p.edgecombe@intel.com/

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04 15:33                               ` Xiaoyao Li
@ 2024-12-04 23:51                                 ` Edgecombe, Rick P
  2024-12-05 17:31                                 ` Adrian Hunter
  1 sibling, 0 replies; 82+ messages in thread
From: Edgecombe, Rick P @ 2024-12-04 23:51 UTC (permalink / raw)
  To: Li, Xiaoyao, Hunter, Adrian, Gao, Chao
  Cc: Yang, Weijiang, x86@kernel.org, seanjc@google.com,
	dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
	Chatre, Reinette, Huang, Kai, linux-kernel@vger.kernel.org,
	Zhao, Yan Y, tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	dmatlack@google.com, Yamahata, Isaku, pbonzini@redhat.com,
	nik.borisov@suse.com

On Wed, 2024-12-04 at 23:33 +0800, Xiaoyao Li wrote:
> > 
> 
> There were discussion[1] on whether KVM to gatekeep the 
> configurable/supported CPUIDs for TDX. I stand by Sean that KVM needs to 
> do so.
> 
> Regarding KVM opt in the new feature, KVM gatekeeps the CPUID bit that 
> can be set by userspace is exactly the behavior of opt-in. i.e., for a 
> given KVM, it only allows a CPUID set {S} to be configured by userspace, 
> if new TDX module supports new feature X, it needs KVM to opt-in X by 
> adding X to {S} so that X is allowed to be configured by userspace.
> 
> Besides, I find current interface between KVM and userspace lacks the 
> ability to tell userspace what bits are not supported by KVM. 
> KVM_TDX_CAPABILITIES.cpuid doesn't work because it represents the 
> configurable CPUIDs, not supported CPUIDs (I think we might rename it to 
> configurable_cpuid to better reflect its meaning). So userspace has to 
> hardcode that TSX and WAITPKG is not support itself.
> 
> [1] https://lore.kernel.org/all/ZuM12EFbOXmpHHVQ@google.com/

I mean, this is kind of a good example for why we might want to go back to
filtering CPUID bits. It kind of depends on how KVM wants to treat the TDX
module. Hostile like userspace, or more trusting. KVM maintaining code to let
the TDX module evolve unsafely would be unfortunate though.

If we keep the current approach, it probably would be educational to highlight
this example to the TDX module team. "Don't do this or you will have a bug on
your side".

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-04 15:33                               ` Xiaoyao Li
  2024-12-04 23:51                                 ` Edgecombe, Rick P
@ 2024-12-05 17:31                                 ` Adrian Hunter
  2024-12-06  3:37                                   ` Xiaoyao Li
  1 sibling, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-05 17:31 UTC (permalink / raw)
  To: Xiaoyao Li, Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 4/12/24 17:33, Xiaoyao Li wrote:
> On 12/4/2024 7:55 PM, Adrian Hunter wrote:
>> On 4/12/24 13:13, Chao Gao wrote:
>>> On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>>>> On 4/12/24 08:37, Chao Gao wrote:
>>>>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>>>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>>>>> +
>>>>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>>>>> +{
>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>> +           (entry->ebx & TDX_FEATURE_TSX);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>>>>> +{
>>>>>>>> +    entry->ebx &= ~TDX_FEATURE_TSX;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>>>>> +{
>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>> +           (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>>>>> +{
>>>>>>>> +    entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>>>>> +{
>>>>>>>> +    if (has_tsx(entry))
>>>>>>>> +        clear_tsx(entry);
>>>>>>>> +
>>>>>>>> +    if (has_waitpkg(entry))
>>>>>>>> +        clear_waitpkg(entry);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>>>>> +{
>>>>>>>> +    return has_tsx(entry) || has_waitpkg(entry);
>>>>>>>> +}
>>>>>>>
>>>>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>>>>> ensures that unconfigurable bits are not set by userspace.
>>>>>>
>>>>>> Aren't they configurable?
>>>>>
>>>>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>>>>> so they are not configurable from a userspace perspective. Did I miss anything?
>>>>> KVM should check user inputs against its adjusted configurable bitmap, right?
>>>>
>>>> Maybe I misunderstand but we rely on the TDX module to reject
>>>> invalid configuration.  We don't check exactly what is configurable
>>>> for the TDX Module.
>>>
>>> Ok, this is what I missed. I thought KVM validated user input and masked
>>> out all unsupported features. sorry for this.
>>>
>>>>
>>>> TSX and WAITPKG are not invalid for the TDX Module, but KVM
>>>> must either support them by restoring their MSRs, or disallow
>>>> them.  This patch disallows them for now.
>>>
>>> Yes. I agree. what if a new feature (supported by a future TDX module) also
>>> needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
>>> only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
>>> this is not a good design. Current KVM should work with future TDX modules.
>>
>> With respect to CPUID, I gather this kind of thing has been
>> discussed, such as here:
>>
>>     https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
>>
>> and Rick and Xiaoyao are working on something.
>>
>> In general, I would expect a new TDX Module would advertise support for
>> new features, but KVM would have to opt in to use them.
>>
> 
> There were discussion[1] on whether KVM to gatekeep the configurable/supported CPUIDs for TDX. I stand by Sean that KVM needs to do so.
> 
> Regarding KVM opt in the new feature, KVM gatekeeps the CPUID bit that can be set by userspace is exactly the behavior of opt-in. i.e., for a given KVM, it only allows a CPUID set {S} to be configured by userspace, if new TDX module supports new feature X, it needs KVM to opt-in X by adding X to {S} so that X is allowed to be configured by userspace.
> 
> Besides, I find current interface between KVM and userspace lacks the ability to tell userspace what bits are not supported by KVM. KVM_TDX_CAPABILITIES.cpuid doesn't work because it represents the configurable CPUIDs, not supported CPUIDs (I think we might rename it to configurable_cpuid to better reflect its meaning). So userspace has to hardcode that TSX and WAITPKG is not support itself.

I don't follow why hardcoding would be necessary.

If the leaf is represented in KVM_TDX_CAPABILITIES.cpuid, and
the bits are 0 there, why would userspace try to set them to 1?

> 
> [1] https://lore.kernel.org/all/ZuM12EFbOXmpHHVQ@google.com/
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-05 17:31                                 ` Adrian Hunter
@ 2024-12-06  3:37                                   ` Xiaoyao Li
  2024-12-06 14:40                                     ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Xiaoyao Li @ 2024-12-06  3:37 UTC (permalink / raw)
  To: Adrian Hunter, Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 12/6/2024 1:31 AM, Adrian Hunter wrote:
> On 4/12/24 17:33, Xiaoyao Li wrote:
>> On 12/4/2024 7:55 PM, Adrian Hunter wrote:
>>> On 4/12/24 13:13, Chao Gao wrote:
>>>> On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>>>>> On 4/12/24 08:37, Chao Gao wrote:
>>>>>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>>>>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>>>>>> +
>>>>>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>> +{
>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>> +           (entry->ebx & TDX_FEATURE_TSX);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>>>>>> +{
>>>>>>>>> +    entry->ebx &= ~TDX_FEATURE_TSX;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>> +{
>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>> +           (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>>>>>> +{
>>>>>>>>> +    entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>>>>>> +{
>>>>>>>>> +    if (has_tsx(entry))
>>>>>>>>> +        clear_tsx(entry);
>>>>>>>>> +
>>>>>>>>> +    if (has_waitpkg(entry))
>>>>>>>>> +        clear_waitpkg(entry);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>> +{
>>>>>>>>> +    return has_tsx(entry) || has_waitpkg(entry);
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>>>>>> ensures that unconfigurable bits are not set by userspace.
>>>>>>>
>>>>>>> Aren't they configurable?
>>>>>>
>>>>>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>>>>>> so they are not configurable from a userspace perspective. Did I miss anything?
>>>>>> KVM should check user inputs against its adjusted configurable bitmap, right?
>>>>>
>>>>> Maybe I misunderstand but we rely on the TDX module to reject
>>>>> invalid configuration.  We don't check exactly what is configurable
>>>>> for the TDX Module.
>>>>
>>>> Ok, this is what I missed. I thought KVM validated user input and masked
>>>> out all unsupported features. sorry for this.
>>>>
>>>>>
>>>>> TSX and WAITPKG are not invalid for the TDX Module, but KVM
>>>>> must either support them by restoring their MSRs, or disallow
>>>>> them.  This patch disallows them for now.
>>>>
>>>> Yes. I agree. what if a new feature (supported by a future TDX module) also
>>>> needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
>>>> only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
>>>> this is not a good design. Current KVM should work with future TDX modules.
>>>
>>> With respect to CPUID, I gather this kind of thing has been
>>> discussed, such as here:
>>>
>>>      https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
>>>
>>> and Rick and Xiaoyao are working on something.
>>>
>>> In general, I would expect a new TDX Module would advertise support for
>>> new features, but KVM would have to opt in to use them.
>>>
>>
>> There were discussion[1] on whether KVM to gatekeep the configurable/supported CPUIDs for TDX. I stand by Sean that KVM needs to do so.
>>
>> Regarding KVM opt in the new feature, KVM gatekeeps the CPUID bit that can be set by userspace is exactly the behavior of opt-in. i.e., for a given KVM, it only allows a CPUID set {S} to be configured by userspace, if new TDX module supports new feature X, it needs KVM to opt-in X by adding X to {S} so that X is allowed to be configured by userspace.
>>
>> Besides, I find current interface between KVM and userspace lacks the ability to tell userspace what bits are not supported by KVM. KVM_TDX_CAPABILITIES.cpuid doesn't work because it represents the configurable CPUIDs, not supported CPUIDs (I think we might rename it to configurable_cpuid to better reflect its meaning). So userspace has to hardcode that TSX and WAITPKG is not support itself.
> 
> I don't follow why hardcoding would be necessary.
> 
> If the leaf is represented in KVM_TDX_CAPABILITIES.cpuid, and
> the bits are 0 there, why would userspace try to set them to 1?

Userspace doesn't set the bit to 1 in kvm_tdx_init_vm.cpuid, doesn't 
mean userspace wants the bit to be 0.

Note, KVM_TDX_CAPABILITIES.cpuid reports the configurable bits. The 
value 0 of a bit in KVM_TDX_CAPABILITIES.cpuid means the bit is not 
configurable, not means the bit is unsupported.

For kvm_tdx_init_vm.cpuid,
  - if the corresponding bit is reported as 1 in 
KVM_TDX_CAPABILITIES.cpuid, then a value 0 in kvm_tdx_init_vm.cpuid 
means userspace wants to configure it as 0.
  - if the corresponding bit is reported as 0 in 
KVM_TDX_CAPABILITIES.cpuid, then userspace has to pass a value 0 in 
kvm_tdx_init_vm.cpuid. But it doesn't mean the value of the bit will be 0.

e.g., X2APIC bit is 0 in KVM_TDX_CAPABILITIES.cpuid, and it's also 0 in 
kvm_tdx_init_vm.cpuid, but TD guest sees a value of 1. In the view of 
QEMU, it maintains the bit of X2APIC as 1, and QEMU filters X2APIC bit 
when calling KVM_TDX_INIT_VM because X2APIC is not configurable.

So when it comes to TSX and WAITPKG, QEMU also needs an interface to be 
informed that they are unsupported. Without the interface of fixed0 bits 
reported by KVM, QEMU needs to hardcode itself like [1]. The problem of 
hardcode is that it will conflict when future KVM allows them to be 
configurable.

In the future, if we have interface from KVM to report the fixed0 and 
fixed1 bit (on top of the proposal [2]), userspace can drop the 
hardcoded one it maintains. At that time, KVM can ensure no conflict by 
removing the bits from fixed0/1 array when allowing them to be 
configurable.

[1] 
https://lore.kernel.org/qemu-devel/20241105062408.3533704-49-xiaoyao.li@intel.com/
[2] 
https://lore.kernel.org/all/43b26df1-4c27-41ff-a482-e258f872cc31@intel.com/

>>
>> [1] https://lore.kernel.org/all/ZuM12EFbOXmpHHVQ@google.com/
>>
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-06  3:37                                   ` Xiaoyao Li
@ 2024-12-06 14:40                                     ` Adrian Hunter
  2024-12-09  2:46                                       ` Xiaoyao Li
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-06 14:40 UTC (permalink / raw)
  To: Xiaoyao Li, Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 6/12/24 05:37, Xiaoyao Li wrote:
> On 12/6/2024 1:31 AM, Adrian Hunter wrote:
>> On 4/12/24 17:33, Xiaoyao Li wrote:
>>> On 12/4/2024 7:55 PM, Adrian Hunter wrote:
>>>> On 4/12/24 13:13, Chao Gao wrote:
>>>>> On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>>>>>> On 4/12/24 08:37, Chao Gao wrote:
>>>>>>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>>>>>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>>>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>>>>>>> +
>>>>>>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>> +{
>>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>>> +           (entry->ebx & TDX_FEATURE_TSX);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>> +{
>>>>>>>>>> +    entry->ebx &= ~TDX_FEATURE_TSX;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>> +{
>>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>>> +           (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>> +{
>>>>>>>>>> +    entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>> +{
>>>>>>>>>> +    if (has_tsx(entry))
>>>>>>>>>> +        clear_tsx(entry);
>>>>>>>>>> +
>>>>>>>>>> +    if (has_waitpkg(entry))
>>>>>>>>>> +        clear_waitpkg(entry);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>> +{
>>>>>>>>>> +    return has_tsx(entry) || has_waitpkg(entry);
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>>>>>>> ensures that unconfigurable bits are not set by userspace.
>>>>>>>>
>>>>>>>> Aren't they configurable?
>>>>>>>
>>>>>>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>>>>>>> so they are not configurable from a userspace perspective. Did I miss anything?
>>>>>>> KVM should check user inputs against its adjusted configurable bitmap, right?
>>>>>>
>>>>>> Maybe I misunderstand but we rely on the TDX module to reject
>>>>>> invalid configuration.  We don't check exactly what is configurable
>>>>>> for the TDX Module.
>>>>>
>>>>> Ok, this is what I missed. I thought KVM validated user input and masked
>>>>> out all unsupported features. sorry for this.
>>>>>
>>>>>>
>>>>>> TSX and WAITPKG are not invalid for the TDX Module, but KVM
>>>>>> must either support them by restoring their MSRs, or disallow
>>>>>> them.  This patch disallows them for now.
>>>>>
>>>>> Yes. I agree. what if a new feature (supported by a future TDX module) also
>>>>> needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
>>>>> only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
>>>>> this is not a good design. Current KVM should work with future TDX modules.
>>>>
>>>> With respect to CPUID, I gather this kind of thing has been
>>>> discussed, such as here:
>>>>
>>>>      https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
>>>>
>>>> and Rick and Xiaoyao are working on something.
>>>>
>>>> In general, I would expect a new TDX Module would advertise support for
>>>> new features, but KVM would have to opt in to use them.
>>>>
>>>
>>> There were discussion[1] on whether KVM to gatekeep the configurable/supported CPUIDs for TDX. I stand by Sean that KVM needs to do so.
>>>
>>> Regarding KVM opt in the new feature, KVM gatekeeps the CPUID bit that can be set by userspace is exactly the behavior of opt-in. i.e., for a given KVM, it only allows a CPUID set {S} to be configured by userspace, if new TDX module supports new feature X, it needs KVM to opt-in X by adding X to {S} so that X is allowed to be configured by userspace.
>>>
>>> Besides, I find current interface between KVM and userspace lacks the ability to tell userspace what bits are not supported by KVM. KVM_TDX_CAPABILITIES.cpuid doesn't work because it represents the configurable CPUIDs, not supported CPUIDs (I think we might rename it to configurable_cpuid to better reflect its meaning). So userspace has to hardcode that TSX and WAITPKG is not support itself.
>>
>> I don't follow why hardcoding would be necessary.
>>
>> If the leaf is represented in KVM_TDX_CAPABILITIES.cpuid, and
>> the bits are 0 there, why would userspace try to set them to 1?
> 
> Userspace doesn't set the bit to 1 in kvm_tdx_init_vm.cpuid, doesn't mean userspace wants the bit to be 0.
> 
> Note, KVM_TDX_CAPABILITIES.cpuid reports the configurable bits. The value 0 of a bit in KVM_TDX_CAPABILITIES.cpuid means the bit is not configurable, not means the bit is unsupported.

For features configurable by CPUID like TSX and WAITPKG,
a value of 0 does mean unsupported, because the value
has to be 1 to enable the feature.

From the TDX Module Base spec:

	11.18. Transactional Synchronization Extensions (TSX)
	The host VMM can enable TSX for a TD by configuring the following CPUID bits as enabled in the TD_PARAMS input to
	TDH.MNG.INIT:
	- CPUID(7,0).EBX[4] (HLE)
	- CPUID(7,0).EBX[11] (RTM)
	etc

	11.19.4. WAITPKG: TPAUSE, UMONITOR and UMWAIT Instructions
	The host VMM may allow guest TDs to use the TPAUSE, UMONITOR and UMWAIT instructions, if the CPU supports them,
	by configuring the virtual value of CPUID(7,0).ECX[5] (WAITPKG) to 1 using the CPUID configuration table which is part
	the TD_PARAMS input to TDH.MNG.INIT. Enabling CPUID(7,0).ECX[5] also enables TD access to IA32_UMWAIT_CONTROL
	(MSR 0xE1).
	If not allowed, then TD execution of TPAUSE, UMONITOR or UMWAIT results in a #UD, and access to
	IA32_UMWAIT_CONTROL results in a #GP(0).

> For kvm_tdx_init_vm.cpuid,
>  - if the corresponding bit is reported as 1 in KVM_TDX_CAPABILITIES.cpuid, then a value 0 in kvm_tdx_init_vm.cpuid means userspace wants to configure it as 0.
>  - if the corresponding bit is reported as 0 in KVM_TDX_CAPABILITIES.cpuid, then userspace has to pass a value 0 in kvm_tdx_init_vm.cpuid. But it doesn't mean the value of the bit will be 0.
> 
> e.g., X2APIC bit is 0 in KVM_TDX_CAPABILITIES.cpuid, and it's also 0 in kvm_tdx_init_vm.cpuid, but TD guest sees a value of 1. In the view of QEMU, it maintains the bit of X2APIC as 1, and QEMU filters X2APIC bit when calling KVM_TDX_INIT_VM because X2APIC is not configurable.
> 
> So when it comes to TSX and WAITPKG, QEMU also needs an interface to be informed that they are unsupported. Without the interface of fixed0 bits reported by KVM, QEMU needs to hardcode itself like [1]. The problem of hardcode is that it will conflict when future KVM allows them to be configurable.

So TSX and WAITPKG support should be based on KVM_TDX_CAPABILITIES.cpuid, and not hardcoded.

> 
> In the future, if we have interface from KVM to report the fixed0 and fixed1 bit (on top of the proposal [2]), userspace can drop the hardcoded one it maintains. At that time, KVM can ensure no conflict by removing the bits from fixed0/1 array when allowing them to be configurable.
> 
> [1] https://lore.kernel.org/qemu-devel/20241105062408.3533704-49-xiaoyao.li@intel.com/
> [2] https://lore.kernel.org/all/43b26df1-4c27-41ff-a482-e258f872cc31@intel.com/
> 
>>>
>>> [1] https://lore.kernel.org/all/ZuM12EFbOXmpHHVQ@google.com/
>>>
>>
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-06 14:40                                     ` Adrian Hunter
@ 2024-12-09  2:46                                       ` Xiaoyao Li
  2024-12-09  7:08                                         ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Xiaoyao Li @ 2024-12-09  2:46 UTC (permalink / raw)
  To: Adrian Hunter, Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 12/6/2024 10:40 PM, Adrian Hunter wrote:
> On 6/12/24 05:37, Xiaoyao Li wrote:
>> On 12/6/2024 1:31 AM, Adrian Hunter wrote:
>>> On 4/12/24 17:33, Xiaoyao Li wrote:
>>>> On 12/4/2024 7:55 PM, Adrian Hunter wrote:
>>>>> On 4/12/24 13:13, Chao Gao wrote:
>>>>>> On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>>>>>>> On 4/12/24 08:37, Chao Gao wrote:
>>>>>>>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>>>>>>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>>>>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>>>>>>>> +
>>>>>>>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>> +{
>>>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>>>> +           (entry->ebx & TDX_FEATURE_TSX);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>> +{
>>>>>>>>>>> +    entry->ebx &= ~TDX_FEATURE_TSX;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>> +{
>>>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>>>> +           (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>> +{
>>>>>>>>>>> +    entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>> +{
>>>>>>>>>>> +    if (has_tsx(entry))
>>>>>>>>>>> +        clear_tsx(entry);
>>>>>>>>>>> +
>>>>>>>>>>> +    if (has_waitpkg(entry))
>>>>>>>>>>> +        clear_waitpkg(entry);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>> +{
>>>>>>>>>>> +    return has_tsx(entry) || has_waitpkg(entry);
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>>>>>>>> ensures that unconfigurable bits are not set by userspace.
>>>>>>>>>
>>>>>>>>> Aren't they configurable?
>>>>>>>>
>>>>>>>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>>>>>>>> so they are not configurable from a userspace perspective. Did I miss anything?
>>>>>>>> KVM should check user inputs against its adjusted configurable bitmap, right?
>>>>>>>
>>>>>>> Maybe I misunderstand but we rely on the TDX module to reject
>>>>>>> invalid configuration.  We don't check exactly what is configurable
>>>>>>> for the TDX Module.
>>>>>>
>>>>>> Ok, this is what I missed. I thought KVM validated user input and masked
>>>>>> out all unsupported features. sorry for this.
>>>>>>
>>>>>>>
>>>>>>> TSX and WAITPKG are not invalid for the TDX Module, but KVM
>>>>>>> must either support them by restoring their MSRs, or disallow
>>>>>>> them.  This patch disallows them for now.
>>>>>>
>>>>>> Yes. I agree. what if a new feature (supported by a future TDX module) also
>>>>>> needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
>>>>>> only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
>>>>>> this is not a good design. Current KVM should work with future TDX modules.
>>>>>
>>>>> With respect to CPUID, I gather this kind of thing has been
>>>>> discussed, such as here:
>>>>>
>>>>>       https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
>>>>>
>>>>> and Rick and Xiaoyao are working on something.
>>>>>
>>>>> In general, I would expect a new TDX Module would advertise support for
>>>>> new features, but KVM would have to opt in to use them.
>>>>>
>>>>
>>>> There were discussion[1] on whether KVM to gatekeep the configurable/supported CPUIDs for TDX. I stand by Sean that KVM needs to do so.
>>>>
>>>> Regarding KVM opt in the new feature, KVM gatekeeps the CPUID bit that can be set by userspace is exactly the behavior of opt-in. i.e., for a given KVM, it only allows a CPUID set {S} to be configured by userspace, if new TDX module supports new feature X, it needs KVM to opt-in X by adding X to {S} so that X is allowed to be configured by userspace.
>>>>
>>>> Besides, I find current interface between KVM and userspace lacks the ability to tell userspace what bits are not supported by KVM. KVM_TDX_CAPABILITIES.cpuid doesn't work because it represents the configurable CPUIDs, not supported CPUIDs (I think we might rename it to configurable_cpuid to better reflect its meaning). So userspace has to hardcode that TSX and WAITPKG is not support itself.
>>>
>>> I don't follow why hardcoding would be necessary.
>>>
>>> If the leaf is represented in KVM_TDX_CAPABILITIES.cpuid, and
>>> the bits are 0 there, why would userspace try to set them to 1?
>>
>> Userspace doesn't set the bit to 1 in kvm_tdx_init_vm.cpuid, doesn't mean userspace wants the bit to be 0.
>>
>> Note, KVM_TDX_CAPABILITIES.cpuid reports the configurable bits. The value 0 of a bit in KVM_TDX_CAPABILITIES.cpuid means the bit is not configurable, not means the bit is unsupported.
> 
> For features configurable by CPUID like TSX and WAITPKG,
> a value of 0 does mean unsupported, because the value
> has to be 1 to enable the feature.
> 
>  From the TDX Module Base spec:
> 
> 	11.18. Transactional Synchronization Extensions (TSX)
> 	The host VMM can enable TSX for a TD by configuring the following CPUID bits as enabled in the TD_PARAMS input to
> 	TDH.MNG.INIT:
> 	- CPUID(7,0).EBX[4] (HLE)
> 	- CPUID(7,0).EBX[11] (RTM)
> 	etc
> 
> 	11.19.4. WAITPKG: TPAUSE, UMONITOR and UMWAIT Instructions
> 	The host VMM may allow guest TDs to use the TPAUSE, UMONITOR and UMWAIT instructions, if the CPU supports them,
> 	by configuring the virtual value of CPUID(7,0).ECX[5] (WAITPKG) to 1 using the CPUID configuration table which is part
> 	the TD_PARAMS input to TDH.MNG.INIT. Enabling CPUID(7,0).ECX[5] also enables TD access to IA32_UMWAIT_CONTROL
> 	(MSR 0xE1).
> 	If not allowed, then TD execution of TPAUSE, UMONITOR or UMWAIT results in a #UD, and access to
> 	IA32_UMWAIT_CONTROL results in a #GP(0).
> 
>> For kvm_tdx_init_vm.cpuid,
>>   - if the corresponding bit is reported as 1 in KVM_TDX_CAPABILITIES.cpuid, then a value 0 in kvm_tdx_init_vm.cpuid means userspace wants to configure it as 0.
>>   - if the corresponding bit is reported as 0 in KVM_TDX_CAPABILITIES.cpuid, then userspace has to pass a value 0 in kvm_tdx_init_vm.cpuid. But it doesn't mean the value of the bit will be 0.
>>
>> e.g., X2APIC bit is 0 in KVM_TDX_CAPABILITIES.cpuid, and it's also 0 in kvm_tdx_init_vm.cpuid, but TD guest sees a value of 1. In the view of QEMU, it maintains the bit of X2APIC as 1, and QEMU filters X2APIC bit when calling KVM_TDX_INIT_VM because X2APIC is not configurable.
>>
>> So when it comes to TSX and WAITPKG, QEMU also needs an interface to be informed that they are unsupported. Without the interface of fixed0 bits reported by KVM, QEMU needs to hardcode itself like [1]. The problem of hardcode is that it will conflict when future KVM allows them to be configurable.
> 
> So TSX and WAITPKG support should be based on KVM_TDX_CAPABILITIES.cpuid, and not hardcoded.

This requires Userspace to have the knowledge that "TSX and WAITPKG is 
supposed to be configurable and reported as 1 in 
KVM_TDX_CAPABILITIES.cpuid in normal case". And if userspace gets a 
value 0 of TSX and Waitpkg in KVM_TDX_CAPABILITIES.cpuid, it means 
either KVM doesn't support/allow it or the underlying TDX module doesn't.

I think it's an implicit form of hardcode. From the perspective of 
userspace, I would like to avoid all the special handling.

>>
>> In the future, if we have interface from KVM to report the fixed0 and fixed1 bit (on top of the proposal [2]), userspace can drop the hardcoded one it maintains. At that time, KVM can ensure no conflict by removing the bits from fixed0/1 array when allowing them to be configurable.
>>
>> [1] https://lore.kernel.org/qemu-devel/20241105062408.3533704-49-xiaoyao.li@intel.com/
>> [2] https://lore.kernel.org/all/43b26df1-4c27-41ff-a482-e258f872cc31@intel.com/
>>
>>>>
>>>> [1] https://lore.kernel.org/all/ZuM12EFbOXmpHHVQ@google.com/
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-09  2:46                                       ` Xiaoyao Li
@ 2024-12-09  7:08                                         ` Adrian Hunter
  2024-12-10  2:45                                           ` Xiaoyao Li
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-09  7:08 UTC (permalink / raw)
  To: Xiaoyao Li, Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 9/12/24 04:46, Xiaoyao Li wrote:
> On 12/6/2024 10:40 PM, Adrian Hunter wrote:
>> On 6/12/24 05:37, Xiaoyao Li wrote:
>>> On 12/6/2024 1:31 AM, Adrian Hunter wrote:
>>>> On 4/12/24 17:33, Xiaoyao Li wrote:
>>>>> On 12/4/2024 7:55 PM, Adrian Hunter wrote:
>>>>>> On 4/12/24 13:13, Chao Gao wrote:
>>>>>>> On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>>>>>>>> On 4/12/24 08:37, Chao Gao wrote:
>>>>>>>>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>>>>>>>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>>>>>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>>>>>>>>> +
>>>>>>>>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>>>>> +           (entry->ebx & TDX_FEATURE_TSX);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    entry->ebx &= ~TDX_FEATURE_TSX;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>>>>> +           (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    if (has_tsx(entry))
>>>>>>>>>>>> +        clear_tsx(entry);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    if (has_waitpkg(entry))
>>>>>>>>>>>> +        clear_waitpkg(entry);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    return has_tsx(entry) || has_waitpkg(entry);
>>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>>>>>>>>> ensures that unconfigurable bits are not set by userspace.
>>>>>>>>>>
>>>>>>>>>> Aren't they configurable?
>>>>>>>>>
>>>>>>>>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>>>>>>>>> so they are not configurable from a userspace perspective. Did I miss anything?
>>>>>>>>> KVM should check user inputs against its adjusted configurable bitmap, right?
>>>>>>>>
>>>>>>>> Maybe I misunderstand but we rely on the TDX module to reject
>>>>>>>> invalid configuration.  We don't check exactly what is configurable
>>>>>>>> for the TDX Module.
>>>>>>>
>>>>>>> Ok, this is what I missed. I thought KVM validated user input and masked
>>>>>>> out all unsupported features. sorry for this.
>>>>>>>
>>>>>>>>
>>>>>>>> TSX and WAITPKG are not invalid for the TDX Module, but KVM
>>>>>>>> must either support them by restoring their MSRs, or disallow
>>>>>>>> them.  This patch disallows them for now.
>>>>>>>
>>>>>>> Yes. I agree. what if a new feature (supported by a future TDX module) also
>>>>>>> needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
>>>>>>> only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
>>>>>>> this is not a good design. Current KVM should work with future TDX modules.
>>>>>>
>>>>>> With respect to CPUID, I gather this kind of thing has been
>>>>>> discussed, such as here:
>>>>>>
>>>>>>       https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
>>>>>>
>>>>>> and Rick and Xiaoyao are working on something.
>>>>>>
>>>>>> In general, I would expect a new TDX Module would advertise support for
>>>>>> new features, but KVM would have to opt in to use them.
>>>>>>
>>>>>
>>>>> There were discussion[1] on whether KVM to gatekeep the configurable/supported CPUIDs for TDX. I stand by Sean that KVM needs to do so.
>>>>>
>>>>> Regarding KVM opt in the new feature, KVM gatekeeps the CPUID bit that can be set by userspace is exactly the behavior of opt-in. i.e., for a given KVM, it only allows a CPUID set {S} to be configured by userspace, if new TDX module supports new feature X, it needs KVM to opt-in X by adding X to {S} so that X is allowed to be configured by userspace.
>>>>>
>>>>> Besides, I find current interface between KVM and userspace lacks the ability to tell userspace what bits are not supported by KVM. KVM_TDX_CAPABILITIES.cpuid doesn't work because it represents the configurable CPUIDs, not supported CPUIDs (I think we might rename it to configurable_cpuid to better reflect its meaning). So userspace has to hardcode that TSX and WAITPKG is not support itself.
>>>>
>>>> I don't follow why hardcoding would be necessary.
>>>>
>>>> If the leaf is represented in KVM_TDX_CAPABILITIES.cpuid, and
>>>> the bits are 0 there, why would userspace try to set them to 1?
>>>
>>> Userspace doesn't set the bit to 1 in kvm_tdx_init_vm.cpuid, doesn't mean userspace wants the bit to be 0.
>>>
>>> Note, KVM_TDX_CAPABILITIES.cpuid reports the configurable bits. The value 0 of a bit in KVM_TDX_CAPABILITIES.cpuid means the bit is not configurable, not means the bit is unsupported.
>>
>> For features configurable by CPUID like TSX and WAITPKG,
>> a value of 0 does mean unsupported, because the value
>> has to be 1 to enable the feature.
>>
>>  From the TDX Module Base spec:
>>
>>     11.18. Transactional Synchronization Extensions (TSX)
>>     The host VMM can enable TSX for a TD by configuring the following CPUID bits as enabled in the TD_PARAMS input to
>>     TDH.MNG.INIT:
>>     - CPUID(7,0).EBX[4] (HLE)
>>     - CPUID(7,0).EBX[11] (RTM)
>>     etc
>>
>>     11.19.4. WAITPKG: TPAUSE, UMONITOR and UMWAIT Instructions
>>     The host VMM may allow guest TDs to use the TPAUSE, UMONITOR and UMWAIT instructions, if the CPU supports them,
>>     by configuring the virtual value of CPUID(7,0).ECX[5] (WAITPKG) to 1 using the CPUID configuration table which is part
>>     the TD_PARAMS input to TDH.MNG.INIT. Enabling CPUID(7,0).ECX[5] also enables TD access to IA32_UMWAIT_CONTROL
>>     (MSR 0xE1).
>>     If not allowed, then TD execution of TPAUSE, UMONITOR or UMWAIT results in a #UD, and access to
>>     IA32_UMWAIT_CONTROL results in a #GP(0).
>>
>>> For kvm_tdx_init_vm.cpuid,
>>>   - if the corresponding bit is reported as 1 in KVM_TDX_CAPABILITIES.cpuid, then a value 0 in kvm_tdx_init_vm.cpuid means userspace wants to configure it as 0.
>>>   - if the corresponding bit is reported as 0 in KVM_TDX_CAPABILITIES.cpuid, then userspace has to pass a value 0 in kvm_tdx_init_vm.cpuid. But it doesn't mean the value of the bit will be 0.
>>>
>>> e.g., X2APIC bit is 0 in KVM_TDX_CAPABILITIES.cpuid, and it's also 0 in kvm_tdx_init_vm.cpuid, but TD guest sees a value of 1. In the view of QEMU, it maintains the bit of X2APIC as 1, and QEMU filters X2APIC bit when calling KVM_TDX_INIT_VM because X2APIC is not configurable.
>>>
>>> So when it comes to TSX and WAITPKG, QEMU also needs an interface to be informed that they are unsupported. Without the interface of fixed0 bits reported by KVM, QEMU needs to hardcode itself like [1]. The problem of hardcode is that it will conflict when future KVM allows them to be configurable.
>>
>> So TSX and WAITPKG support should be based on KVM_TDX_CAPABILITIES.cpuid, and not hardcoded.
> 
> This requires Userspace to have the knowledge that "TSX and WAITPKG is supposed to be configurable and reported as 1 in KVM_TDX_CAPABILITIES.cpuid in normal case". And if userspace gets a value 0 of TSX and Waitpkg in KVM_TDX_CAPABILITIES.cpuid, it means either KVM doesn't support/allow it or the underlying TDX module doesn't.
> 
> I think it's an implicit form of hardcode. From the perspective of userspace, I would like to avoid all the special handling.

It is consistent with what is done for TD attributes and XFAM.  Presumably feature names must be mapped to capability bits and configuration bits, and that information has to be represented somewhere.

> 
>>>
>>> In the future, if we have interface from KVM to report the fixed0 and fixed1 bit (on top of the proposal [2]), userspace can drop the hardcoded one it maintains. At that time, KVM can ensure no conflict by removing the bits from fixed0/1 array when allowing them to be configurable.
>>>
>>> [1] https://lore.kernel.org/qemu-devel/20241105062408.3533704-49-xiaoyao.li@intel.com/
>>> [2] https://lore.kernel.org/all/43b26df1-4c27-41ff-a482-e258f872cc31@intel.com/
>>>
>>>>>
>>>>> [1] https://lore.kernel.org/all/ZuM12EFbOXmpHHVQ@google.com/
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list
  2024-12-09  7:08                                         ` Adrian Hunter
@ 2024-12-10  2:45                                           ` Xiaoyao Li
  0 siblings, 0 replies; 82+ messages in thread
From: Xiaoyao Li @ 2024-12-10  2:45 UTC (permalink / raw)
  To: Adrian Hunter, Chao Gao
  Cc: Edgecombe, Rick P, seanjc@google.com, kvm@vger.kernel.org,
	Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, Yang, Weijiang,
	binbin.wu@linux.intel.com, dmatlack@google.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	tony.lindgren@linux.intel.com, nik.borisov@suse.com,
	Chatre, Reinette, x86@kernel.org

On 12/9/2024 3:08 PM, Adrian Hunter wrote:
> On 9/12/24 04:46, Xiaoyao Li wrote:
>> On 12/6/2024 10:40 PM, Adrian Hunter wrote:
>>> On 6/12/24 05:37, Xiaoyao Li wrote:
>>>> On 12/6/2024 1:31 AM, Adrian Hunter wrote:
>>>>> On 4/12/24 17:33, Xiaoyao Li wrote:
>>>>>> On 12/4/2024 7:55 PM, Adrian Hunter wrote:
>>>>>>> On 4/12/24 13:13, Chao Gao wrote:
>>>>>>>> On Wed, Dec 04, 2024 at 08:57:23AM +0200, Adrian Hunter wrote:
>>>>>>>>> On 4/12/24 08:37, Chao Gao wrote:
>>>>>>>>>> On Wed, Dec 04, 2024 at 08:18:32AM +0200, Adrian Hunter wrote:
>>>>>>>>>>> On 4/12/24 03:25, Chao Gao wrote:
>>>>>>>>>>>>> +#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>>>>>> +           (entry->ebx & TDX_FEATURE_TSX);
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +static void clear_tsx(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    entry->ebx &= ~TDX_FEATURE_TSX;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    return entry->function == 7 && entry->index == 0 &&
>>>>>>>>>>>>> +           (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    if (has_tsx(entry))
>>>>>>>>>>>>> +        clear_tsx(entry);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    if (has_waitpkg(entry))
>>>>>>>>>>>>> +        clear_waitpkg(entry);
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    return has_tsx(entry) || has_waitpkg(entry);
>>>>>>>>>>>>> +}
>>>>>>>>>>>>
>>>>>>>>>>>> No need to check TSX/WAITPKG explicitly because setup_tdparams_cpuids() already
>>>>>>>>>>>> ensures that unconfigurable bits are not set by userspace.
>>>>>>>>>>>
>>>>>>>>>>> Aren't they configurable?
>>>>>>>>>>
>>>>>>>>>> They are cleared from the configurable bitmap by tdx_clear_unsupported_cpuid(),
>>>>>>>>>> so they are not configurable from a userspace perspective. Did I miss anything?
>>>>>>>>>> KVM should check user inputs against its adjusted configurable bitmap, right?
>>>>>>>>>
>>>>>>>>> Maybe I misunderstand but we rely on the TDX module to reject
>>>>>>>>> invalid configuration.  We don't check exactly what is configurable
>>>>>>>>> for the TDX Module.
>>>>>>>>
>>>>>>>> Ok, this is what I missed. I thought KVM validated user input and masked
>>>>>>>> out all unsupported features. sorry for this.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> TSX and WAITPKG are not invalid for the TDX Module, but KVM
>>>>>>>>> must either support them by restoring their MSRs, or disallow
>>>>>>>>> them.  This patch disallows them for now.
>>>>>>>>
>>>>>>>> Yes. I agree. what if a new feature (supported by a future TDX module) also
>>>>>>>> needs KVM to restore some MSRs? current KVM will allow it to be exposed (since
>>>>>>>> only TSX/WAITPKG are checked); then some MSRs may get corrupted. I may think
>>>>>>>> this is not a good design. Current KVM should work with future TDX modules.
>>>>>>>
>>>>>>> With respect to CPUID, I gather this kind of thing has been
>>>>>>> discussed, such as here:
>>>>>>>
>>>>>>>        https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
>>>>>>>
>>>>>>> and Rick and Xiaoyao are working on something.
>>>>>>>
>>>>>>> In general, I would expect a new TDX Module would advertise support for
>>>>>>> new features, but KVM would have to opt in to use them.
>>>>>>>
>>>>>>
>>>>>> There were discussion[1] on whether KVM to gatekeep the configurable/supported CPUIDs for TDX. I stand by Sean that KVM needs to do so.
>>>>>>
>>>>>> Regarding KVM opt in the new feature, KVM gatekeeps the CPUID bit that can be set by userspace is exactly the behavior of opt-in. i.e., for a given KVM, it only allows a CPUID set {S} to be configured by userspace, if new TDX module supports new feature X, it needs KVM to opt-in X by adding X to {S} so that X is allowed to be configured by userspace.
>>>>>>
>>>>>> Besides, I find current interface between KVM and userspace lacks the ability to tell userspace what bits are not supported by KVM. KVM_TDX_CAPABILITIES.cpuid doesn't work because it represents the configurable CPUIDs, not supported CPUIDs (I think we might rename it to configurable_cpuid to better reflect its meaning). So userspace has to hardcode that TSX and WAITPKG is not support itself.
>>>>>
>>>>> I don't follow why hardcoding would be necessary.
>>>>>
>>>>> If the leaf is represented in KVM_TDX_CAPABILITIES.cpuid, and
>>>>> the bits are 0 there, why would userspace try to set them to 1?
>>>>
>>>> Userspace doesn't set the bit to 1 in kvm_tdx_init_vm.cpuid, doesn't mean userspace wants the bit to be 0.
>>>>
>>>> Note, KVM_TDX_CAPABILITIES.cpuid reports the configurable bits. The value 0 of a bit in KVM_TDX_CAPABILITIES.cpuid means the bit is not configurable, not means the bit is unsupported.
>>>
>>> For features configurable by CPUID like TSX and WAITPKG,
>>> a value of 0 does mean unsupported, because the value
>>> has to be 1 to enable the feature.
>>>
>>>   From the TDX Module Base spec:
>>>
>>>      11.18. Transactional Synchronization Extensions (TSX)
>>>      The host VMM can enable TSX for a TD by configuring the following CPUID bits as enabled in the TD_PARAMS input to
>>>      TDH.MNG.INIT:
>>>      - CPUID(7,0).EBX[4] (HLE)
>>>      - CPUID(7,0).EBX[11] (RTM)
>>>      etc
>>>
>>>      11.19.4. WAITPKG: TPAUSE, UMONITOR and UMWAIT Instructions
>>>      The host VMM may allow guest TDs to use the TPAUSE, UMONITOR and UMWAIT instructions, if the CPU supports them,
>>>      by configuring the virtual value of CPUID(7,0).ECX[5] (WAITPKG) to 1 using the CPUID configuration table which is part
>>>      the TD_PARAMS input to TDH.MNG.INIT. Enabling CPUID(7,0).ECX[5] also enables TD access to IA32_UMWAIT_CONTROL
>>>      (MSR 0xE1).
>>>      If not allowed, then TD execution of TPAUSE, UMONITOR or UMWAIT results in a #UD, and access to
>>>      IA32_UMWAIT_CONTROL results in a #GP(0).
>>>
>>>> For kvm_tdx_init_vm.cpuid,
>>>>    - if the corresponding bit is reported as 1 in KVM_TDX_CAPABILITIES.cpuid, then a value 0 in kvm_tdx_init_vm.cpuid means userspace wants to configure it as 0.
>>>>    - if the corresponding bit is reported as 0 in KVM_TDX_CAPABILITIES.cpuid, then userspace has to pass a value 0 in kvm_tdx_init_vm.cpuid. But it doesn't mean the value of the bit will be 0.
>>>>
>>>> e.g., X2APIC bit is 0 in KVM_TDX_CAPABILITIES.cpuid, and it's also 0 in kvm_tdx_init_vm.cpuid, but TD guest sees a value of 1. In the view of QEMU, it maintains the bit of X2APIC as 1, and QEMU filters X2APIC bit when calling KVM_TDX_INIT_VM because X2APIC is not configurable.
>>>>
>>>> So when it comes to TSX and WAITPKG, QEMU also needs an interface to be informed that they are unsupported. Without the interface of fixed0 bits reported by KVM, QEMU needs to hardcode itself like [1]. The problem of hardcode is that it will conflict when future KVM allows them to be configurable.
>>>
>>> So TSX and WAITPKG support should be based on KVM_TDX_CAPABILITIES.cpuid, and not hardcoded.
>>
>> This requires Userspace to have the knowledge that "TSX and WAITPKG is supposed to be configurable and reported as 1 in KVM_TDX_CAPABILITIES.cpuid in normal case". And if userspace gets a value 0 of TSX and Waitpkg in KVM_TDX_CAPABILITIES.cpuid, it means either KVM doesn't support/allow it or the underlying TDX module doesn't.
>>
>> I think it's an implicit form of hardcode. From the perspective of userspace, I would like to avoid all the special handling.
> 
> It is consistent with what is done for TD attributes and XFAM.  Presumably feature names must be mapped to capability bits and configuration bits, and that information has to be represented somewhere.

Attrbutes and XFAM are different to cpuid either in KVM_TDX_CAPABILITIES 
or in struct kvm_tdx_init_vm.

KVM_TDX_CAPABILITIES reports the *supported* bits for Attributes and 
XFAM while it reports the configurable bits for CPUID.

The bits passed in struct kvm_tdx_init_vm for Attributes and XFAM are a 
whole of the final value that will be set for TD guest. While the bits 
passed in struct kvm_tdx_init_vm for cpuid are only the value of 
configurable bits, the final value that will be set for TD guest needs 
to be OR'ed with fixed1 bits.

 From the perspective of userspace,the information in 
KVM_TDX_CAPABILITIES.supported_attributes and 
KVM_TDX_CAPABILITIES.supported_xfam are enough, bits reported as 0 in 
them means they are not supported. While for cpuid, in general, the bits 
reported as 0 in KVM_TDX_CAPABILITIES.cpuid doesn't mean they are not 
supported.

Anyway, my point is, in general, value 0 of a bit in 
KVM_TDX_CAPABILITIES.cpuid doesn't mean the bit is not supported for TD 
guest. It's better to have KVM expose an interface to report the 
unsupported bits. Without it, userspace has to hardcode some information 
and maintain it itself.

QEMU already hardcodes the fixed0 and fixed1 bits itself[1] according to 
a specific version of TDX spec, while the tweak to TSX and WAITPKG in 
KVM will require more specific handling in QEMU. It's not the fault of 
the change to TSX/WAITPKG in KVM_TDX_CAPABILITIES.cpuid. It somehow 
reveals it's getting messier when KVM cannot report enough information 
to userspace.

[1] 
https://lore.kernel.org/qemu-devel/20241105062408.3533704-49-xiaoyao.li@intel.com/
>>
>>>>
>>>> In the future, if we have interface from KVM to report the fixed0 and fixed1 bit (on top of the proposal [2]), userspace can drop the hardcoded one it maintains. At that time, KVM can ensure no conflict by removing the bits from fixed0/1 array when allowing them to be configurable.
>>>>
>>>> [1] https://lore.kernel.org/qemu-devel/20241105062408.3533704-49-xiaoyao.li@intel.com/
>>>> [2] https://lore.kernel.org/all/43b26df1-4c27-41ff-a482-e258f872cc31@intel.com/
>>>>
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/ZuM12EFbOXmpHHVQ@google.com/
>>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
  2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (7 preceding siblings ...)
  2024-11-25  1:25 ` [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Binbin Wu
@ 2024-12-10 18:23 ` Paolo Bonzini
  8 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2024-12-10 18:23 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: pbonzini, seanjc, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	chao.gao, weijiang.yang

Applied to kvm-coco-queue, thanks.

Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-12-04 15:58           ` Adrian Hunter
@ 2024-12-11 18:43             ` Adrian Hunter
  2024-12-13 15:45               ` Adrian Hunter
  2024-12-13 16:16               ` Dave Hansen
  0 siblings, 2 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-12-11 18:43 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

The diff below shows another alternative.  This time using
structs not a union.  The structs are easier to read than
the union, and require copying arguments, which also allows
using types that have sizes other than a GPR's (u64) size.

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 192ae798b214..85f87d90ac89 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -21,20 +21,6 @@
 /* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */
 #define TDCS_NOTIFY_ENABLES		0x9100000000000010
 
-/* TDX hypercall Leaf IDs */
-#define TDVMCALL_GET_TD_VM_CALL_INFO	0x10000
-#define TDVMCALL_MAP_GPA		0x10001
-#define TDVMCALL_GET_QUOTE		0x10002
-#define TDVMCALL_REPORT_FATAL_ERROR	0x10003
-
-/*
- * TDG.VP.VMCALL Status Codes (returned in R10)
- */
-#define TDVMCALL_STATUS_SUCCESS		0x0000000000000000ULL
-#define TDVMCALL_STATUS_RETRY		0x0000000000000001ULL
-#define TDVMCALL_STATUS_INVALID_OPERAND	0x8000000000000000ULL
-#define TDVMCALL_STATUS_ALIGN_ERROR	0x8000000000000002ULL
-
 /*
  * Bitmasks of exposed registers (with VMM).
  */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 01409a59224d..e4a45378a84b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -33,6 +33,7 @@
 
 #ifndef __ASSEMBLY__
 
+#include <linux/kvm_types.h>
 #include <uapi/asm/mce.h>
 #include "tdx_global_metadata.h"
 
@@ -96,6 +97,7 @@ u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
 void tdx_init(void);
 
 #include <asm/archrandom.h>
+#include <asm/vmx.h>
 
 typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
 
@@ -123,8 +125,122 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
 int tdx_guest_keyid_alloc(void);
 void tdx_guest_keyid_free(unsigned int keyid);
 
+/* TDG.VP.VMCALL Sub-function */
+enum tdvmcall_subfn {
+	TDVMCALL_NONE			= -1, /* Not a TDG.VP.VMCALL */
+	TDVMCALL_GET_TD_VM_CALL_INFO	= 0x10000,
+	TDVMCALL_MAP_GPA		= 0x10001,
+	TDVMCALL_GET_QUOTE		= 0x10002,
+	TDVMCALL_REPORT_FATAL_ERROR	= 0x10003,
+	TDVMCALL_CPUID			= EXIT_REASON_CPUID,
+	TDVMCALL_HLT			= EXIT_REASON_HLT,
+	TDVMCALL_IO			= EXIT_REASON_IO_INSTRUCTION,
+	TDVMCALL_RDMSR			= EXIT_REASON_MSR_READ,
+	TDVMCALL_WRMSR			= EXIT_REASON_MSR_WRITE,
+	TDVMCALL_MMIO			= EXIT_REASON_EPT_VIOLATION,
+};
+
+enum tdx_io_direction {
+	TDX_READ,
+	TDX_WRITE
+};
+
+/* TDG.VP.VMCALL Sub-function Completion Status Codes */
+enum tdvmcall_status {
+	TDVMCALL_STATUS_SUCCESS		= 0x0000000000000000ULL,
+	TDVMCALL_STATUS_RETRY		= 0x0000000000000001ULL,
+	TDVMCALL_STATUS_INVALID_OPERAND	= 0x8000000000000000ULL,
+	TDVMCALL_STATUS_ALIGN_ERROR	= 0x8000000000000002ULL,
+};
+
+struct tdh_vp_enter_in {
+	/* TDG.VP.VMCALL common */
+	enum tdvmcall_status	ret_code;
+
+	/* TDG.VP.VMCALL Sub-function return information */
+
+	/* TDVMCALL_GET_TD_VM_CALL_INFO */
+	u64			gettdvmcallinfo[4];
+
+	/* TDVMCALL_MAP_GPA */
+	gpa_t			failed_gpa;
+
+	/* TDVMCALL_CPUID */
+	u32			eax;
+	u32			ebx;
+	u32			ecx;
+	u32			edx;
+
+	/* TDVMCALL_IO (read), TDVMCALL_RDMSR or TDVMCALL_MMIO (read) */
+	u64			value_read;
+};
+
+#define TDX_ERR_DATA_SZ 8
+
+struct tdh_vp_enter_out {
+	u64			exit_qual;
+	u32			intr_info;
+	u64			ext_exit_qual;
+	gpa_t			gpa;
+
+	/* TDG.VP.VMCALL common */
+	u32			reg_mask;
+	u64			fn;		/* Non-zero for KVM hypercalls, zero otherwise */
+	enum tdvmcall_subfn	subfn;
+
+	/* TDG.VP.VMCALL Sub-function arguments */
+
+	/* KVM hypercall */
+	u64			nr;
+	u64			p1;
+	u64			p2;
+	u64			p3;
+	u64			p4;
+
+	/* TDVMCALL_GET_TD_VM_CALL_INFO */
+	u64			leaf;
+
+	/* TDVMCALL_MAP_GPA */
+	gpa_t			map_gpa;
+	u64			map_gpa_size;
+
+	/* TDVMCALL_GET_QUOTE */
+	gpa_t			shared_gpa;
+	u64			shared_gpa_size;
+
+	/* TDVMCALL_REPORT_FATAL_ERROR */
+	u64			err_codes;
+	gpa_t			err_data_gpa;
+	u64			err_data[TDX_ERR_DATA_SZ];
+
+	/* TDVMCALL_CPUID */
+	u32			cpuid_leaf;
+	u32			cpuid_subleaf;
+
+	/* TDVMCALL_MMIO */
+	int			mmio_size;
+	enum tdx_io_direction	mmio_direction;
+	gpa_t			mmio_addr;
+	u32			mmio_value;
+
+	/* TDVMCALL_HLT */
+	bool			intr_blocked_flag;
+
+	/* TDVMCALL_IO_INSTRUCTION */
+	int			io_size;
+	enum tdx_io_direction	io_direction;
+	u16			io_port;
+	u32			io_value;
+
+	/* TDVMCALL_MSR_READ or TDVMCALL_MSR_WRITE */
+	u32			msr;
+
+	/* TDVMCALL_MSR_WRITE */
+	u64			write_value;
+};
+
 /* SEAMCALL wrappers for creating/destroying/running TDX guests */
-u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args);
+u64 tdh_vp_enter(u64 tdvpr, const struct tdh_vp_enter_in *in, struct tdh_vp_enter_out *out);
 u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
 u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx);
 u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 218801618e9a..a8283a03fdd4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -256,57 +256,41 @@ static __always_inline bool tdx_check_exit_reason(struct kvm_vcpu *vcpu, u16 rea
 
 static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
 {
-	return kvm_rcx_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_out.exit_qual;
 }
 
 static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
 {
-	return kvm_rdx_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_out.ext_exit_qual;
 }
 
-static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
+static __always_inline gpa_t tdexit_gpa(struct kvm_vcpu *vcpu)
 {
-	return kvm_r8_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_out.gpa;
 }
 
 static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
 {
-	return kvm_r9_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_out.intr_info;
 }
 
-#define BUILD_TDVMCALL_ACCESSORS(param, gpr)				\
-static __always_inline							\
-unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu)		\
-{									\
-	return kvm_##gpr##_read(vcpu);					\
-}									\
-static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
-						     unsigned long val)  \
-{									\
-	kvm_##gpr##_write(vcpu, val);					\
-}
-BUILD_TDVMCALL_ACCESSORS(a0, r12);
-BUILD_TDVMCALL_ACCESSORS(a1, r13);
-BUILD_TDVMCALL_ACCESSORS(a2, r14);
-BUILD_TDVMCALL_ACCESSORS(a3, r15);
-
-static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+static __always_inline unsigned long tdvmcall_fn(struct kvm_vcpu *vcpu)
 {
-	return kvm_r10_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_out.fn;
 }
-static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
+static __always_inline enum tdvmcall_subfn tdvmcall_subfn(struct kvm_vcpu *vcpu)
 {
-	return kvm_r11_read(vcpu);
+	return to_tdx(vcpu)->vp_enter_out.subfn;
 }
 static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
-						     long val)
+						     enum tdvmcall_status val)
 {
-	kvm_r10_write(vcpu, val);
+	to_tdx(vcpu)->vp_enter_in.ret_code = val;
 }
 static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
 						    unsigned long val)
 {
-	kvm_r11_write(vcpu, val);
+	to_tdx(vcpu)->vp_enter_in.value_read = val;
 }
 
 static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
@@ -786,10 +770,10 @@ bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu)
 	 * passes the interrupt block flag.
 	 */
 	if (!tdx_check_exit_reason(vcpu, EXIT_REASON_TDCALL) ||
-	    tdvmcall_exit_type(vcpu) || tdvmcall_leaf(vcpu) != EXIT_REASON_HLT)
+	    tdvmcall_fn(vcpu) || tdvmcall_subfn(vcpu) != TDVMCALL_HLT)
 	    return true;
 
-	return !tdvmcall_a0_read(vcpu);
+	return !to_tdx(vcpu)->vp_enter_out.intr_blocked_flag;
 }
 
 bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
@@ -945,51 +929,10 @@ static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
 static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
-	struct tdx_module_args args;
 
 	guest_state_enter_irqoff();
 
-	/*
-	 * TODO: optimization:
-	 * - Eliminate copy between args and vcpu->arch.regs.
-	 * - copyin/copyout registers only if (tdx->tdvmvall.regs_mask != 0)
-	 *   which means TDG.VP.VMCALL.
-	 */
-	args = (struct tdx_module_args) {
-		.rcx = tdx->tdvpr_pa,
-#define REG(reg, REG)	.reg = vcpu->arch.regs[VCPU_REGS_ ## REG]
-		REG(rdx, RDX),
-		REG(r8,  R8),
-		REG(r9,  R9),
-		REG(r10, R10),
-		REG(r11, R11),
-		REG(r12, R12),
-		REG(r13, R13),
-		REG(r14, R14),
-		REG(r15, R15),
-		REG(rbx, RBX),
-		REG(rdi, RDI),
-		REG(rsi, RSI),
-#undef REG
-	};
-
-	tdx->vp_enter_ret = tdh_vp_enter(tdx->tdvpr_pa, &args);
-
-#define REG(reg, REG)	vcpu->arch.regs[VCPU_REGS_ ## REG] = args.reg
-	REG(rcx, RCX);
-	REG(rdx, RDX);
-	REG(r8,  R8);
-	REG(r9,  R9);
-	REG(r10, R10);
-	REG(r11, R11);
-	REG(r12, R12);
-	REG(r13, R13);
-	REG(r14, R14);
-	REG(r15, R15);
-	REG(rbx, RBX);
-	REG(rdi, RDI);
-	REG(rsi, RSI);
-#undef REG
+	tdx->vp_enter_ret = tdh_vp_enter(tdx->tdvpr_pa, &tdx->vp_enter_in, &tdx->vp_enter_out);
 
 	if (tdx_check_exit_reason(vcpu, EXIT_REASON_EXCEPTION_NMI) &&
 	    is_nmi(tdexit_intr_info(vcpu)))
@@ -1128,8 +1071,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 
 static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int r;
 
+	kvm_r10_write(vcpu, tdx->vp_enter_out.nr);
+	kvm_r11_write(vcpu, tdx->vp_enter_out.p1);
+	kvm_r12_write(vcpu, tdx->vp_enter_out.p2);
+	kvm_r13_write(vcpu, tdx->vp_enter_out.p3);
+	kvm_r14_write(vcpu, tdx->vp_enter_out.p4);
+
 	/*
 	 * ABI for KVM tdvmcall argument:
 	 * In Guest-Hypervisor Communication Interface(GHCI) specification,
@@ -1137,13 +1087,12 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
 	 * vendor-specific.  KVM uses this for KVM hypercall.  NOTE: KVM
 	 * hypercall number starts from one.  Zero isn't used for KVM hypercall
 	 * number.
-	 *
-	 * R10: KVM hypercall number
-	 * arguments: R11, R12, R13, R14.
 	 */
 	r = __kvm_emulate_hypercall(vcpu, r10, r11, r12, r13, r14, true, 0,
 				    complete_hypercall_exit);
 
+	tdvmcall_set_return_code(vcpu, kvm_r10_read(vcpu));
+
 	return r > 0;
 }
 
@@ -1161,7 +1110,7 @@ static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
 
 	if(vcpu->run->hypercall.ret) {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
-		kvm_r11_write(vcpu, tdx->map_gpa_next);
+		tdx->vp_enter_in.failed_gpa = tdx->map_gpa_next;
 		return 1;
 	}
 
@@ -1182,7 +1131,7 @@ static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
 	if (pi_has_pending_interrupt(vcpu) ||
 	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
-		kvm_r11_write(vcpu, tdx->map_gpa_next);
+		tdx->vp_enter_in.failed_gpa = tdx->map_gpa_next;
 		return 1;
 	}
 
@@ -1214,8 +1163,8 @@ static void __tdx_map_gpa(struct vcpu_tdx * tdx)
 static int tdx_map_gpa(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx * tdx = to_tdx(vcpu);
-	u64 gpa = tdvmcall_a0_read(vcpu);
-	u64 size = tdvmcall_a1_read(vcpu);
+	u64 gpa  = tdx->vp_enter_out.map_gpa;
+	u64 size = tdx->vp_enter_out.map_gpa_size;
 	u64 ret;
 
 	/*
@@ -1251,14 +1200,17 @@ static int tdx_map_gpa(struct kvm_vcpu *vcpu)
 
 error:
 	tdvmcall_set_return_code(vcpu, ret);
-	kvm_r11_write(vcpu, gpa);
+	tdx->vp_enter_in.failed_gpa = gpa;
 	return 1;
 }
 
 static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 {
-	u64 reg_mask = kvm_rcx_read(vcpu);
-	u64* opt_regs;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	__u64 *data = &vcpu->run->system_event.data[0];
+	u64 reg_mask = tdx->vp_enter_out.reg_mask;
+	const int mask[] = {14, 15, 3, 7, 6, 8, 9, 2};
+	int cnt = 0;
 
 	/*
 	 * Skip sanity checks and let userspace decide what to do if sanity
@@ -1266,32 +1218,20 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 	 */
 	vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
 	vcpu->run->system_event.type = KVM_SYSTEM_EVENT_TDX_FATAL;
-	vcpu->run->system_event.ndata = 10;
 	/* Error codes. */
-	vcpu->run->system_event.data[0] = tdvmcall_a0_read(vcpu);
+	data[cnt++] = tdx->vp_enter_out.err_codes;
 	/* GPA of additional information page. */
-	vcpu->run->system_event.data[1] = tdvmcall_a1_read(vcpu);
+	data[cnt++] = tdx->vp_enter_out.err_data_gpa;
+
 	/* Information passed via registers (up to 64 bytes). */
-	opt_regs = &vcpu->run->system_event.data[2];
+	for (int i = 0; i < TDX_ERR_DATA_SZ; i++) {
+		if (reg_mask & BIT_ULL(mask[i]))
+			data[cnt++] = tdx->vp_enter_out.err_data[i];
+		else
+			data[cnt++] = 0;
+	}
 
-#define COPY_REG(REG, MASK)						\
-	do {								\
-		if (reg_mask & MASK)					\
-			*opt_regs = kvm_ ## REG ## _read(vcpu);		\
-		else							\
-			*opt_regs = 0;					\
-		opt_regs++;						\
-	} while (0)
-
-	/* The order is defined in GHCI. */
-	COPY_REG(r14, BIT_ULL(14));
-	COPY_REG(r15, BIT_ULL(15));
-	COPY_REG(rbx, BIT_ULL(3));
-	COPY_REG(rdi, BIT_ULL(7));
-	COPY_REG(rsi, BIT_ULL(6));
-	COPY_REG(r8, BIT_ULL(8));
-	COPY_REG(r9, BIT_ULL(9));
-	COPY_REG(rdx, BIT_ULL(2));
+	vcpu->run->system_event.ndata = cnt;
 
 	/*
 	 * Set the status code according to GHCI spec, although the vCPU may
@@ -1305,18 +1245,18 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 
 static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	u32 eax, ebx, ecx, edx;
 
-	/* EAX and ECX for cpuid is stored in R12 and R13. */
-	eax = tdvmcall_a0_read(vcpu);
-	ecx = tdvmcall_a1_read(vcpu);
+	eax = tdx->vp_enter_out.cpuid_leaf;
+	ecx = tdx->vp_enter_out.cpuid_subleaf;
 
 	kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);
 
-	tdvmcall_a0_write(vcpu, eax);
-	tdvmcall_a1_write(vcpu, ebx);
-	tdvmcall_a2_write(vcpu, ecx);
-	tdvmcall_a3_write(vcpu, edx);
+	tdx->vp_enter_in.eax = eax;
+	tdx->vp_enter_in.ebx = ebx;
+	tdx->vp_enter_in.ecx = ecx;
+	tdx->vp_enter_in.edx = edx;
 
 	tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
 
@@ -1356,6 +1296,7 @@ static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
 static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 {
 	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	unsigned long val = 0;
 	unsigned int port;
 	int size, ret;
@@ -1363,9 +1304,9 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 
 	++vcpu->stat.io_exits;
 
-	size = tdvmcall_a0_read(vcpu);
-	write = tdvmcall_a1_read(vcpu);
-	port = tdvmcall_a2_read(vcpu);
+	size  = tdx->vp_enter_out.io_size;
+	write = tdx->vp_enter_out.io_direction == TDX_WRITE;
+	port  = tdx->vp_enter_out.io_port;
 
 	if (size != 1 && size != 2 && size != 4) {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
@@ -1373,7 +1314,7 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 	}
 
 	if (write) {
-		val = tdvmcall_a3_read(vcpu);
+		val = tdx->vp_enter_out.io_value;
 		ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
 	} else {
 		ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
@@ -1443,14 +1384,15 @@ static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
 
 static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int size, write, r;
 	unsigned long val;
 	gpa_t gpa;
 
-	size = tdvmcall_a0_read(vcpu);
-	write = tdvmcall_a1_read(vcpu);
-	gpa = tdvmcall_a2_read(vcpu);
-	val = write ? tdvmcall_a3_read(vcpu) : 0;
+	size  = tdx->vp_enter_out.mmio_size;
+	write = tdx->vp_enter_out.mmio_direction == TDX_WRITE;
+	gpa   = tdx->vp_enter_out.mmio_addr;
+	val = write ? tdx->vp_enter_out.mmio_value : 0;
 
 	if (size != 1 && size != 2 && size != 4 && size != 8)
 		goto error;
@@ -1502,7 +1444,7 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
 
 static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
 {
-	u32 index = tdvmcall_a0_read(vcpu);
+	u32 index = to_tdx(vcpu)->vp_enter_out.msr;
 	u64 data;
 
 	if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ) ||
@@ -1520,8 +1462,8 @@ static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
 
 static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
 {
-	u32 index = tdvmcall_a0_read(vcpu);
-	u64 data = tdvmcall_a1_read(vcpu);
+	u32 index = to_tdx(vcpu)->vp_enter_out.msr;
+	u64 data  = to_tdx(vcpu)->vp_enter_out.write_value;
 
 	if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE) ||
 	    kvm_set_msr(vcpu, index, data)) {
@@ -1537,39 +1479,41 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
 
 static int tdx_get_td_vm_call_info(struct kvm_vcpu *vcpu)
 {
-	if (tdvmcall_a0_read(vcpu))
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (tdx->vp_enter_out.leaf) {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
-	else {
+	} else {
 		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
-		kvm_r11_write(vcpu, 0);
-		tdvmcall_a0_write(vcpu, 0);
-		tdvmcall_a1_write(vcpu, 0);
-		tdvmcall_a2_write(vcpu, 0);
+		tdx->vp_enter_in.gettdvmcallinfo[0] = 0;
+		tdx->vp_enter_in.gettdvmcallinfo[1] = 0;
+		tdx->vp_enter_in.gettdvmcallinfo[2] = 0;
+		tdx->vp_enter_in.gettdvmcallinfo[3] = 0;
 	}
 	return 1;
 }
 
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
-	if (tdvmcall_exit_type(vcpu))
+	if (tdvmcall_fn(vcpu))
 		return tdx_emulate_vmcall(vcpu);
 
-	switch (tdvmcall_leaf(vcpu)) {
+	switch (tdvmcall_subfn(vcpu)) {
 	case TDVMCALL_MAP_GPA:
 		return tdx_map_gpa(vcpu);
 	case TDVMCALL_REPORT_FATAL_ERROR:
 		return tdx_report_fatal_error(vcpu);
-	case EXIT_REASON_CPUID:
+	case TDVMCALL_CPUID:
 		return tdx_emulate_cpuid(vcpu);
-	case EXIT_REASON_HLT:
+	case TDVMCALL_HLT:
 		return tdx_emulate_hlt(vcpu);
-	case EXIT_REASON_IO_INSTRUCTION:
+	case TDVMCALL_IO:
 		return tdx_emulate_io(vcpu);
-	case EXIT_REASON_EPT_VIOLATION:
+	case TDVMCALL_MMIO:
 		return tdx_emulate_mmio(vcpu);
-	case EXIT_REASON_MSR_READ:
+	case TDVMCALL_RDMSR:
 		return tdx_emulate_rdmsr(vcpu);
-	case EXIT_REASON_MSR_WRITE:
+	case TDVMCALL_WRMSR:
 		return tdx_emulate_wrmsr(vcpu);
 	case TDVMCALL_GET_TD_VM_CALL_INFO:
 		return tdx_get_td_vm_call_info(vcpu);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 008180c0c30f..63d8b3359b10 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -69,6 +69,8 @@ struct vcpu_tdx {
 	struct list_head cpu_list;
 
 	u64 vp_enter_ret;
+	struct tdh_vp_enter_in vp_enter_in;
+	struct tdh_vp_enter_out vp_enter_out;
 
 	enum vcpu_tdx_state state;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 16e0b598c4ec..895d9ea4aeba 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -33,6 +33,7 @@
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
+#include <asm/vmx.h>
 #include <asm/tdx.h>
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -1600,11 +1601,122 @@ static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
 	return ret;
 }
 
-noinstr u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
+noinstr u64 tdh_vp_enter(u64 tdvpr, const struct tdh_vp_enter_in *in, struct tdh_vp_enter_out *out)
 {
-	args->rcx = tdvpr;
+	struct tdx_module_args args = {
+		.rcx = tdvpr,
+		.r10 = in->ret_code,
+	};
+	u64 ret;
 
-	return __seamcall_saved_ret(TDH_VP_ENTER, args);
+	/* If previous exit was TDG.VP.VMCALL */
+	switch (out->subfn) {
+	case TDVMCALL_GET_TD_VM_CALL_INFO:
+		args.r11 = in->gettdvmcallinfo[0];
+		args.r12 = in->gettdvmcallinfo[1];
+		args.r13 = in->gettdvmcallinfo[2];
+		args.r14 = in->gettdvmcallinfo[3];
+		break;
+	case TDVMCALL_MAP_GPA:
+		args.r11 = in->failed_gpa;
+		break;
+	case TDVMCALL_CPUID:
+		args.r12 = in->eax;
+		args.r13 = in->ebx;
+		args.r14 = in->ecx;
+		args.r15 = in->edx;
+		break;
+	case TDVMCALL_IO:
+	case TDVMCALL_RDMSR:
+	case TDVMCALL_MMIO:
+		args.r11 = in->value_read;
+		break;
+	case TDVMCALL_NONE:
+	case TDVMCALL_GET_QUOTE:
+	case TDVMCALL_REPORT_FATAL_ERROR:
+	case TDVMCALL_HLT:
+	case TDVMCALL_WRMSR:
+		break;
+	}
+
+	ret = __seamcall_saved_ret(TDH_VP_ENTER, &args);
+
+	if ((u16)ret == EXIT_REASON_TDCALL) {
+		out->reg_mask		= args.rcx;
+		out->fn = args.r10;
+		if (out->fn) {
+			out->nr		= args.r10;
+			out->p1		= args.r11;
+			out->p2		= args.r12;
+			out->p3		= args.r13;
+			out->p4		= args.r14;
+			out->subfn	= TDVMCALL_NONE;
+		} else {
+			out->subfn	= args.r11;
+		}
+	} else {
+		out->exit_qual		= args.rcx;
+		out->ext_exit_qual	= args.rdx;
+		out->gpa		= args.r8;
+		out->intr_info		= args.r9;
+		out->subfn		= TDVMCALL_NONE;
+	}
+
+	switch (out->subfn) {
+	case TDVMCALL_GET_TD_VM_CALL_INFO:
+		out->leaf		= args.r12;
+		break;
+	case TDVMCALL_MAP_GPA:
+		out->map_gpa		= args.r12;
+		out->map_gpa_size	= args.r13;
+		break;
+	case TDVMCALL_CPUID:
+		out->cpuid_leaf		= args.r12;
+		out->cpuid_subleaf	= args.r13;
+		break;
+	case TDVMCALL_IO:
+		out->io_size		= args.r12;
+		out->io_direction	= args.r13 ? TDX_WRITE : TDX_READ;
+		out->io_port		= args.r14;
+		out->io_value		= args.r15;
+		break;
+	case TDVMCALL_RDMSR:
+		out->msr		= args.r12;
+		break;
+	case TDVMCALL_MMIO:
+		out->mmio_size		= args.r12;
+		out->mmio_direction	= args.r13 ? TDX_WRITE : TDX_READ;
+		out->mmio_addr		= args.r14;
+		out->mmio_value		= args.r15;
+		break;
+	case TDVMCALL_NONE:
+		break;
+	case TDVMCALL_GET_QUOTE:
+		out->shared_gpa		= args.r12;
+		out->shared_gpa_size	= args.r13;
+		break;
+	case TDVMCALL_REPORT_FATAL_ERROR:
+		out->err_codes		= args.r12;
+		out->err_data_gpa	= args.r13;
+		out->err_data[0]	= args.r14;
+		out->err_data[1]	= args.r15;
+		out->err_data[2]	= args.rbx;
+		out->err_data[3]	= args.rdi;
+		out->err_data[4]	= args.rsi;
+		out->err_data[5]	= args.r8;
+		out->err_data[6]	= args.r9;
+		out->err_data[7]	= args.rdx;
+		break;
+	case TDVMCALL_HLT:
+		out->intr_blocked_flag	= args.r12;
+		break;
+	case TDVMCALL_WRMSR:
+		out->msr		= args.r12;
+		out->write_value	= args.r13;
+		break;
+	}
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(tdh_vp_enter);
 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-12-11 18:43             ` Adrian Hunter
@ 2024-12-13 15:45               ` Adrian Hunter
  2024-12-13 16:16               ` Dave Hansen
  1 sibling, 0 replies; 82+ messages in thread
From: Adrian Hunter @ 2024-12-13 15:45 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 11/12/24 20:43, Adrian Hunter wrote:
> The diff below shows another alternative.  This time using
> structs not a union.  The structs are easier to read than
> the union, and require copying arguments, which also allows
> using types that have sizes other than a GPR's (u64) size.

Dave, any comments on this one?

> 
> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> index 192ae798b214..85f87d90ac89 100644
> --- a/arch/x86/include/asm/shared/tdx.h
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -21,20 +21,6 @@
>  /* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */
>  #define TDCS_NOTIFY_ENABLES		0x9100000000000010
>  
> -/* TDX hypercall Leaf IDs */
> -#define TDVMCALL_GET_TD_VM_CALL_INFO	0x10000
> -#define TDVMCALL_MAP_GPA		0x10001
> -#define TDVMCALL_GET_QUOTE		0x10002
> -#define TDVMCALL_REPORT_FATAL_ERROR	0x10003
> -
> -/*
> - * TDG.VP.VMCALL Status Codes (returned in R10)
> - */
> -#define TDVMCALL_STATUS_SUCCESS		0x0000000000000000ULL
> -#define TDVMCALL_STATUS_RETRY		0x0000000000000001ULL
> -#define TDVMCALL_STATUS_INVALID_OPERAND	0x8000000000000000ULL
> -#define TDVMCALL_STATUS_ALIGN_ERROR	0x8000000000000002ULL
> -
>  /*
>   * Bitmasks of exposed registers (with VMM).
>   */
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 01409a59224d..e4a45378a84b 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -33,6 +33,7 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +#include <linux/kvm_types.h>
>  #include <uapi/asm/mce.h>
>  #include "tdx_global_metadata.h"
>  
> @@ -96,6 +97,7 @@ u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
>  void tdx_init(void);
>  
>  #include <asm/archrandom.h>
> +#include <asm/vmx.h>
>  
>  typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
>  
> @@ -123,8 +125,122 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
>  int tdx_guest_keyid_alloc(void);
>  void tdx_guest_keyid_free(unsigned int keyid);
>  
> +/* TDG.VP.VMCALL Sub-function */
> +enum tdvmcall_subfn {
> +	TDVMCALL_NONE			= -1, /* Not a TDG.VP.VMCALL */
> +	TDVMCALL_GET_TD_VM_CALL_INFO	= 0x10000,
> +	TDVMCALL_MAP_GPA		= 0x10001,
> +	TDVMCALL_GET_QUOTE		= 0x10002,
> +	TDVMCALL_REPORT_FATAL_ERROR	= 0x10003,
> +	TDVMCALL_CPUID			= EXIT_REASON_CPUID,
> +	TDVMCALL_HLT			= EXIT_REASON_HLT,
> +	TDVMCALL_IO			= EXIT_REASON_IO_INSTRUCTION,
> +	TDVMCALL_RDMSR			= EXIT_REASON_MSR_READ,
> +	TDVMCALL_WRMSR			= EXIT_REASON_MSR_WRITE,
> +	TDVMCALL_MMIO			= EXIT_REASON_EPT_VIOLATION,
> +};
> +
> +enum tdx_io_direction {
> +	TDX_READ,
> +	TDX_WRITE
> +};
> +
> +/* TDG.VP.VMCALL Sub-function Completion Status Codes */
> +enum tdvmcall_status {
> +	TDVMCALL_STATUS_SUCCESS		= 0x0000000000000000ULL,
> +	TDVMCALL_STATUS_RETRY		= 0x0000000000000001ULL,
> +	TDVMCALL_STATUS_INVALID_OPERAND	= 0x8000000000000000ULL,
> +	TDVMCALL_STATUS_ALIGN_ERROR	= 0x8000000000000002ULL,
> +};
> +
> +struct tdh_vp_enter_in {
> +	/* TDG.VP.VMCALL common */
> +	enum tdvmcall_status	ret_code;
> +
> +	/* TDG.VP.VMCALL Sub-function return information */
> +
> +	/* TDVMCALL_GET_TD_VM_CALL_INFO */
> +	u64			gettdvmcallinfo[4];
> +
> +	/* TDVMCALL_MAP_GPA */
> +	gpa_t			failed_gpa;
> +
> +	/* TDVMCALL_CPUID */
> +	u32			eax;
> +	u32			ebx;
> +	u32			ecx;
> +	u32			edx;
> +
> +	/* TDVMCALL_IO (read), TDVMCALL_RDMSR or TDVMCALL_MMIO (read) */
> +	u64			value_read;
> +};
> +
> +#define TDX_ERR_DATA_SZ 8
> +
> +struct tdh_vp_enter_out {
> +	u64			exit_qual;
> +	u32			intr_info;
> +	u64			ext_exit_qual;
> +	gpa_t			gpa;
> +
> +	/* TDG.VP.VMCALL common */
> +	u32			reg_mask;
> +	u64			fn;		/* Non-zero for KVM hypercalls, zero otherwise */
> +	enum tdvmcall_subfn	subfn;
> +
> +	/* TDG.VP.VMCALL Sub-function arguments */
> +
> +	/* KVM hypercall */
> +	u64			nr;
> +	u64			p1;
> +	u64			p2;
> +	u64			p3;
> +	u64			p4;
> +
> +	/* TDVMCALL_GET_TD_VM_CALL_INFO */
> +	u64			leaf;
> +
> +	/* TDVMCALL_MAP_GPA */
> +	gpa_t			map_gpa;
> +	u64			map_gpa_size;
> +
> +	/* TDVMCALL_GET_QUOTE */
> +	gpa_t			shared_gpa;
> +	u64			shared_gpa_size;
> +
> +	/* TDVMCALL_REPORT_FATAL_ERROR */
> +	u64			err_codes;
> +	gpa_t			err_data_gpa;
> +	u64			err_data[TDX_ERR_DATA_SZ];
> +
> +	/* TDVMCALL_CPUID */
> +	u32			cpuid_leaf;
> +	u32			cpuid_subleaf;
> +
> +	/* TDVMCALL_MMIO */
> +	int			mmio_size;
> +	enum tdx_io_direction	mmio_direction;
> +	gpa_t			mmio_addr;
> +	u32			mmio_value;
> +
> +	/* TDVMCALL_HLT */
> +	bool			intr_blocked_flag;
> +
> +	/* TDVMCALL_IO_INSTRUCTION */
> +	int			io_size;
> +	enum tdx_io_direction	io_direction;
> +	u16			io_port;
> +	u32			io_value;
> +
> +	/* TDVMCALL_MSR_READ or TDVMCALL_MSR_WRITE */
> +	u32			msr;
> +
> +	/* TDVMCALL_MSR_WRITE */
> +	u64			write_value;
> +};
> +
>  /* SEAMCALL wrappers for creating/destroying/running TDX guests */
> -u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args);
> +u64 tdh_vp_enter(u64 tdvpr, const struct tdh_vp_enter_in *in, struct tdh_vp_enter_out *out);
>  u64 tdh_mng_addcx(u64 tdr, u64 tdcs);
>  u64 tdh_mem_page_add(u64 tdr, u64 gpa, u64 hpa, u64 source, u64 *rcx, u64 *rdx);
>  u64 tdh_mem_sept_add(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 218801618e9a..a8283a03fdd4 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -256,57 +256,41 @@ static __always_inline bool tdx_check_exit_reason(struct kvm_vcpu *vcpu, u16 rea
>  
>  static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_rcx_read(vcpu);
> +	return to_tdx(vcpu)->vp_enter_out.exit_qual;
>  }
>  
>  static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_rdx_read(vcpu);
> +	return to_tdx(vcpu)->vp_enter_out.ext_exit_qual;
>  }
>  
> -static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
> +static __always_inline gpa_t tdexit_gpa(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_r8_read(vcpu);
> +	return to_tdx(vcpu)->vp_enter_out.gpa;
>  }
>  
>  static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_r9_read(vcpu);
> +	return to_tdx(vcpu)->vp_enter_out.intr_info;
>  }
>  
> -#define BUILD_TDVMCALL_ACCESSORS(param, gpr)				\
> -static __always_inline							\
> -unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu)		\
> -{									\
> -	return kvm_##gpr##_read(vcpu);					\
> -}									\
> -static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
> -						     unsigned long val)  \
> -{									\
> -	kvm_##gpr##_write(vcpu, val);					\
> -}
> -BUILD_TDVMCALL_ACCESSORS(a0, r12);
> -BUILD_TDVMCALL_ACCESSORS(a1, r13);
> -BUILD_TDVMCALL_ACCESSORS(a2, r14);
> -BUILD_TDVMCALL_ACCESSORS(a3, r15);
> -
> -static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
> +static __always_inline unsigned long tdvmcall_fn(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_r10_read(vcpu);
> +	return to_tdx(vcpu)->vp_enter_out.fn;
>  }
> -static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
> +static __always_inline enum tdvmcall_subfn tdvmcall_subfn(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_r11_read(vcpu);
> +	return to_tdx(vcpu)->vp_enter_out.subfn;
>  }
>  static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
> -						     long val)
> +						     enum tdvmcall_status val)
>  {
> -	kvm_r10_write(vcpu, val);
> +	to_tdx(vcpu)->vp_enter_in.ret_code = val;
>  }
>  static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
>  						    unsigned long val)
>  {
> -	kvm_r11_write(vcpu, val);
> +	to_tdx(vcpu)->vp_enter_in.value_read = val;
>  }
>  
>  static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> @@ -786,10 +770,10 @@ bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu)
>  	 * passes the interrupt block flag.
>  	 */
>  	if (!tdx_check_exit_reason(vcpu, EXIT_REASON_TDCALL) ||
> -	    tdvmcall_exit_type(vcpu) || tdvmcall_leaf(vcpu) != EXIT_REASON_HLT)
> +	    tdvmcall_fn(vcpu) || tdvmcall_subfn(vcpu) != TDVMCALL_HLT)
>  	    return true;
>  
> -	return !tdvmcall_a0_read(vcpu);
> +	return !to_tdx(vcpu)->vp_enter_out.intr_blocked_flag;
>  }
>  
>  bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> @@ -945,51 +929,10 @@ static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>  static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> -	struct tdx_module_args args;
>  
>  	guest_state_enter_irqoff();
>  
> -	/*
> -	 * TODO: optimization:
> -	 * - Eliminate copy between args and vcpu->arch.regs.
> -	 * - copyin/copyout registers only if (tdx->tdvmvall.regs_mask != 0)
> -	 *   which means TDG.VP.VMCALL.
> -	 */
> -	args = (struct tdx_module_args) {
> -		.rcx = tdx->tdvpr_pa,
> -#define REG(reg, REG)	.reg = vcpu->arch.regs[VCPU_REGS_ ## REG]
> -		REG(rdx, RDX),
> -		REG(r8,  R8),
> -		REG(r9,  R9),
> -		REG(r10, R10),
> -		REG(r11, R11),
> -		REG(r12, R12),
> -		REG(r13, R13),
> -		REG(r14, R14),
> -		REG(r15, R15),
> -		REG(rbx, RBX),
> -		REG(rdi, RDI),
> -		REG(rsi, RSI),
> -#undef REG
> -	};
> -
> -	tdx->vp_enter_ret = tdh_vp_enter(tdx->tdvpr_pa, &args);
> -
> -#define REG(reg, REG)	vcpu->arch.regs[VCPU_REGS_ ## REG] = args.reg
> -	REG(rcx, RCX);
> -	REG(rdx, RDX);
> -	REG(r8,  R8);
> -	REG(r9,  R9);
> -	REG(r10, R10);
> -	REG(r11, R11);
> -	REG(r12, R12);
> -	REG(r13, R13);
> -	REG(r14, R14);
> -	REG(r15, R15);
> -	REG(rbx, RBX);
> -	REG(rdi, RDI);
> -	REG(rsi, RSI);
> -#undef REG
> +	tdx->vp_enter_ret = tdh_vp_enter(tdx->tdvpr_pa, &tdx->vp_enter_in, &tdx->vp_enter_out);
>  
>  	if (tdx_check_exit_reason(vcpu, EXIT_REASON_EXCEPTION_NMI) &&
>  	    is_nmi(tdexit_intr_info(vcpu)))
> @@ -1128,8 +1071,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
>  
>  static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>  	int r;
>  
> +	kvm_r10_write(vcpu, tdx->vp_enter_out.nr);
> +	kvm_r11_write(vcpu, tdx->vp_enter_out.p1);
> +	kvm_r12_write(vcpu, tdx->vp_enter_out.p2);
> +	kvm_r13_write(vcpu, tdx->vp_enter_out.p3);
> +	kvm_r14_write(vcpu, tdx->vp_enter_out.p4);
> +
>  	/*
>  	 * ABI for KVM tdvmcall argument:
>  	 * In Guest-Hypervisor Communication Interface(GHCI) specification,
> @@ -1137,13 +1087,12 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
>  	 * vendor-specific.  KVM uses this for KVM hypercall.  NOTE: KVM
>  	 * hypercall number starts from one.  Zero isn't used for KVM hypercall
>  	 * number.
> -	 *
> -	 * R10: KVM hypercall number
> -	 * arguments: R11, R12, R13, R14.
>  	 */
>  	r = __kvm_emulate_hypercall(vcpu, r10, r11, r12, r13, r14, true, 0,
>  				    complete_hypercall_exit);
>  
> +	tdvmcall_set_return_code(vcpu, kvm_r10_read(vcpu));
> +
>  	return r > 0;
>  }
>  
> @@ -1161,7 +1110,7 @@ static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>  
>  	if(vcpu->run->hypercall.ret) {
>  		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> -		kvm_r11_write(vcpu, tdx->map_gpa_next);
> +		tdx->vp_enter_in.failed_gpa = tdx->map_gpa_next;
>  		return 1;
>  	}
>  
> @@ -1182,7 +1131,7 @@ static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>  	if (pi_has_pending_interrupt(vcpu) ||
>  	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
>  		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
> -		kvm_r11_write(vcpu, tdx->map_gpa_next);
> +		tdx->vp_enter_in.failed_gpa = tdx->map_gpa_next;
>  		return 1;
>  	}
>  
> @@ -1214,8 +1163,8 @@ static void __tdx_map_gpa(struct vcpu_tdx * tdx)
>  static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_tdx * tdx = to_tdx(vcpu);
> -	u64 gpa = tdvmcall_a0_read(vcpu);
> -	u64 size = tdvmcall_a1_read(vcpu);
> +	u64 gpa  = tdx->vp_enter_out.map_gpa;
> +	u64 size = tdx->vp_enter_out.map_gpa_size;
>  	u64 ret;
>  
>  	/*
> @@ -1251,14 +1200,17 @@ static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>  
>  error:
>  	tdvmcall_set_return_code(vcpu, ret);
> -	kvm_r11_write(vcpu, gpa);
> +	tdx->vp_enter_in.failed_gpa = gpa;
>  	return 1;
>  }
>  
>  static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
>  {
> -	u64 reg_mask = kvm_rcx_read(vcpu);
> -	u64* opt_regs;
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	__u64 *data = &vcpu->run->system_event.data[0];
> +	u64 reg_mask = tdx->vp_enter_out.reg_mask;
> +	const int mask[] = {14, 15, 3, 7, 6, 8, 9, 2};
> +	int cnt = 0;
>  
>  	/*
>  	 * Skip sanity checks and let userspace decide what to do if sanity
> @@ -1266,32 +1218,20 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
>  	 */
>  	vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
>  	vcpu->run->system_event.type = KVM_SYSTEM_EVENT_TDX_FATAL;
> -	vcpu->run->system_event.ndata = 10;
>  	/* Error codes. */
> -	vcpu->run->system_event.data[0] = tdvmcall_a0_read(vcpu);
> +	data[cnt++] = tdx->vp_enter_out.err_codes;
>  	/* GPA of additional information page. */
> -	vcpu->run->system_event.data[1] = tdvmcall_a1_read(vcpu);
> +	data[cnt++] = tdx->vp_enter_out.err_data_gpa;
> +
>  	/* Information passed via registers (up to 64 bytes). */
> -	opt_regs = &vcpu->run->system_event.data[2];
> +	for (int i = 0; i < TDX_ERR_DATA_SZ; i++) {
> +		if (reg_mask & BIT_ULL(mask[i]))
> +			data[cnt++] = tdx->vp_enter_out.err_data[i];
> +		else
> +			data[cnt++] = 0;
> +	}
>  
> -#define COPY_REG(REG, MASK)						\
> -	do {								\
> -		if (reg_mask & MASK)					\
> -			*opt_regs = kvm_ ## REG ## _read(vcpu);		\
> -		else							\
> -			*opt_regs = 0;					\
> -		opt_regs++;						\
> -	} while (0)
> -
> -	/* The order is defined in GHCI. */
> -	COPY_REG(r14, BIT_ULL(14));
> -	COPY_REG(r15, BIT_ULL(15));
> -	COPY_REG(rbx, BIT_ULL(3));
> -	COPY_REG(rdi, BIT_ULL(7));
> -	COPY_REG(rsi, BIT_ULL(6));
> -	COPY_REG(r8, BIT_ULL(8));
> -	COPY_REG(r9, BIT_ULL(9));
> -	COPY_REG(rdx, BIT_ULL(2));
> +	vcpu->run->system_event.ndata = cnt;
>  
>  	/*
>  	 * Set the status code according to GHCI spec, although the vCPU may
> @@ -1305,18 +1245,18 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
>  
>  static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>  	u32 eax, ebx, ecx, edx;
>  
> -	/* EAX and ECX for cpuid is stored in R12 and R13. */
> -	eax = tdvmcall_a0_read(vcpu);
> -	ecx = tdvmcall_a1_read(vcpu);
> +	eax = tdx->vp_enter_out.cpuid_leaf;
> +	ecx = tdx->vp_enter_out.cpuid_subleaf;
>  
>  	kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);
>  
> -	tdvmcall_a0_write(vcpu, eax);
> -	tdvmcall_a1_write(vcpu, ebx);
> -	tdvmcall_a2_write(vcpu, ecx);
> -	tdvmcall_a3_write(vcpu, edx);
> +	tdx->vp_enter_in.eax = eax;
> +	tdx->vp_enter_in.ebx = ebx;
> +	tdx->vp_enter_in.ecx = ecx;
> +	tdx->vp_enter_in.edx = edx;
>  
>  	tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
>  
> @@ -1356,6 +1296,7 @@ static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
>  static int tdx_emulate_io(struct kvm_vcpu *vcpu)
>  {
>  	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>  	unsigned long val = 0;
>  	unsigned int port;
>  	int size, ret;
> @@ -1363,9 +1304,9 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
>  
>  	++vcpu->stat.io_exits;
>  
> -	size = tdvmcall_a0_read(vcpu);
> -	write = tdvmcall_a1_read(vcpu);
> -	port = tdvmcall_a2_read(vcpu);
> +	size  = tdx->vp_enter_out.io_size;
> +	write = tdx->vp_enter_out.io_direction == TDX_WRITE;
> +	port  = tdx->vp_enter_out.io_port;
>  
>  	if (size != 1 && size != 2 && size != 4) {
>  		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> @@ -1373,7 +1314,7 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
>  	}
>  
>  	if (write) {
> -		val = tdvmcall_a3_read(vcpu);
> +		val = tdx->vp_enter_out.io_value;
>  		ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
>  	} else {
>  		ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
> @@ -1443,14 +1384,15 @@ static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
>  
>  static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>  	int size, write, r;
>  	unsigned long val;
>  	gpa_t gpa;
>  
> -	size = tdvmcall_a0_read(vcpu);
> -	write = tdvmcall_a1_read(vcpu);
> -	gpa = tdvmcall_a2_read(vcpu);
> -	val = write ? tdvmcall_a3_read(vcpu) : 0;
> +	size  = tdx->vp_enter_out.mmio_size;
> +	write = tdx->vp_enter_out.mmio_direction == TDX_WRITE;
> +	gpa   = tdx->vp_enter_out.mmio_addr;
> +	val = write ? tdx->vp_enter_out.mmio_value : 0;
>  
>  	if (size != 1 && size != 2 && size != 4 && size != 8)
>  		goto error;
> @@ -1502,7 +1444,7 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
>  
>  static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
>  {
> -	u32 index = tdvmcall_a0_read(vcpu);
> +	u32 index = to_tdx(vcpu)->vp_enter_out.msr;
>  	u64 data;
>  
>  	if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ) ||
> @@ -1520,8 +1462,8 @@ static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
>  
>  static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
>  {
> -	u32 index = tdvmcall_a0_read(vcpu);
> -	u64 data = tdvmcall_a1_read(vcpu);
> +	u32 index = to_tdx(vcpu)->vp_enter_out.msr;
> +	u64 data  = to_tdx(vcpu)->vp_enter_out.write_value;
>  
>  	if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE) ||
>  	    kvm_set_msr(vcpu, index, data)) {
> @@ -1537,39 +1479,41 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
>  
>  static int tdx_get_td_vm_call_info(struct kvm_vcpu *vcpu)
>  {
> -	if (tdvmcall_a0_read(vcpu))
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	if (tdx->vp_enter_out.leaf) {
>  		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> -	else {
> +	} else {
>  		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
> -		kvm_r11_write(vcpu, 0);
> -		tdvmcall_a0_write(vcpu, 0);
> -		tdvmcall_a1_write(vcpu, 0);
> -		tdvmcall_a2_write(vcpu, 0);
> +		tdx->vp_enter_in.gettdvmcallinfo[0] = 0;
> +		tdx->vp_enter_in.gettdvmcallinfo[1] = 0;
> +		tdx->vp_enter_in.gettdvmcallinfo[2] = 0;
> +		tdx->vp_enter_in.gettdvmcallinfo[3] = 0;
>  	}
>  	return 1;
>  }
>  
>  static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>  {
> -	if (tdvmcall_exit_type(vcpu))
> +	if (tdvmcall_fn(vcpu))
>  		return tdx_emulate_vmcall(vcpu);
>  
> -	switch (tdvmcall_leaf(vcpu)) {
> +	switch (tdvmcall_subfn(vcpu)) {
>  	case TDVMCALL_MAP_GPA:
>  		return tdx_map_gpa(vcpu);
>  	case TDVMCALL_REPORT_FATAL_ERROR:
>  		return tdx_report_fatal_error(vcpu);
> -	case EXIT_REASON_CPUID:
> +	case TDVMCALL_CPUID:
>  		return tdx_emulate_cpuid(vcpu);
> -	case EXIT_REASON_HLT:
> +	case TDVMCALL_HLT:
>  		return tdx_emulate_hlt(vcpu);
> -	case EXIT_REASON_IO_INSTRUCTION:
> +	case TDVMCALL_IO:
>  		return tdx_emulate_io(vcpu);
> -	case EXIT_REASON_EPT_VIOLATION:
> +	case TDVMCALL_MMIO:
>  		return tdx_emulate_mmio(vcpu);
> -	case EXIT_REASON_MSR_READ:
> +	case TDVMCALL_RDMSR:
>  		return tdx_emulate_rdmsr(vcpu);
> -	case EXIT_REASON_MSR_WRITE:
> +	case TDVMCALL_WRMSR:
>  		return tdx_emulate_wrmsr(vcpu);
>  	case TDVMCALL_GET_TD_VM_CALL_INFO:
>  		return tdx_get_td_vm_call_info(vcpu);
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 008180c0c30f..63d8b3359b10 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -69,6 +69,8 @@ struct vcpu_tdx {
>  	struct list_head cpu_list;
>  
>  	u64 vp_enter_ret;
> +	struct tdh_vp_enter_in vp_enter_in;
> +	struct tdh_vp_enter_out vp_enter_out;
>  
>  	enum vcpu_tdx_state state;
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 16e0b598c4ec..895d9ea4aeba 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -33,6 +33,7 @@
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/cpufeature.h>
> +#include <asm/vmx.h>
>  #include <asm/tdx.h>
>  #include <asm/cpu_device_id.h>
>  #include <asm/processor.h>
> @@ -1600,11 +1601,122 @@ static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
>  	return ret;
>  }
>  
> -noinstr u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
> +noinstr u64 tdh_vp_enter(u64 tdvpr, const struct tdh_vp_enter_in *in, struct tdh_vp_enter_out *out)
>  {
> -	args->rcx = tdvpr;
> +	struct tdx_module_args args = {
> +		.rcx = tdvpr,
> +		.r10 = in->ret_code,
> +	};
> +	u64 ret;
>  
> -	return __seamcall_saved_ret(TDH_VP_ENTER, args);
> +	/* If previous exit was TDG.VP.VMCALL */
> +	switch (out->subfn) {
> +	case TDVMCALL_GET_TD_VM_CALL_INFO:
> +		args.r11 = in->gettdvmcallinfo[0];
> +		args.r12 = in->gettdvmcallinfo[1];
> +		args.r13 = in->gettdvmcallinfo[2];
> +		args.r14 = in->gettdvmcallinfo[3];
> +		break;
> +	case TDVMCALL_MAP_GPA:
> +		args.r11 = in->failed_gpa;
> +		break;
> +	case TDVMCALL_CPUID:
> +		args.r12 = in->eax;
> +		args.r13 = in->ebx;
> +		args.r14 = in->ecx;
> +		args.r15 = in->edx;
> +		break;
> +	case TDVMCALL_IO:
> +	case TDVMCALL_RDMSR:
> +	case TDVMCALL_MMIO:
> +		args.r11 = in->value_read;
> +		break;
> +	case TDVMCALL_NONE:
> +	case TDVMCALL_GET_QUOTE:
> +	case TDVMCALL_REPORT_FATAL_ERROR:
> +	case TDVMCALL_HLT:
> +	case TDVMCALL_WRMSR:
> +		break;
> +	}
> +
> +	ret = __seamcall_saved_ret(TDH_VP_ENTER, &args);
> +
> +	if ((u16)ret == EXIT_REASON_TDCALL) {
> +		out->reg_mask		= args.rcx;
> +		out->fn = args.r10;
> +		if (out->fn) {
> +			out->nr		= args.r10;
> +			out->p1		= args.r11;
> +			out->p2		= args.r12;
> +			out->p3		= args.r13;
> +			out->p4		= args.r14;
> +			out->subfn	= TDVMCALL_NONE;
> +		} else {
> +			out->subfn	= args.r11;
> +		}
> +	} else {
> +		out->exit_qual		= args.rcx;
> +		out->ext_exit_qual	= args.rdx;
> +		out->gpa		= args.r8;
> +		out->intr_info		= args.r9;
> +		out->subfn		= TDVMCALL_NONE;
> +	}
> +
> +	switch (out->subfn) {
> +	case TDVMCALL_GET_TD_VM_CALL_INFO:
> +		out->leaf		= args.r12;
> +		break;
> +	case TDVMCALL_MAP_GPA:
> +		out->map_gpa		= args.r12;
> +		out->map_gpa_size	= args.r13;
> +		break;
> +	case TDVMCALL_CPUID:
> +		out->cpuid_leaf		= args.r12;
> +		out->cpuid_subleaf	= args.r13;
> +		break;
> +	case TDVMCALL_IO:
> +		out->io_size		= args.r12;
> +		out->io_direction	= args.r13 ? TDX_WRITE : TDX_READ;
> +		out->io_port		= args.r14;
> +		out->io_value		= args.r15;
> +		break;
> +	case TDVMCALL_RDMSR:
> +		out->msr		= args.r12;
> +		break;
> +	case TDVMCALL_MMIO:
> +		out->mmio_size		= args.r12;
> +		out->mmio_direction	= args.r13 ? TDX_WRITE : TDX_READ;
> +		out->mmio_addr		= args.r14;
> +		out->mmio_value		= args.r15;
> +		break;
> +	case TDVMCALL_NONE:
> +		break;
> +	case TDVMCALL_GET_QUOTE:
> +		out->shared_gpa		= args.r12;
> +		out->shared_gpa_size	= args.r13;
> +		break;
> +	case TDVMCALL_REPORT_FATAL_ERROR:
> +		out->err_codes		= args.r12;
> +		out->err_data_gpa	= args.r13;
> +		out->err_data[0]	= args.r14;
> +		out->err_data[1]	= args.r15;
> +		out->err_data[2]	= args.rbx;
> +		out->err_data[3]	= args.rdi;
> +		out->err_data[4]	= args.rsi;
> +		out->err_data[5]	= args.r8;
> +		out->err_data[6]	= args.r9;
> +		out->err_data[7]	= args.rdx;
> +		break;
> +	case TDVMCALL_HLT:
> +		out->intr_blocked_flag	= args.r12;
> +		break;
> +	case TDVMCALL_WRMSR:
> +		out->msr		= args.r12;
> +		out->write_value	= args.r13;
> +		break;
> +	}
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(tdh_vp_enter);
>  


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-12-11 18:43             ` Adrian Hunter
  2024-12-13 15:45               ` Adrian Hunter
@ 2024-12-13 16:16               ` Dave Hansen
  2024-12-13 16:30                 ` Adrian Hunter
  1 sibling, 1 reply; 82+ messages in thread
From: Dave Hansen @ 2024-12-13 16:16 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 12/11/24 10:43, Adrian Hunter wrote:
...
> -	size = tdvmcall_a0_read(vcpu);
> -	write = tdvmcall_a1_read(vcpu);
> -	port = tdvmcall_a2_read(vcpu);
> +	size  = tdx->vp_enter_out.io_size;
> +	write = tdx->vp_enter_out.io_direction == TDX_WRITE;
> +	port  = tdx->vp_enter_out.io_port;
...> +	case TDVMCALL_IO:
> +		out->io_size		= args.r12;
> +		out->io_direction	= args.r13 ? TDX_WRITE : TDX_READ;
> +		out->io_port		= args.r14;
> +		out->io_value		= args.r15;
> +		break;

I honestly don't understand the need for the abstracted structure to sit
in the middle. It doesn't get stored or serialized or anything, right?
So why have _another_ structure?

Why can't this just be (for instance):

	size = tdx->foo.r12;

?

Basically, you hand around the raw arguments until you need to use them.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-12-13 16:16               ` Dave Hansen
@ 2024-12-13 16:30                 ` Adrian Hunter
  2024-12-13 16:44                   ` Dave Hansen
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-13 16:30 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 13/12/24 18:16, Dave Hansen wrote:
> On 12/11/24 10:43, Adrian Hunter wrote:
> ...
>> -	size = tdvmcall_a0_read(vcpu);
>> -	write = tdvmcall_a1_read(vcpu);
>> -	port = tdvmcall_a2_read(vcpu);
>> +	size  = tdx->vp_enter_out.io_size;
>> +	write = tdx->vp_enter_out.io_direction == TDX_WRITE;
>> +	port  = tdx->vp_enter_out.io_port;
> ...> +	case TDVMCALL_IO:
>> +		out->io_size		= args.r12;
>> +		out->io_direction	= args.r13 ? TDX_WRITE : TDX_READ;
>> +		out->io_port		= args.r14;
>> +		out->io_value		= args.r15;
>> +		break;
> 
> I honestly don't understand the need for the abstracted structure to sit
> in the middle. It doesn't get stored or serialized or anything, right?
> So why have _another_ structure?
> 
> Why can't this just be (for instance):
> 
> 	size = tdx->foo.r12;
> 
> ?
> 
> Basically, you hand around the raw arguments until you need to use them.

That sounds like what we have at present?  That is:

u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
{
	args->rcx = tdvpr;

	return __seamcall_saved_ret(TDH_VP_ENTER, args);
}

And then either add Rick's struct tdx_vp?  Like so:

u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args)
{
	args->rcx = tdx_tdvpr_pa(vp);

	return __seamcall_saved_ret(TDH_VP_ENTER, args);
}

Or leave it to the caller:

u64 tdh_vp_enter(struct tdx_module_args *args)
{
	return __seamcall_saved_ret(TDH_VP_ENTER, args);
}

Or forget the wrapper altogether, and let KVM call
__seamcall_saved_ret() ?


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
  2024-12-13 16:30                 ` Adrian Hunter
@ 2024-12-13 16:44                   ` Dave Hansen
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Hansen @ 2024-12-13 16:44 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc, kvm, dave.hansen
  Cc: rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, x86, yan.y.zhao, chao.gao, weijiang.yang

On 12/13/24 08:30, Adrian Hunter wrote:
> On 13/12/24 18:16, Dave Hansen wrote:
>> On 12/11/24 10:43, Adrian Hunter wrote:
>> ...
>>> -	size = tdvmcall_a0_read(vcpu);
>>> -	write = tdvmcall_a1_read(vcpu);
>>> -	port = tdvmcall_a2_read(vcpu);
>>> +	size  = tdx->vp_enter_out.io_size;
>>> +	write = tdx->vp_enter_out.io_direction == TDX_WRITE;
>>> +	port  = tdx->vp_enter_out.io_port;
>> ...> +	case TDVMCALL_IO:
>>> +		out->io_size		= args.r12;
>>> +		out->io_direction	= args.r13 ? TDX_WRITE : TDX_READ;
>>> +		out->io_port		= args.r14;
>>> +		out->io_value		= args.r15;
>>> +		break;
>>
>> I honestly don't understand the need for the abstracted structure to sit
>> in the middle. It doesn't get stored or serialized or anything, right?
>> So why have _another_ structure?
>>
>> Why can't this just be (for instance):
>>
>> 	size = tdx->foo.r12;
>>
>> ?
>>
>> Basically, you hand around the raw arguments until you need to use them.
> 
> That sounds like what we have at present?  That is:
> 
> u64 tdh_vp_enter(u64 tdvpr, struct tdx_module_args *args)
> {
> 	args->rcx = tdvpr;
> 
> 	return __seamcall_saved_ret(TDH_VP_ENTER, args);
> }
> 
> And then either add Rick's struct tdx_vp?  Like so:
> 
> u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args)
> {
> 	args->rcx = tdx_tdvpr_pa(vp);
> 
> 	return __seamcall_saved_ret(TDH_VP_ENTER, args);
> }
> 
> Or leave it to the caller:
> 
> u64 tdh_vp_enter(struct tdx_module_args *args)
> {
> 	return __seamcall_saved_ret(TDH_VP_ENTER, args);
> }
> 
> Or forget the wrapper altogether, and let KVM call
> __seamcall_saved_ret() ?

Rick's version, please.

I don't want __seamcall_saved_ret() exported to modules. I want to at
least have a clean boundary beyond which __seamcall_saved_ret() is not
exposed.

My nit with the "u64 tdvpr" version was that there's zero type safety.

The tdvp-less tdh_vp_enter() is even *less* safe of a calling convention
and also requires that each caller do tdx_tdvpr_pa() or equivalent.

But I feel like I'm repeating myself a bit at this point.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-11-25 11:10     ` Adrian Hunter
  2024-11-26  2:20       ` Chao Gao
@ 2024-12-17 16:09       ` Sean Christopherson
  2024-12-20 15:22         ` Adrian Hunter
  1 sibling, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2024-12-17 16:09 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On Mon, Nov 25, 2024, Adrian Hunter wrote:
> On 22/11/24 07:49, Chao Gao wrote:
> >> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >> +
> >> +	if (static_cpu_has(X86_FEATURE_XSAVE) &&
> >> +	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> >> +		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
> >> +	if (static_cpu_has(X86_FEATURE_XSAVES) &&
> >> +	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
> >> +	    kvm_host.xss != (kvm_tdx->xfam &
> >> +			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
> >> +			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
> > 
> > Should we drop CET/PT from this series? I think they are worth a new
> > patch/series.
> 
> This is not really about CET/PT
> 
> What is happening here is that we are calculating the current
> MSR_IA32_XSS value based on the TDX Module spec which says the
> TDX Module sets MSR_IA32_XSS to the XSS bits from XFAM.  The
> TDX Module does that literally, from TDX Module source code:
> 
> 	#define XCR0_SUPERVISOR_BIT_MASK            0x0001FD00
> and
> 	ia32_wrmsr(IA32_XSS_MSR_ADDR, xfam & XCR0_SUPERVISOR_BIT_MASK);
> 
> For KVM, rather than:
> 
> 			kvm_tdx->xfam &
> 			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
> 			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
> 
> it would be more direct to define the bits and enforce them
> via tdx_get_supported_xfam() e.g.
> 
> /* 
>  * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>  *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9)
>  *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
>  */
> #define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9))
> #define TDX_XFAM_XSS_MASK	(GENMASK(16, 10) | BIT(8))
> #define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
> 
> static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
> {
> 	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
> 
> 	/* Ensure features are in the masks */
> 	val &= TDX_XFAM_MASK;
> 
> 	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
> 		return 0;
> 
> 	val &= td_conf->xfam_fixed0;
> 
> 	return val;
> }
> 
> and then:
> 
> 	if (static_cpu_has(X86_FEATURE_XSAVE) &&
> 	    kvm_host.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK))
> 		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
> 	if (static_cpu_has(X86_FEATURE_XSAVES) &&
> 	    kvm_host.xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK))
> 		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
> 
> > 
> >> +		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
> >> +	if (static_cpu_has(X86_FEATURE_PKU) &&
> > 
> > How about using cpu_feature_enabled()? It is used in kvm_load_host_xsave_state()
> > It handles the case where CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not
> > enabled.

I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss}
to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to
returning from VP.ENTER.  I don't see any justificaton for maintaining a special
flow for TDX, it's just more work.  E.g. TDX is missing the optimization to elide
WRPKRU if the current value is the same as the host value.

KVM already disallows emulating a WRMSR to XSS via the tdx_has_emulated_msr()
check, and AFAICT there's no path for the guest to set KVM's view of XCR0, CR0,
or CR4, so I'm pretty sure stuffing state at vCPU creation is all that's needed.

That said, out of paranoia, KVM should disallow the guest from changing XSS if
guest state is protected, i.e. in common code, as XSS is a mandatory passthrough
for SEV-ES+, i.e. XSS is fully guest-owned for both TDX and ES+.

Ditto for CR0 and CR4 (TDX only; SEV-ES+ lets the host see the guest values).
The current TDX code lets KVM read CR0 and CR4, but KVM will always see the RESET
values, which are completely wrong for TDX.  KVM can obviously "work" without a
sane view of guest CR0/CR4, but I think having a sane view will yield code that
is easier to maintain and understand, because almost all special casing will be
in TDX's initialization flow, not spread out wherever KVM needs to know that what
KVM sees in guest state is a lie.

The guest_state_protected check in kvm_load_host_xsave_state() needs to be moved
to svm_vcpu_run(), but IMO that's where the checks belong anyways, because not
restoring host state for protected guests is obviously specific to SEV-ES+ guests,
not to all protected guests.

Side topic, tdx_cache_reg() is ridiculous.  Just mark the "cached" registers as
available on exit.  Invoking a callback just to do nothing is a complete waste.
I'm also not convinced letting KVM read garbage for RIP, RSP, CR3, or PDPTRs is
at all reasonable.  CR3 and PDPTRs should be unreachable, and I gotta imagine the
same holds true for RSP.  Allow reads/writes to RIP is fine, in that it probably
simplifies the overall code.

Something like this (probably won't apply, I have other local hacks as the result
of suggestions).

---
 arch/x86/kvm/svm/svm.c     |  7 ++++--
 arch/x86/kvm/vmx/main.c    |  4 +--
 arch/x86/kvm/vmx/tdx.c     | 50 ++++++++++----------------------------
 arch/x86/kvm/vmx/x86_ops.h |  4 ---
 arch/x86/kvm/x86.c         | 15 +++++++-----
 5 files changed, 28 insertions(+), 52 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dd15cc635655..63df43e5dcce 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4251,7 +4251,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 		svm_set_dr6(svm, DR6_ACTIVE_LOW);
 
 	clgi();
-	kvm_load_guest_xsave_state(vcpu);
+
+	if (!vcpu->arch.guest_state_protected)
+		kvm_load_guest_xsave_state(vcpu);
 
 	kvm_wait_lapic_expire(vcpu);
 
@@ -4280,7 +4282,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
 		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
 
-	kvm_load_host_xsave_state(vcpu);
+	if (!vcpu->arch.guest_state_protected)
+		kvm_load_host_xsave_state(vcpu);
 	stgi();
 
 	/* Any pending NMI will happen here */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2742f2af7f55..d2e78e6675b9 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -520,10 +520,8 @@ static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 
 static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 {
-	if (is_td_vcpu(vcpu)) {
-		tdx_cache_reg(vcpu, reg);
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
 		return;
-	}
 
 	vmx_cache_reg(vcpu, reg);
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7eff717c9d0d..b49dcf32206b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.cr0_guest_owned_bits = -1ul;
 	vcpu->arch.cr4_guest_owned_bits = -1ul;
 
+	vcpu->arch.cr4 = <maximal value>;
+	vcpu->arch.cr0 = <maximal value, give or take>;
+
 	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
 	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
 	/*
@@ -659,6 +662,14 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
 
+	/*
+	 * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
+	 * maximal values supported by the guest, so from KVM's perspective,
+	 * those are the guest's values at all times.
+	 */
+	vcpu->arch.ia32_xss = (kvm_tdx->xfam & XFEATURE_SUPERVISOR_MASK);
+	vcpu->arch.xcr0 = (kvm_tdx->xfam & XFEATURE_USE_MASK);
+
 	return 0;
 }
 
@@ -824,24 +835,6 @@ static void tdx_user_return_msr_update_cache(struct kvm_vcpu *vcpu)
 		kvm_user_return_msr_update_cache(tdx_uret_tsx_ctrl_slot, 0);
 }
 
-static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
-{
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
-
-	if (static_cpu_has(X86_FEATURE_XSAVE) &&
-	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
-		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
-	if (static_cpu_has(X86_FEATURE_XSAVES) &&
-	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
-	    kvm_host.xss != (kvm_tdx->xfam &
-			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
-			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
-		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
-	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
-		write_pkru(vcpu->arch.host_pkru);
-}
-
 static union vmx_exit_reason tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -941,10 +934,10 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	tdx_vcpu_enter_exit(vcpu);
 
 	tdx_user_return_msr_update_cache(vcpu);
-	tdx_restore_host_xsave_state(vcpu);
+	kvm_load_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
-	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
+	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
 
 	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
 		return EXIT_FASTPATH_NONE;
@@ -1963,23 +1956,6 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	}
 }
 
-void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
-{
-	kvm_register_mark_available(vcpu, reg);
-	switch (reg) {
-	case VCPU_REGS_RSP:
-	case VCPU_REGS_RIP:
-	case VCPU_EXREG_PDPTR:
-	case VCPU_EXREG_CR0:
-	case VCPU_EXREG_CR3:
-	case VCPU_EXREG_CR4:
-		break;
-	default:
-		KVM_BUG_ON(1, vcpu->kvm);
-		break;
-	}
-}
-
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ef60eb7b1245..efa6723837c6 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -145,8 +145,6 @@ bool tdx_has_emulated_msr(u32 index);
 int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
-void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
-
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
 int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
@@ -193,8 +191,6 @@ static inline bool tdx_has_emulated_msr(u32 index) { return false; }
 static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 
-static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
-
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
 static inline int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f94b1e24eae..d380837433c6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1184,11 +1184,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
 
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->arch.guest_state_protected)
-		return;
+	WARN_ON_ONCE(vcpu->arch.guest_state_protected);
 
 	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
-
 		if (vcpu->arch.xcr0 != kvm_host.xcr0)
 			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 
@@ -1207,9 +1205,6 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
 
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->arch.guest_state_protected)
-		return;
-
 	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
 	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
 	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
@@ -3943,6 +3938,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
 			return 1;
+
+		if (vcpu->arch.guest_state_protected)
+			return 1;
+
 		/*
 		 * KVM supports exposing PT to the guest, but does not support
 		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
@@ -4402,6 +4401,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
 			return 1;
+
+		if (vcpu->arch.guest_state_protected)
+			return 1;
+
 		msr_info->data = vcpu->arch.ia32_xss;
 		break;
 	case MSR_K7_CLK_CTL:

base-commit: 0319082fc23089f516618e193d94da18c837e35a
-- 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-12-17 16:09       ` Sean Christopherson
@ 2024-12-20 15:22         ` Adrian Hunter
  2024-12-20 16:22           ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2024-12-20 15:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 17/12/24 18:09, Sean Christopherson wrote:
> On Mon, Nov 25, 2024, Adrian Hunter wrote:
> I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss}
> to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to
> returning from VP.ENTER.  I don't see any justificaton for maintaining a special
> flow for TDX, it's just more work.  E.g. TDX is missing the optimization to elide
> WRPKRU if the current value is the same as the host value.

Not entirely missing since write_pkru() does do that by itself:

static inline void write_pkru(u32 pkru)
{
	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
		return;
	/*
	 * WRPKRU is relatively expensive compared to RDPKRU.
	 * Avoid WRPKRU when it would not change the value.
	 */
	if (pkru != rdpkru())
		wrpkru(pkru);
}

For TDX, we don't really need rdpkru() since the TDX Module
clears it, so it could be:

	if (pkru)
		wrpkru(pkru);

> 
> KVM already disallows emulating a WRMSR to XSS via the tdx_has_emulated_msr()
> check, and AFAICT there's no path for the guest to set KVM's view of XCR0, CR0,
> or CR4, so I'm pretty sure stuffing state at vCPU creation is all that's needed.
> 
> That said, out of paranoia, KVM should disallow the guest from changing XSS if
> guest state is protected, i.e. in common code, as XSS is a mandatory passthrough
> for SEV-ES+, i.e. XSS is fully guest-owned for both TDX and ES+.
> 
> Ditto for CR0 and CR4 (TDX only; SEV-ES+ lets the host see the guest values).
> The current TDX code lets KVM read CR0 and CR4, but KVM will always see the RESET
> values, which are completely wrong for TDX.  KVM can obviously "work" without a
> sane view of guest CR0/CR4, but I think having a sane view will yield code that
> is easier to maintain and understand, because almost all special casing will be
> in TDX's initialization flow, not spread out wherever KVM needs to know that what
> KVM sees in guest state is a lie.
> 
> The guest_state_protected check in kvm_load_host_xsave_state() needs to be moved
> to svm_vcpu_run(), but IMO that's where the checks belong anyways, because not
> restoring host state for protected guests is obviously specific to SEV-ES+ guests,
> not to all protected guests.
> 
> Side topic, tdx_cache_reg() is ridiculous.  Just mark the "cached" registers as
> available on exit.  Invoking a callback just to do nothing is a complete waste.
> I'm also not convinced letting KVM read garbage for RIP, RSP, CR3, or PDPTRs is
> at all reasonable.  CR3 and PDPTRs should be unreachable, and I gotta imagine the
> same holds true for RSP.  Allow reads/writes to RIP is fine, in that it probably
> simplifies the overall code.
> 
> Something like this (probably won't apply, I have other local hacks as the result
> of suggestions).
> 
> ---
>  arch/x86/kvm/svm/svm.c     |  7 ++++--
>  arch/x86/kvm/vmx/main.c    |  4 +--
>  arch/x86/kvm/vmx/tdx.c     | 50 ++++++++++----------------------------
>  arch/x86/kvm/vmx/x86_ops.h |  4 ---
>  arch/x86/kvm/x86.c         | 15 +++++++-----
>  5 files changed, 28 insertions(+), 52 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index dd15cc635655..63df43e5dcce 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4251,7 +4251,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>  		svm_set_dr6(svm, DR6_ACTIVE_LOW);
>  
>  	clgi();
> -	kvm_load_guest_xsave_state(vcpu);
> +
> +	if (!vcpu->arch.guest_state_protected)
> +		kvm_load_guest_xsave_state(vcpu);
>  
>  	kvm_wait_lapic_expire(vcpu);
>  
> @@ -4280,7 +4282,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>  	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
>  		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
>  
> -	kvm_load_host_xsave_state(vcpu);
> +	if (!vcpu->arch.guest_state_protected)
> +		kvm_load_host_xsave_state(vcpu);
>  	stgi();
>  
>  	/* Any pending NMI will happen here */
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 2742f2af7f55..d2e78e6675b9 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -520,10 +520,8 @@ static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
>  
>  static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>  {
> -	if (is_td_vcpu(vcpu)) {
> -		tdx_cache_reg(vcpu, reg);
> +	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
>  		return;
> -	}
>  
>  	vmx_cache_reg(vcpu, reg);
>  }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 7eff717c9d0d..b49dcf32206b 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.cr0_guest_owned_bits = -1ul;
>  	vcpu->arch.cr4_guest_owned_bits = -1ul;
>  
> +	vcpu->arch.cr4 = <maximal value>;

Sorry for slow reply.  Seems fine except maybe CR4 usage.

TDX Module validates CR4 based on XFAM and scrubs host state
based on XFAM.  It seems like we would need to use XFAM to
manufacture a CR4 that we then effectively use as a proxy
instead of just checking XFAM.

Since only some vcpu->arch.cr4 bits will be meaningful, it also
still leaves the possibility for confusion.

Are you sure you want this?

> +	vcpu->arch.cr0 = <maximal value, give or take>;

AFAICT we don't need to care about CR0

> +
>  	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
>  	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
>  	/*
> @@ -659,6 +662,14 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  
>  	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
>  
> +	/*
> +	 * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
> +	 * maximal values supported by the guest, so from KVM's perspective,
> +	 * those are the guest's values at all times.
> +	 */
> +	vcpu->arch.ia32_xss = (kvm_tdx->xfam & XFEATURE_SUPERVISOR_MASK);
> +	vcpu->arch.xcr0 = (kvm_tdx->xfam & XFEATURE_USE_MASK);
> +
>  	return 0;
>  }
>  
> @@ -824,24 +835,6 @@ static void tdx_user_return_msr_update_cache(struct kvm_vcpu *vcpu)
>  		kvm_user_return_msr_update_cache(tdx_uret_tsx_ctrl_slot, 0);
>  }
>  
> -static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> -{
> -	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> -
> -	if (static_cpu_has(X86_FEATURE_XSAVE) &&
> -	    kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> -		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
> -	if (static_cpu_has(X86_FEATURE_XSAVES) &&
> -	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
> -	    kvm_host.xss != (kvm_tdx->xfam &
> -			 (kvm_caps.supported_xss | XFEATURE_MASK_PT |
> -			  XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)))
> -		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
> -	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
> -		write_pkru(vcpu->arch.host_pkru);
> -}
> -
>  static union vmx_exit_reason tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -941,10 +934,10 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  	tdx_vcpu_enter_exit(vcpu);
>  
>  	tdx_user_return_msr_update_cache(vcpu);
> -	tdx_restore_host_xsave_state(vcpu);
> +	kvm_load_host_xsave_state(vcpu);
>  	tdx->host_state_need_restore = true;
>  
> -	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> +	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
>  
>  	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
>  		return EXIT_FASTPATH_NONE;
> @@ -1963,23 +1956,6 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
>  	}
>  }
>  
> -void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> -{
> -	kvm_register_mark_available(vcpu, reg);
> -	switch (reg) {
> -	case VCPU_REGS_RSP:
> -	case VCPU_REGS_RIP:
> -	case VCPU_EXREG_PDPTR:
> -	case VCPU_EXREG_CR0:
> -	case VCPU_EXREG_CR3:
> -	case VCPU_EXREG_CR4:
> -		break;
> -	default:
> -		KVM_BUG_ON(1, vcpu->kvm);
> -		break;
> -	}
> -}
> -
>  static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
>  {
>  	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index ef60eb7b1245..efa6723837c6 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -145,8 +145,6 @@ bool tdx_has_emulated_msr(u32 index);
>  int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
>  int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
>  
> -void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
> -
>  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>  
>  int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> @@ -193,8 +191,6 @@ static inline bool tdx_has_emulated_msr(u32 index) { return false; }
>  static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
>  static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
>  
> -static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
> -
>  static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>  
>  static inline int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4f94b1e24eae..d380837433c6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1184,11 +1184,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
>  
>  void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>  {
> -	if (vcpu->arch.guest_state_protected)
> -		return;
> +	WARN_ON_ONCE(vcpu->arch.guest_state_protected);
>  
>  	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> -
>  		if (vcpu->arch.xcr0 != kvm_host.xcr0)
>  			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>  
> @@ -1207,9 +1205,6 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>  
>  void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>  {
> -	if (vcpu->arch.guest_state_protected)
> -		return;
> -
>  	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
>  	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
> @@ -3943,6 +3938,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>  			return 1;
> +
> +		if (vcpu->arch.guest_state_protected)
> +			return 1;
> +
>  		/*
>  		 * KVM supports exposing PT to the guest, but does not support
>  		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
> @@ -4402,6 +4401,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>  			return 1;
> +
> +		if (vcpu->arch.guest_state_protected)
> +			return 1;
> +
>  		msr_info->data = vcpu->arch.ia32_xss;
>  		break;
>  	case MSR_K7_CLK_CTL:
> 
> base-commit: 0319082fc23089f516618e193d94da18c837e35a


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-12-20 15:22         ` Adrian Hunter
@ 2024-12-20 16:22           ` Sean Christopherson
  2024-12-20 21:24             ` PKEY syscall number for selftest? (was: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD) Sean Christopherson
  2025-01-03 18:16             ` [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
  0 siblings, 2 replies; 82+ messages in thread
From: Sean Christopherson @ 2024-12-20 16:22 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On Fri, Dec 20, 2024, Adrian Hunter wrote:
> On 17/12/24 18:09, Sean Christopherson wrote:
> > On Mon, Nov 25, 2024, Adrian Hunter wrote:
> > I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss}
> > to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to
> > returning from VP.ENTER.  I don't see any justificaton for maintaining a special
> > flow for TDX, it's just more work.  E.g. TDX is missing the optimization to elide
> > WRPKRU if the current value is the same as the host value.
> 
> Not entirely missing since write_pkru() does do that by itself:
> 
> static inline void write_pkru(u32 pkru)
> {
> 	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
> 		return;
> 	/*
> 	 * WRPKRU is relatively expensive compared to RDPKRU.
> 	 * Avoid WRPKRU when it would not change the value.
> 	 */
> 	if (pkru != rdpkru())
> 		wrpkru(pkru);

Argh.  Well that's a bug.  KVM did the right thing, and then the core kernel
swizzled things around and got in the way.  I'll post this once it's fully tested:

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 20 Dec 2024 07:38:39 -0800
Subject: [PATCH] KVM: x86: Avoid double RDPKRU when loading host/guest PKRU

Use the raw wrpkru() helper when loading the guest/host's PKRU on switch
to/from guest context, as the write_pkru() wrapper incurs an unnecessary
rdpkru().  In both paths, KVM is guaranteed to have performed RDPKRU since
the last possible write, i.e. KVM has a fresh cache of the current value
in hardware.

This effectively restores KVM behavior to that of KVM rior to commit
c806e88734b9 ("x86/pkeys: Provide *pkru() helpers"), which renamed the raw
helper from __write_pkru() => wrpkru(), and turned __write_pkru() into a
wrapper.  Commit 577ff465f5a6 ("x86/fpu: Only write PKRU if it is different
from current") then added the extra RDPKRU to avoid an unnecessary WRPKRU,
but completely missed that KVM already optimized away pointless writes.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 577ff465f5a6 ("x86/fpu: Only write PKRU if it is different from current")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4320647bd78a..9d5cece9260b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1186,7 +1186,7 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
 	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
 	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
-		write_pkru(vcpu->arch.pkru);
+		wrpkru(vcpu->arch.pkru);
 }
 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
 
@@ -1200,7 +1200,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
 		vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
-			write_pkru(vcpu->arch.host_pkru);
+			wrpkru(vcpu->arch.host_pkru);
 	}
 
 	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {

base-commit: 13e98294d7cec978e31138d16824f50556a62d17
-- 

> }
> 
> For TDX, we don't really need rdpkru() since the TDX Module
> clears it, so it could be:
> 
> 	if (pkru)
> 		wrpkru(pkru);

Ah, right, there's no need to do RDPKRU because KVM knows it's zero.  On top of
my previous suggestion:

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e23cd8231144..9e490fccf073 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -664,11 +664,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
        /*
         * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
-        * maximal values supported by the guest, so from KVM's perspective,
-        * those are the guest's values at all times.
+        * maximal values supported by the guest, and zeroes PKRU, so from
+        * KVM's perspective, those are the guest's values at all times.
         */
        vcpu->arch.ia32_xss = (kvm_tdx->xfam & XFEATURE_SUPERVISOR_MASK);
        vcpu->arch.xcr0 = (kvm_tdx->xfam & XFEATURE_USE_MASK);
+       vcpu->arch.pkru = 0;
 
        return 0;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d380837433c6..d2ea7db896ba 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1208,7 +1208,8 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
        if (cpu_feature_enabled(X86_FEATURE_PKU) &&
            ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
             kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
-               vcpu->arch.pkru = rdpkru();
+               if (!vcpu->arch.guest_state_protected)
+                       vcpu->arch.pkru = rdpkru();
                if (vcpu->arch.pkru != vcpu->arch.host_pkru)
                        write_pkru(vcpu->arch.host_pkru);
        }

> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 7eff717c9d0d..b49dcf32206b 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> >  	vcpu->arch.cr0_guest_owned_bits = -1ul;
> >  	vcpu->arch.cr4_guest_owned_bits = -1ul;
> >  
> > +	vcpu->arch.cr4 = <maximal value>;
> 
> Sorry for slow reply.  Seems fine except maybe CR4 usage.
> 
> TDX Module validates CR4 based on XFAM and scrubs host state
> based on XFAM.  It seems like we would need to use XFAM to
> manufacture a CR4 that we then effectively use as a proxy
> instead of just checking XFAM.

Yep.

> Since only some vcpu->arch.cr4 bits will be meaningful, it also
> still leaves the possibility for confusion.

IMO, it's less confusing having '0' for CR0 and CR4, while having accurate values
for other state.  And I'm far more worried about KVM wondering into a bad path
because CR0 and/or CR4 are completely wrong.  E.g. kvm_mmu.cpu_role would be
completely wrong at steady state, the CR4-based runtime CPUID updates would do
the wrong thing, and any helper that wraps kvm_is_cr{0,4}_bit_set() would likely
do the worst possible thing.

> Are you sure you want this?

Yeah, pretty sure.  It would be nice if the TDX Module exposed guest CR0/CR4 to
KVM, a la the traps SEV-ES+ uses, but I think the next best thing is to assume
the guest is using all features.

> > +	vcpu->arch.cr0 = <maximal value, give or take>;
> 
> AFAICT we don't need to care about CR0

Probably not, but having e.g. CR4.PAE/LA57=1 with CR0.PG/PE=0 will be quite
weird.

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* PKEY syscall number for selftest? (was: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD)
  2024-12-20 16:22           ` Sean Christopherson
@ 2024-12-20 21:24             ` Sean Christopherson
  2025-01-27 17:09               ` Sean Christopherson
  2025-01-03 18:16             ` [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
  1 sibling, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2024-12-20 21:24 UTC (permalink / raw)
  To: Dave Hansen; +Cc: kvm, linux-kernel, x86

Switching topics, dropped everyone else except the list.

On Fri, Dec 20, 2024, Sean Christopherson wrote:
>  arch/x86/kvm/x86.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4320647bd78a..9d5cece9260b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1186,7 +1186,7 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>  	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
>  	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
> -		write_pkru(vcpu->arch.pkru);
> +		wrpkru(vcpu->arch.pkru);
>  }
>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>  
> @@ -1200,7 +1200,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
>  		vcpu->arch.pkru = rdpkru();
>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
> -			write_pkru(vcpu->arch.host_pkru);
> +			wrpkru(vcpu->arch.host_pkru);
>  	}
>  
>  	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> 
> base-commit: 13e98294d7cec978e31138d16824f50556a62d17
> -- 

I tried to test this by running the mm/protection_keys selftest in a VM, but it
gives what are effectively false passes on x86-64 due to the selftest picking up
the generic syscall numbers, e.g. 289 for SYS_pkey_alloc, instead of the x86-64
numbers.

I was able to get the test to run by hacking tools/testing/selftests/mm/pkey-x86.h
to shove in the right numbers, but I can't imagine that's the intended behavior.

If I omit the #undefs from pkey-x86.h, it shows that the test is grabbing the
definitions from the generic usr/include/asm-generic/unistd.h header.

Am I doing something stupid?

Regardless of whether this is PEBKAC or working as intended, on x86, the test
should ideally assert that "ospke" support in /proc/cpuinfo is consistent with
the result of sys_pkey_alloc(), e.g. so that an failure to allocate a pkey on a
system that work is reported as an error, not a pass.

--
diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h
index ac91777c8917..ccc3552e6b77 100644
--- a/tools/testing/selftests/mm/pkey-x86.h
+++ b/tools/testing/selftests/mm/pkey-x86.h
@@ -3,6 +3,10 @@
 #ifndef _PKEYS_X86_H
 #define _PKEYS_X86_H
 
+#define __NR_pkey_mprotect     329
+#define __NR_pkey_alloc                330
+#define __NR_pkey_free         331
+
 #ifdef __i386__
 
 #define REG_IP_IDX             REG_EIP
--

Yields:

$ ARCH=x86_64 make protection_keys_64
gcc -Wall -I /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../..  -isystem /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../../usr/include -isystem /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../../tools/include/uapi -no-pie -D_GNU_SOURCE=  -m64 -mxsave  protection_keys.c vm_util.c thp_settings.c -lrt -lpthread -lm -lrt -ldl -o /home/sean/go/src/kernel.org/linux/tools/testing/selftests/mm/protection_keys_64
In file included from pkey-helpers.h:102:0,
                 from protection_keys.c:49:
pkey-x86.h:6:0: warning: "__NR_pkey_mprotect" redefined
 #define __NR_pkey_mprotect 329
 
In file included from protection_keys.c:45:0:
/home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:693:0: note: this is the location of the previous definition
 #define __NR_pkey_mprotect 288
 
In file included from pkey-helpers.h:102:0,
                 from protection_keys.c:49:
pkey-x86.h:7:0: warning: "__NR_pkey_alloc" redefined
 #define __NR_pkey_alloc  330
 
In file included from protection_keys.c:45:0:
/home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:695:0: note: this is the location of the previous definition
 #define __NR_pkey_alloc 289
 
In file included from pkey-helpers.h:102:0,
                 from protection_keys.c:49:
pkey-x86.h:8:0: warning: "__NR_pkey_free" redefined
 #define __NR_pkey_free  331
 
In file included from protection_keys.c:45:0:
/home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:697:0: note: this is the location of the previous definition
 #define __NR_pkey_free 290
 


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2024-12-20 16:22           ` Sean Christopherson
  2024-12-20 21:24             ` PKEY syscall number for selftest? (was: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD) Sean Christopherson
@ 2025-01-03 18:16             ` Adrian Hunter
  2025-01-09 19:11               ` Sean Christopherson
  1 sibling, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2025-01-03 18:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 20/12/24 18:22, Sean Christopherson wrote:
> On Fri, Dec 20, 2024, Adrian Hunter wrote:
>> On 17/12/24 18:09, Sean Christopherson wrote:
>>> On Mon, Nov 25, 2024, Adrian Hunter wrote:
>>> I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss}
>>> to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to
>>> returning from VP.ENTER.  I don't see any justificaton for maintaining a special
>>> flow for TDX, it's just more work.
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index 7eff717c9d0d..b49dcf32206b 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>>  	vcpu->arch.cr0_guest_owned_bits = -1ul;
>>>  	vcpu->arch.cr4_guest_owned_bits = -1ul;
>>>  
>>> +	vcpu->arch.cr4 = <maximal value>;
>> Sorry for slow reply.  Seems fine except maybe CR4 usage.
>>
>> TDX Module validates CR4 based on XFAM and scrubs host state
>> based on XFAM.  It seems like we would need to use XFAM to
>> manufacture a CR4 that we then effectively use as a proxy
>> instead of just checking XFAM.
> Yep.
> 
>> Since only some vcpu->arch.cr4 bits will be meaningful, it also
>> still leaves the possibility for confusion.
> IMO, it's less confusing having '0' for CR0 and CR4, while having accurate values
> for other state.  And I'm far more worried about KVM wondering into a bad path
> because CR0 and/or CR4 are completely wrong.  E.g. kvm_mmu.cpu_role would be
> completely wrong at steady state, the CR4-based runtime CPUID updates would do
> the wrong thing, and any helper that wraps kvm_is_cr{0,4}_bit_set() would likely
> do the worst possible thing.
> 
>> Are you sure you want this?
> Yeah, pretty sure.  It would be nice if the TDX Module exposed guest CR0/CR4 to
> KVM, a la the traps SEV-ES+ uses, but I think the next best thing is to assume
> the guest is using all features.
> 
>>> +	vcpu->arch.cr0 = <maximal value, give or take>;
>> AFAICT we don't need to care about CR0
> Probably not, but having e.g. CR4.PAE/LA57=1 with CR0.PG/PE=0 will be quite
> weird.

Below is what I have so far.  It seems to work.  Note:
 - use of MSR_IA32_VMX_CR0_FIXED1 and MSR_IA32_VMX_CR4_FIXED1
 to provide base value for CR0 and CR4
 - tdx_reinforce_guest_state() to make sure host state doesn't
 get broken because the values go wrong
 - __kvm_set_xcr() to handle guest_state_protected case
 - kvm_vcpu_reset() to handle guest_state_protected case

Please let me know your feedback.

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 10aebae5af18..2a5f756b05e2 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -351,8 +351,10 @@ static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 
 static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
-	if (is_td_vcpu(vcpu))
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_after_set_cpuid(vcpu);
 		return;
+	}
 
 	vmx_vcpu_after_set_cpuid(vcpu);
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 191ee209caa0..0ae427340494 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -710,6 +710,57 @@ int tdx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+/* Set a maximal guest CR0 value */
+static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
+{
+	u64 cr0;
+
+	rdmsrl(MSR_IA32_VMX_CR0_FIXED1, cr0);
+
+	if (cr4 & X86_CR4_CET)
+		cr0 |= X86_CR0_WP;
+
+	cr0 |= X86_CR0_PE | X86_CR0_NE;
+	cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
+
+	return cr0;
+}
+
+/*
+ * Set a maximal guest CR4 value. Clear bits forbidden by XFAM or
+ * TD Attributes.
+ */
+static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	u64 cr4;
+
+	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);
+
+	if (!(kvm_tdx->xfam & XFEATURE_PKRU))
+		cr4 &= ~X86_CR4_PKE;
+
+	if (!(kvm_tdx->xfam & XFEATURE_CET_USER) || !(kvm_tdx->xfam & BIT_ULL(12)))
+		cr4 &= ~X86_CR4_CET;
+
+	/* User Interrupts */
+	if (!(kvm_tdx->xfam & BIT_ULL(14)))
+		cr4 &= ~BIT_ULL(25);
+
+	if (!(kvm_tdx->attributes & TDX_TD_ATTR_LASS))
+		cr4 &= ~BIT_ULL(27);
+
+	if (!(kvm_tdx->attributes & TDX_TD_ATTR_PKS))
+		cr4 &= ~BIT_ULL(24);
+
+	if (!(kvm_tdx->attributes & TDX_TD_ATTR_KL))
+		cr4 &= ~BIT_ULL(19);
+
+	cr4 &= ~X86_CR4_SMXE;
+
+	return cr4;
+}
+
 int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -732,8 +783,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.cr0_guest_owned_bits = -1ul;
 	vcpu->arch.cr4_guest_owned_bits = -1ul;
 
-	vcpu->arch.cr4 = <maximal value>;
-	vcpu->arch.cr0 = <maximal value, give or take>;
+	vcpu->arch.cr4 = tdx_guest_cr4(vcpu);
+	vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
 
 	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
 	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
@@ -767,6 +818,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+	if (cpu_feature_enabled(X86_FEATURE_XSAVES))
+		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
+}
+
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -933,6 +990,24 @@ static void tdx_user_return_msr_update_cache(void)
 						 tdx_uret_msrs[i].defval);
 }
 
+static void tdx_reinforce_guest_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	if (WARN_ON_ONCE(vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK)))
+		vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
+	if (WARN_ON_ONCE(vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK)))
+		vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
+	if (WARN_ON_ONCE(vcpu->arch.pkru))
+		vcpu->arch.pkru = 0;
+	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVE) &&
+			 !kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)))
+		vcpu->arch.cr4 |= X86_CR4_OSXSAVE;
+	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVES) &&
+			 !guest_can_use(vcpu, X86_FEATURE_XSAVES)))
+		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
+}
+
 static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1028,9 +1103,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 		update_debugctlmsr(tdx->host_debugctlmsr);
 
 	tdx_user_return_msr_update_cache();
+
+	tdx_reinforce_guest_state(vcpu);
 	kvm_load_host_xsave_state(vcpu);
 
-	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
+	vcpu->arch.regs_avail = ~0;
 
 	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
 		return EXIT_FASTPATH_NONE;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 861c0f649b69..2e0e300a1f5e 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -110,6 +110,7 @@ struct tdx_cpuid_value {
 } __packed;
 
 #define TDX_TD_ATTR_DEBUG		BIT_ULL(0)
+#define TDX_TD_ATTR_LASS		BIT_ULL(27)
 #define TDX_TD_ATTR_SEPT_VE_DISABLE	BIT_ULL(28)
 #define TDX_TD_ATTR_PKS			BIT_ULL(30)
 #define TDX_TD_ATTR_KL			BIT_ULL(31)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7fb1bbf12b39..7f03a6a24abc 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -126,6 +126,7 @@ void tdx_vm_free(struct kvm *kvm);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
+void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
@@ -170,6 +171,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
+static inline void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 static inline int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2ea7db896ba..f2b1980f830d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 	u64 old_xcr0 = vcpu->arch.xcr0;
 	u64 valid_bits;
 
+	if (vcpu->arch.guest_state_protected) {
+		kvm_update_cpuid_runtime(vcpu);
+		return 0;
+	}
+
 	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
 	if (index != XCR_XFEATURE_ENABLED_MASK)
 		return 1;
@@ -12388,7 +12393,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	 * into hardware, to be zeroed at vCPU creation.  Use CRs as a sentinel
 	 * to detect improper or missing initialization.
 	 */
-	WARN_ON_ONCE(!init_event &&
+	WARN_ON_ONCE(!init_event && !vcpu->arch.guest_state_protected &&
 		     (old_cr0 || kvm_read_cr3(vcpu) || kvm_read_cr4(vcpu)));
 
 	/*
diff --git a/tools/testing/selftests/kvm/x86_64/tdx_vm_tests.c b/tools/testing/selftests/kvm/x86_64/tdx_vm_tests.c
index 075419ef3ac7..2cc9bf40a788 100644
--- a/tools/testing/selftests/kvm/x86_64/tdx_vm_tests.c
+++ b/tools/testing/selftests/kvm/x86_64/tdx_vm_tests.c
@@ -1054,7 +1054,7 @@ void verify_td_cpuid_tdcall(void)
 	TDX_TEST_ASSERT_SUCCESS(vcpu);
 
 	/* Get KVM CPUIDs for reference */
-	tmp = get_cpuid_entry(kvm_get_supported_cpuid(), 1, 0);
+	tmp = get_cpuid_entry(vcpu->cpuid, 1, 0);
 	TEST_ASSERT(tmp, "CPUID entry missing\n");
 
 	cpuid_entry = *tmp;


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-03 18:16             ` [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
@ 2025-01-09 19:11               ` Sean Christopherson
  2025-01-10 14:50                 ` Adrian Hunter
  2025-01-13 19:28                 ` Adrian Hunter
  0 siblings, 2 replies; 82+ messages in thread
From: Sean Christopherson @ 2025-01-09 19:11 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On Fri, Jan 03, 2025, Adrian Hunter wrote:
> On 20/12/24 18:22, Sean Christopherson wrote:
> +/* Set a maximal guest CR0 value */
> +static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
> +{
> +	u64 cr0;
> +
> +	rdmsrl(MSR_IA32_VMX_CR0_FIXED1, cr0);
> +
> +	if (cr4 & X86_CR4_CET)
> +		cr0 |= X86_CR0_WP;
> +
> +	cr0 |= X86_CR0_PE | X86_CR0_NE;
> +	cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
> +
> +	return cr0;
> +}
> +
> +/*
> + * Set a maximal guest CR4 value. Clear bits forbidden by XFAM or
> + * TD Attributes.
> + */
> +static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	u64 cr4;
> +
> +	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);

This won't be accurate long-term.  E.g. run KVM on hardware with CR4 bits that
neither KVM nor TDX know about, and vcpu->arch.cr4 will end up with bits set that
KVM think are illegal, which will cause it's own problems.

For CR0 and CR4, we should be able to start with KVM's set of allowed bits, not
the CPU's.  That will mean there will likely be missing bits, in vcpu->arch.cr{0,4},
but if KVM doesn't know about a bit, the fact that it's missing should be a complete
non-issue.

That also avoids weirdness for things like user-mode interrupts, LASS, PKS, etc.,
where KVM is open coding the bits.  The downside is that we'll need to remember
to update TDX when enabling those features to account for kvm_tdx->attributes,
but that's not unreasonable.

> +
> +	if (!(kvm_tdx->xfam & XFEATURE_PKRU))
> +		cr4 &= ~X86_CR4_PKE;
> +
> +	if (!(kvm_tdx->xfam & XFEATURE_CET_USER) || !(kvm_tdx->xfam & BIT_ULL(12)))
> +		cr4 &= ~X86_CR4_CET;
> +
> +	/* User Interrupts */
> +	if (!(kvm_tdx->xfam & BIT_ULL(14)))
> +		cr4 &= ~BIT_ULL(25);
> +
> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_LASS))
> +		cr4 &= ~BIT_ULL(27);
> +
> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_PKS))
> +		cr4 &= ~BIT_ULL(24);
> +
> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_KL))
> +		cr4 &= ~BIT_ULL(19);
> +
> +	cr4 &= ~X86_CR4_SMXE;
> +
> +	return cr4;
> +}
> +
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -732,8 +783,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.cr0_guest_owned_bits = -1ul;
>  	vcpu->arch.cr4_guest_owned_bits = -1ul;
>  
> -	vcpu->arch.cr4 = <maximal value>;
> -	vcpu->arch.cr0 = <maximal value, give or take>;
> +	vcpu->arch.cr4 = tdx_guest_cr4(vcpu);
> +	vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
>  
>  	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
>  	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
> @@ -767,6 +818,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> +void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> +{
> +	if (cpu_feature_enabled(X86_FEATURE_XSAVES))

This should use kvm_cpu_caps_has(), because strictly speaking it's KVM support
that matters.  In practice, I don't think it matters for XSAVES, but it can
matter for other features (though probably not for TDX guests).

> +		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
> +}
> +
>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -933,6 +990,24 @@ static void tdx_user_return_msr_update_cache(void)
>  						 tdx_uret_msrs[i].defval);
>  }
>  
> +static void tdx_reinforce_guest_state(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +
> +	if (WARN_ON_ONCE(vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK)))
> +		vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
> +	if (WARN_ON_ONCE(vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK)))
> +		vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
> +	if (WARN_ON_ONCE(vcpu->arch.pkru))
> +		vcpu->arch.pkru = 0;
> +	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVE) &&
> +			 !kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)))
> +		vcpu->arch.cr4 |= X86_CR4_OSXSAVE;
> +	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVES) &&
> +			 !guest_can_use(vcpu, X86_FEATURE_XSAVES)))
> +		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
> +}
> +
>  static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -1028,9 +1103,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  		update_debugctlmsr(tdx->host_debugctlmsr);
>  
>  	tdx_user_return_msr_update_cache();
> +
> +	tdx_reinforce_guest_state(vcpu);

Hmm, I don't think fixing up guest state is a good idea.  It probably works?
But continuing on when we know there's a KVM bug *and* a chance for host data
corruption seems unnecessarily risky.

My vote would to KVM_BUG_ON() before entering the guest.  I think I'd also be ok
omitting the checks, it's not like the potential for KVM bugs that clobber KVM's
view of state are unique to TDX (though I do agree that the behavior of the TDX
module in this case does make them more likely).

>  	kvm_load_host_xsave_state(vcpu);
>  
> -	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
> +	vcpu->arch.regs_avail = ~0;
>  
>  	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
>  		return EXIT_FASTPATH_NONE;
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> index 861c0f649b69..2e0e300a1f5e 100644
> --- a/arch/x86/kvm/vmx/tdx_arch.h
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -110,6 +110,7 @@ struct tdx_cpuid_value {
>  } __packed;
>  
>  #define TDX_TD_ATTR_DEBUG		BIT_ULL(0)
> +#define TDX_TD_ATTR_LASS		BIT_ULL(27)
>  #define TDX_TD_ATTR_SEPT_VE_DISABLE	BIT_ULL(28)
>  #define TDX_TD_ATTR_PKS			BIT_ULL(30)
>  #define TDX_TD_ATTR_KL			BIT_ULL(31)
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 7fb1bbf12b39..7f03a6a24abc 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -126,6 +126,7 @@ void tdx_vm_free(struct kvm *kvm);
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>  
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>  int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
> @@ -170,6 +171,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>  
>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> +static inline void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d2ea7db896ba..f2b1980f830d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>  	u64 old_xcr0 = vcpu->arch.xcr0;
>  	u64 valid_bits;
>  
> +	if (vcpu->arch.guest_state_protected) {

This should be a WARN_ON_ONCE() + return 1, no?

> +		kvm_update_cpuid_runtime(vcpu);
> +		return 0;
> +	}
> +
>  	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
>  	if (index != XCR_XFEATURE_ENABLED_MASK)
>  		return 1;
> @@ -12388,7 +12393,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	 * into hardware, to be zeroed at vCPU creation.  Use CRs as a sentinel
>  	 * to detect improper or missing initialization.
>  	 */
> -	WARN_ON_ONCE(!init_event &&
> +	WARN_ON_ONCE(!init_event && !vcpu->arch.guest_state_protected &&
>  		     (old_cr0 || kvm_read_cr3(vcpu) || kvm_read_cr4(vcpu)));

Maybe stuff state in tdx_vcpu_init() to avoid this waiver?  KVM is already
deferring APIC base and RCX initialization to that point, waiting to stuff all
TDX-specific vCPU state seems natural.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-09 19:11               ` Sean Christopherson
@ 2025-01-10 14:50                 ` Adrian Hunter
  2025-01-10 17:30                   ` Sean Christopherson
  2025-01-13 19:28                 ` Adrian Hunter
  1 sibling, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2025-01-10 14:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 9/01/25 21:11, Sean Christopherson wrote:
> On Fri, Jan 03, 2025, Adrian Hunter wrote:
>> On 20/12/24 18:22, Sean Christopherson wrote:
>> +/* Set a maximal guest CR0 value */
>> +static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
>> +{
>> +	u64 cr0;
>> +
>> +	rdmsrl(MSR_IA32_VMX_CR0_FIXED1, cr0);
>> +
>> +	if (cr4 & X86_CR4_CET)
>> +		cr0 |= X86_CR0_WP;
>> +
>> +	cr0 |= X86_CR0_PE | X86_CR0_NE;
>> +	cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
>> +
>> +	return cr0;
>> +}
>> +
>> +/*
>> + * Set a maximal guest CR4 value. Clear bits forbidden by XFAM or
>> + * TD Attributes.
>> + */
>> +static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +	u64 cr4;
>> +
>> +	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);
> 
> This won't be accurate long-term.  E.g. run KVM on hardware with CR4 bits that
> neither KVM nor TDX know about, and vcpu->arch.cr4 will end up with bits set that
> KVM think are illegal, which will cause it's own problems.

Currently validation of CR4 is only done when user space changes it,
which should not be allowed for TDX.  For that it looks like TDX
would need:

	kvm->arch.has_protected_state = true;

Not sure why it doesn't already?

> 
> For CR0 and CR4, we should be able to start with KVM's set of allowed bits, not
> the CPU's.  That will mean there will likely be missing bits, in vcpu->arch.cr{0,4},
> but if KVM doesn't know about a bit, the fact that it's missing should be a complete
> non-issue.

What about adding:

	cr4 &= ~cr4_reserved_bits;

and

	cr0 &= ~CR0_RESERVED_BITS
> 
> That also avoids weirdness for things like user-mode interrupts, LASS, PKS, etc.,
> where KVM is open coding the bits.  The downside is that we'll need to remember
> to update TDX when enabling those features to account for kvm_tdx->attributes,
> but that's not unreasonable.
> 
>> +
>> +	if (!(kvm_tdx->xfam & XFEATURE_PKRU))
>> +		cr4 &= ~X86_CR4_PKE;
>> +
>> +	if (!(kvm_tdx->xfam & XFEATURE_CET_USER) || !(kvm_tdx->xfam & BIT_ULL(12)))
>> +		cr4 &= ~X86_CR4_CET;
>> +
>> +	/* User Interrupts */
>> +	if (!(kvm_tdx->xfam & BIT_ULL(14)))
>> +		cr4 &= ~BIT_ULL(25);
>> +
>> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_LASS))
>> +		cr4 &= ~BIT_ULL(27);
>> +
>> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_PKS))
>> +		cr4 &= ~BIT_ULL(24);
>> +
>> +	if (!(kvm_tdx->attributes & TDX_TD_ATTR_KL))
>> +		cr4 &= ~BIT_ULL(19);
>> +
>> +	cr4 &= ~X86_CR4_SMXE;
>> +
>> +	return cr4;
>> +}
>> +
>>  int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>  {
>>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> @@ -732,8 +783,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>  	vcpu->arch.cr0_guest_owned_bits = -1ul;
>>  	vcpu->arch.cr4_guest_owned_bits = -1ul;
>>  
>> -	vcpu->arch.cr4 = <maximal value>;
>> -	vcpu->arch.cr0 = <maximal value, give or take>;
>> +	vcpu->arch.cr4 = tdx_guest_cr4(vcpu);
>> +	vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
>>  
>>  	vcpu->arch.tsc_offset = kvm_tdx->tsc_offset;
>>  	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
>> @@ -767,6 +818,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>>  	return 0;
>>  }
>>  
>> +void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> +{
>> +	if (cpu_feature_enabled(X86_FEATURE_XSAVES))
> 
> This should use kvm_cpu_caps_has(), because strictly speaking it's KVM support
> that matters.  In practice, I don't think it matters for XSAVES, but it can
> matter for other features (though probably not for TDX guests).
> 
>> +		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
>> +}
>> +
>>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  {
>>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> @@ -933,6 +990,24 @@ static void tdx_user_return_msr_update_cache(void)
>>  						 tdx_uret_msrs[i].defval);
>>  }
>>  
>> +static void tdx_reinforce_guest_state(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +
>> +	if (WARN_ON_ONCE(vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK)))
>> +		vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
>> +	if (WARN_ON_ONCE(vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK)))
>> +		vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
>> +	if (WARN_ON_ONCE(vcpu->arch.pkru))
>> +		vcpu->arch.pkru = 0;
>> +	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVE) &&
>> +			 !kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)))
>> +		vcpu->arch.cr4 |= X86_CR4_OSXSAVE;
>> +	if (WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_XSAVES) &&
>> +			 !guest_can_use(vcpu, X86_FEATURE_XSAVES)))
>> +		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
>> +}
>> +
>>  static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> @@ -1028,9 +1103,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>  		update_debugctlmsr(tdx->host_debugctlmsr);
>>  
>>  	tdx_user_return_msr_update_cache();
>> +
>> +	tdx_reinforce_guest_state(vcpu);
> 
> Hmm, I don't think fixing up guest state is a good idea.  It probably works?
> But continuing on when we know there's a KVM bug *and* a chance for host data
> corruption seems unnecessarily risky.
> 
> My vote would to KVM_BUG_ON() before entering the guest.  I think I'd also be ok
> omitting the checks, it's not like the potential for KVM bugs that clobber KVM's
> view of state are unique to TDX (though I do agree that the behavior of the TDX
> module in this case does make them more likely).

If the guest state that is vital to host state restoration, goes wrong
then the machine can die without much explanation, so KVM_BUG_ON() before
entering the guest seems prudent.

> 
>>  	kvm_load_host_xsave_state(vcpu);
>>  
>> -	vcpu->arch.regs_avail = TDX_REGS_UNSUPPORTED_SET;
>> +	vcpu->arch.regs_avail = ~0;
>>  
>>  	if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
>>  		return EXIT_FASTPATH_NONE;
>> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
>> index 861c0f649b69..2e0e300a1f5e 100644
>> --- a/arch/x86/kvm/vmx/tdx_arch.h
>> +++ b/arch/x86/kvm/vmx/tdx_arch.h
>> @@ -110,6 +110,7 @@ struct tdx_cpuid_value {
>>  } __packed;
>>  
>>  #define TDX_TD_ATTR_DEBUG		BIT_ULL(0)
>> +#define TDX_TD_ATTR_LASS		BIT_ULL(27)
>>  #define TDX_TD_ATTR_SEPT_VE_DISABLE	BIT_ULL(28)
>>  #define TDX_TD_ATTR_PKS			BIT_ULL(30)
>>  #define TDX_TD_ATTR_KL			BIT_ULL(31)
>> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
>> index 7fb1bbf12b39..7f03a6a24abc 100644
>> --- a/arch/x86/kvm/vmx/x86_ops.h
>> +++ b/arch/x86/kvm/vmx/x86_ops.h
>> @@ -126,6 +126,7 @@ void tdx_vm_free(struct kvm *kvm);
>>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>>  
>>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>> +void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
>>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>>  int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
>> @@ -170,6 +171,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
>>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>>  
>>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>> +static inline void tdx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) {}
>>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>>  static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
>>  static inline int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index d2ea7db896ba..f2b1980f830d 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>>  	u64 old_xcr0 = vcpu->arch.xcr0;
>>  	u64 valid_bits;
>>  
>> +	if (vcpu->arch.guest_state_protected) {
> 
> This should be a WARN_ON_ONCE() + return 1, no?

With kvm->arch.has_protected_state = true, KVM_SET_XCRS
would fail, which would probably be fine except for KVM selftests:

Currently the KVM selftests expect to be able to set XCR0:

    td_vcpu_add()
	vm_vcpu_add()
	    vm_arch_vcpu_add()
		vcpu_init_xcrs()
		    vcpu_xcrs_set()
			vcpu_ioctl(KVM_SET_XCRS)
			    __TEST_ASSERT_VM_VCPU_IOCTL(!ret)

Seems like vm->arch.has_protected_state is needed for
KVM selftests?

> 
>> +		kvm_update_cpuid_runtime(vcpu);

And kvm_update_cpuid_runtime() never gets called otherwise.
Not sure where would be a good place to call it.

>> +		return 0;
>> +	}
>> +
>>  	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
>>  	if (index != XCR_XFEATURE_ENABLED_MASK)
>>  		return 1;
>> @@ -12388,7 +12393,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>  	 * into hardware, to be zeroed at vCPU creation.  Use CRs as a sentinel
>>  	 * to detect improper or missing initialization.
>>  	 */
>> -	WARN_ON_ONCE(!init_event &&
>> +	WARN_ON_ONCE(!init_event && !vcpu->arch.guest_state_protected &&
>>  		     (old_cr0 || kvm_read_cr3(vcpu) || kvm_read_cr4(vcpu)));
> 
> Maybe stuff state in tdx_vcpu_init() to avoid this waiver?  KVM is already
> deferring APIC base and RCX initialization to that point, waiting to stuff all
> TDX-specific vCPU state seems natural.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-10 14:50                 ` Adrian Hunter
@ 2025-01-10 17:30                   ` Sean Christopherson
  2025-01-14 20:04                     ` Adrian Hunter
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2025-01-10 17:30 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On Fri, Jan 10, 2025, Adrian Hunter wrote:
> On 9/01/25 21:11, Sean Christopherson wrote:
> > On Fri, Jan 03, 2025, Adrian Hunter wrote:
> >> +static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >> +	u64 cr4;
> >> +
> >> +	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);
> > 
> > This won't be accurate long-term.  E.g. run KVM on hardware with CR4 bits that
> > neither KVM nor TDX know about, and vcpu->arch.cr4 will end up with bits set that
> > KVM think are illegal, which will cause it's own problems.
> 
> Currently validation of CR4 is only done when user space changes it,
> which should not be allowed for TDX.  For that it looks like TDX
> would need:
> 
> 	kvm->arch.has_protected_state = true;
> 
> Not sure why it doesn't already?

Sorry, I didn't follow any of that.

> > For CR0 and CR4, we should be able to start with KVM's set of allowed bits, not
> > the CPU's.  That will mean there will likely be missing bits, in vcpu->arch.cr{0,4},
> > but if KVM doesn't know about a bit, the fact that it's missing should be a complete
> > non-issue.
> 
> What about adding:
> 
> 	cr4 &= ~cr4_reserved_bits;
> 
> and
> 
> 	cr0 &= ~CR0_RESERVED_BITS

I was thinking a much more explicit:

	vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;

which if it's done in tdx_vcpu_init(), in conjunction with freezing the vCPU
model (see below), should be solid.

> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index d2ea7db896ba..f2b1980f830d 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
> >>  	u64 old_xcr0 = vcpu->arch.xcr0;
> >>  	u64 valid_bits;
> >>  
> >> +	if (vcpu->arch.guest_state_protected) {
> > 
> > This should be a WARN_ON_ONCE() + return 1, no?
> 
> With kvm->arch.has_protected_state = true, KVM_SET_XCRS
> would fail, which would probably be fine except for KVM selftests:
> 
> Currently the KVM selftests expect to be able to set XCR0:
> 
>     td_vcpu_add()
> 	vm_vcpu_add()
> 	    vm_arch_vcpu_add()
> 		vcpu_init_xcrs()
> 		    vcpu_xcrs_set()
> 			vcpu_ioctl(KVM_SET_XCRS)
> 			    __TEST_ASSERT_VM_VCPU_IOCTL(!ret)
> 
> Seems like vm->arch.has_protected_state is needed for KVM selftests?

I doubt it's truly needed, my guess (without looking at the code) is that selftests
are fudging around the fact that KVM doesn't stuff arch.xcr0.

> >> +		kvm_update_cpuid_runtime(vcpu);
> 
> And kvm_update_cpuid_runtime() never gets called otherwise.
> Not sure where would be a good place to call it.

I think we should call it in tdx_vcpu_init(), and then also freeze the vCPU model
at that time.  KVM currently "freezes" the model based on last_vmentry_cpu, but
that's a bit of a hack and might even be flawed, e.g. I wouldn't be surprised if
it's possible to lead KVM astray by trying to get a signal to race with KVM_RUN
so that last_vmentry_cpu isn't set despite getting quite far into KVM_RUN.

I'll test and post a patch to add vcpu_model_is_frozen (the below will conflict
mightily with the CPUID rework that's queued for v6.14), as I think it's a good
change even if we don't end up freezing the model at tdx_vcpu_init() (though I
can't think of any reason to allow CPUID updates after that point).

---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/cpuid.c            | 2 +-
 arch/x86/kvm/mmu/mmu.c          | 4 ++--
 arch/x86/kvm/pmu.c              | 2 +-
 arch/x86/kvm/vmx/tdx.c          | 7 +++++++
 arch/x86/kvm/x86.c              | 9 ++++++++-
 arch/x86/kvm/x86.h              | 5 -----
 7 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0787855ab006..41c31a69924d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -779,6 +779,7 @@ struct kvm_vcpu_arch {
 	u64 ia32_misc_enable_msr;
 	u64 smbase;
 	u64 smi_count;
+	bool vcpu_model_is_frozen;
 	bool at_instruction_boundary;
 	bool tpr_access_reporting;
 	bool xfd_no_write_intercept;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 6c7ab125f582..678518ec1c72 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -465,7 +465,7 @@ static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
 	 * KVM_SET_CPUID{,2} again. To support this legacy behavior, check
 	 * whether the supplied CPUID data is equal to what's already set.
 	 */
-	if (kvm_vcpu_has_run(vcpu)) {
+	if (vcpu->arch.vcpu_model_is_frozen) {
 		r = kvm_cpuid_check_equal(vcpu, e2, nent);
 		if (r)
 			return r;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 713ca857f2c2..75350a5c6c54 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5716,9 +5716,9 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 	/*
 	 * Changing guest CPUID after KVM_RUN is forbidden, see the comment in
-	 * kvm_arch_vcpu_ioctl().
+	 * kvm_arch_vcpu_ioctl_run().
 	 */
-	KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm);
+	KVM_BUG_ON(vcpu->arch.vcpu_model_is_frozen, vcpu->kvm);
 }
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 47a46283c866..4f487a980eae 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -752,7 +752,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 
-	if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
+	if (KVM_BUG_ON(vcpu->arch.vcpu_model_is_frozen, vcpu->kvm))
 		return;
 
 	/*
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a587f59167a7..997d14506a1f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2822,6 +2822,13 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 	td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
 	td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
 
+	/*
+	 * Freeze the vCPU model, as KVM relies on guest CPUID and capabilities
+	 * to be consistent with the TDX Module's view from here on out.
+	 */
+	vcpu->arch.vcpu_model_is_frozen = true;
+	kvm_update_cpuid_runtime(vcpu);
+
 	tdx->state = VCPU_TD_STATE_INITIALIZED;
 	return 0;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2ea7db896ba..3db935737b59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2218,7 +2218,8 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 	 * writes of the same value, e.g. to allow userspace to blindly stuff
 	 * all MSRs when emulating RESET.
 	 */
-	if (kvm_vcpu_has_run(vcpu) && kvm_is_immutable_feature_msr(index) &&
+	if (vcpu->arch.vcpu_model_is_frozen &&
+	    kvm_is_immutable_feature_msr(index) &&
 	    (do_get_msr(vcpu, index, &val) || *data != val))
 		return -EINVAL;
 
@@ -11469,6 +11470,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	struct kvm_run *kvm_run = vcpu->run;
 	int r;
 
+	/*
+	 * Freeze the vCPU model, i.e. disallow changing CPUID, feature MSRs,
+	 * etc.  KVM doesn't support changing the model once the vCPU has run.
+	 */
+	vcpu->arch.vcpu_model_is_frozen = true;
+
 	vcpu_load(vcpu);
 	kvm_sigset_activate(vcpu);
 	kvm_run->flags = 0;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 514ffd7513f3..6ed074d03616 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -118,11 +118,6 @@ static inline void kvm_leave_nested(struct kvm_vcpu *vcpu)
 	kvm_x86_ops.nested_ops->leave_nested(vcpu);
 }
 
-static inline bool kvm_vcpu_has_run(struct kvm_vcpu *vcpu)
-{
-	return vcpu->arch.last_vmentry_cpu != -1;
-}
-
 static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.exception.pending ||

base-commit: d12b37e67b767a9e89b221067d48b257708d3044
-- 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-09 19:11               ` Sean Christopherson
  2025-01-10 14:50                 ` Adrian Hunter
@ 2025-01-13 19:28                 ` Adrian Hunter
  2025-01-13 23:47                   ` Sean Christopherson
  1 sibling, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2025-01-13 19:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 9/01/25 21:11, Sean Christopherson wrote:
> My vote would to KVM_BUG_ON() before entering the guest.

I notice if KVM_BUG_ON() is called with interrupts disabled
smp_call_function_many_cond() generates a warning:

WARNING: CPU: 223 PID: 4213 at kernel/smp.c:807 smp_call_function_many_cond+0x421/0x560

static void smp_call_function_many_cond(const struct cpumask *mask,
					smp_call_func_t func, void *info,
					unsigned int scf_flags,
					smp_cond_func_t cond_func)
{
	int cpu, last_cpu, this_cpu = smp_processor_id();
	struct call_function_data *cfd;
	bool wait = scf_flags & SCF_WAIT;
	int nr_cpus = 0;
	bool run_remote = false;
	bool run_local = false;

	lockdep_assert_preemption_disabled();

	/*
	 * Can deadlock when called with interrupts disabled.
	 * We allow cpu's that are not yet online though, as no one else can
	 * send smp call function interrupt to this cpu and as such deadlocks
	 * can't happen.
	 */
	if (cpu_online(this_cpu) && !oops_in_progress &&
	    !early_boot_irqs_disabled)
		lockdep_assert_irqs_enabled();			<------------- here

Do we need to care about that?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-13 19:28                 ` Adrian Hunter
@ 2025-01-13 23:47                   ` Sean Christopherson
  0 siblings, 0 replies; 82+ messages in thread
From: Sean Christopherson @ 2025-01-13 23:47 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On Mon, Jan 13, 2025, Adrian Hunter wrote:
> On 9/01/25 21:11, Sean Christopherson wrote:
> > My vote would to KVM_BUG_ON() before entering the guest.
> 
> I notice if KVM_BUG_ON() is called with interrupts disabled
> smp_call_function_many_cond() generates a warning:
> 
> WARNING: CPU: 223 PID: 4213 at kernel/smp.c:807 smp_call_function_many_cond+0x421/0x560
> 
> static void smp_call_function_many_cond(const struct cpumask *mask,
> 					smp_call_func_t func, void *info,
> 					unsigned int scf_flags,
> 					smp_cond_func_t cond_func)
> {
> 	int cpu, last_cpu, this_cpu = smp_processor_id();
> 	struct call_function_data *cfd;
> 	bool wait = scf_flags & SCF_WAIT;
> 	int nr_cpus = 0;
> 	bool run_remote = false;
> 	bool run_local = false;
> 
> 	lockdep_assert_preemption_disabled();
> 
> 	/*
> 	 * Can deadlock when called with interrupts disabled.
> 	 * We allow cpu's that are not yet online though, as no one else can
> 	 * send smp call function interrupt to this cpu and as such deadlocks
> 	 * can't happen.
> 	 */
> 	if (cpu_online(this_cpu) && !oops_in_progress &&
> 	    !early_boot_irqs_disabled)
> 		lockdep_assert_irqs_enabled();			<------------- here
> 
> Do we need to care about that?

Ugh, yes.  E.g. the deadlock mentioned in the comment would occur if two vCPUs
hit the KVM_BUG_ON() at the same time (they'd both wait for the other to respond
to *their* IPI).

Since the damage is limited to the current vCPU, i.e. letting userspace run other
vCPUs is unlikely to put KVM in harm's way, a not-awful alternative would be to
WARN_ON_ONCE() and return KVM_EXIT_INTERNAL_ERROR.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-10 17:30                   ` Sean Christopherson
@ 2025-01-14 20:04                     ` Adrian Hunter
  2025-01-15  2:28                       ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Adrian Hunter @ 2025-01-14 20:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On 10/01/25 19:30, Sean Christopherson wrote:
> On Fri, Jan 10, 2025, Adrian Hunter wrote:
>> On 9/01/25 21:11, Sean Christopherson wrote:
>>> On Fri, Jan 03, 2025, Adrian Hunter wrote:
>>>> +static u64 tdx_guest_cr4(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>>>> +	u64 cr4;
>>>> +
>>>> +	rdmsrl(MSR_IA32_VMX_CR4_FIXED1, cr4);
>>>
>>> This won't be accurate long-term.  E.g. run KVM on hardware with CR4 bits that
>>> neither KVM nor TDX know about, and vcpu->arch.cr4 will end up with bits set that
>>> KVM think are illegal, which will cause it's own problems.
>>
>> Currently validation of CR4 is only done when user space changes it,
>> which should not be allowed for TDX.  For that it looks like TDX
>> would need:
>>
>> 	kvm->arch.has_protected_state = true;
>>
>> Not sure why it doesn't already?
> 
> Sorry, I didn't follow any of that.
> 
>>> For CR0 and CR4, we should be able to start with KVM's set of allowed bits, not
>>> the CPU's.  That will mean there will likely be missing bits, in vcpu->arch.cr{0,4},
>>> but if KVM doesn't know about a bit, the fact that it's missing should be a complete
>>> non-issue.
>>
>> What about adding:
>>
>> 	cr4 &= ~cr4_reserved_bits;
>>
>> and
>>
>> 	cr0 &= ~CR0_RESERVED_BITS
> 
> I was thinking a much more explicit:
> 
> 	vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;
> 
> which if it's done in tdx_vcpu_init(), in conjunction with freezing the vCPU
> model (see below), should be solid.
> 
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index d2ea7db896ba..f2b1980f830d 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -1240,6 +1240,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>>>>  	u64 old_xcr0 = vcpu->arch.xcr0;
>>>>  	u64 valid_bits;
>>>>  
>>>> +	if (vcpu->arch.guest_state_protected) {
>>>
>>> This should be a WARN_ON_ONCE() + return 1, no?
>>
>> With kvm->arch.has_protected_state = true, KVM_SET_XCRS
>> would fail, which would probably be fine except for KVM selftests:
>>
>> Currently the KVM selftests expect to be able to set XCR0:
>>
>>     td_vcpu_add()
>> 	vm_vcpu_add()
>> 	    vm_arch_vcpu_add()
>> 		vcpu_init_xcrs()
>> 		    vcpu_xcrs_set()
>> 			vcpu_ioctl(KVM_SET_XCRS)
>> 			    __TEST_ASSERT_VM_VCPU_IOCTL(!ret)
>>
>> Seems like vm->arch.has_protected_state is needed for KVM selftests?
> 
> I doubt it's truly needed, my guess (without looking at the code) is that selftests
> are fudging around the fact that KVM doesn't stuff arch.xcr0.

Here is when it was added:

commit 8b14c4d85d031f7700fa4e042aebf99d933971f0
Author: Sean Christopherson <seanjc@google.com>
Date:   Thu Oct 3 16:43:31 2024 -0700

    KVM: selftests: Configure XCR0 to max supported value by default
    
    To play nice with compilers generating AVX instructions, set CR4.OSXSAVE
    and configure XCR0 by default when creating selftests vCPUs.  Some distros
    have switched gcc to '-march=x86-64-v3' by default, and while it's hard to
    find a CPU which doesn't support AVX today, many KVM selftests fail with

Is below OK to avoid it?

diff --git a/tools/testing/selftests/kvm/include/x86_64/kvm_util_arch.h b/tools/testing/selftests/kvm/include/x86_64/kvm_util_arch.h
index 972bb1c4ab4c..42925152ed25 100644
--- a/tools/testing/selftests/kvm/include/x86_64/kvm_util_arch.h
+++ b/tools/testing/selftests/kvm/include/x86_64/kvm_util_arch.h
@@ -12,20 +12,21 @@ extern bool is_forced_emulation_enabled;
 
 struct kvm_vm_arch {
 	vm_vaddr_t gdt;
 	vm_vaddr_t tss;
 	vm_vaddr_t idt;
 
 	uint64_t c_bit;
 	uint64_t s_bit;
 	int sev_fd;
 	bool is_pt_protected;
+	bool has_protected_sregs;
 };
 
 static inline bool __vm_arch_has_protected_memory(struct kvm_vm_arch *arch)
 {
 	return arch->c_bit || arch->s_bit;
 }
 
 #define vm_arch_has_protected_memory(vm) \
 	__vm_arch_has_protected_memory(&(vm)->arch)
 
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 0ed0768b1c88..89b70fe037d1 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -704,22 +704,25 @@ struct kvm_vcpu *vm_arch_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id)
 	 *
 	 * If this code is ever used to launch a vCPU with 32-bit entry point it
 	 * may need to subtract 4 bytes instead of 8 bytes.
 	 */
 	TEST_ASSERT(IS_ALIGNED(stack_vaddr, PAGE_SIZE),
 		    "__vm_vaddr_alloc() did not provide a page-aligned address");
 	stack_vaddr -= 8;
 
 	vcpu = __vm_vcpu_add(vm, vcpu_id);
 	vcpu_init_cpuid(vcpu, kvm_get_supported_cpuid());
-	vcpu_init_sregs(vm, vcpu);
-	vcpu_init_xcrs(vm, vcpu);
+
+	if (!vm->arch.has_protected_sregs) {
+		vcpu_init_sregs(vm, vcpu);
+		vcpu_init_xcrs(vm, vcpu);
+	}
 
 	vcpu->initial_stack_addr = stack_vaddr;
 
 	/* Setup guest general purpose registers */
 	vcpu_regs_get(vcpu, &regs);
 	regs.rflags = regs.rflags | 0x2;
 	regs.rsp = stack_vaddr;
 	vcpu_regs_set(vcpu, &regs);
 
 	/* Setup the MP state */
diff --git a/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c b/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
index 16db4e97673e..da4bcfefdd70 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
@@ -477,25 +477,22 @@ static void load_td_boot_parameters(struct td_boot_parameters *params,
  * entering the TD first time.
  *
  * Input Args:
  *   vm - Virtual Machine
  *   vcpuid - The id of the VCPU to add to the VM.
  */
 struct kvm_vcpu *td_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id, void *guest_code)
 {
 	struct kvm_vcpu *vcpu;
 
-	/*
-	 * TD setup will not use the value of rip set in vm_vcpu_add anyway, so
-	 * NULL can be used for guest_code.
-	 */
-	vcpu = vm_vcpu_add(vm, vcpu_id, NULL);
+	vm->arch.has_protected_sregs = true;
+	vcpu = vm_arch_vcpu_add(vm, vcpu_id);
 
 	tdx_td_vcpu_init(vcpu);
 
 	load_td_boot_parameters(addr_gpa2hva(vm, TD_BOOT_PARAMETERS_GPA),
 				vcpu, guest_code);
 
 	return vcpu;
 }
 
 /**


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-14 20:04                     ` Adrian Hunter
@ 2025-01-15  2:28                       ` Sean Christopherson
  0 siblings, 0 replies; 82+ messages in thread
From: Sean Christopherson @ 2025-01-15  2:28 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Chao Gao, pbonzini, kvm, dave.hansen, rick.p.edgecombe, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, x86, yan.y.zhao,
	weijiang.yang

On Tue, Jan 14, 2025, Adrian Hunter wrote:
> On 10/01/25 19:30, Sean Christopherson wrote:
> >> Currently the KVM selftests expect to be able to set XCR0:
> >>
> >>     td_vcpu_add()
> >> 	vm_vcpu_add()
> >> 	    vm_arch_vcpu_add()
> >> 		vcpu_init_xcrs()
> >> 		    vcpu_xcrs_set()
> >> 			vcpu_ioctl(KVM_SET_XCRS)
> >> 			    __TEST_ASSERT_VM_VCPU_IOCTL(!ret)
> >>
> >> Seems like vm->arch.has_protected_state is needed for KVM selftests?
> > 
> > I doubt it's truly needed, my guess (without looking at the code) is that selftests
> > are fudging around the fact that KVM doesn't stuff arch.xcr0.
> 
> Here is when it was added:
> 
> commit 8b14c4d85d031f7700fa4e042aebf99d933971f0
> Author: Sean Christopherson <seanjc@google.com>
> Date:   Thu Oct 3 16:43:31 2024 -0700
> 
>     KVM: selftests: Configure XCR0 to max supported value by default
>     
>     To play nice with compilers generating AVX instructions, set CR4.OSXSAVE
>     and configure XCR0 by default when creating selftests vCPUs.  Some distros
>     have switched gcc to '-march=x86-64-v3' by default, and while it's hard to
>     find a CPU which doesn't support AVX today, many KVM selftests fail with

Gah, sorry.  I misread the callstack the first time around and didn't realize it
was the common code that was writing XCR0.

> Is below OK to avoid it?

Skipping the ioctls to set XCRs and SREGS is definitely ok.  I'll hold off on
providing concrete feedback until I review the TDX selftests in its entirety,
as I'm skeptical of having td_vcpu_add() wrap vm_arch_vcpu_add() instead of the
other way around, but I don't want to cause a bunch of noise by reacting to a
sliver of the code.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: PKEY syscall number for selftest? (was: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD)
  2024-12-20 21:24             ` PKEY syscall number for selftest? (was: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD) Sean Christopherson
@ 2025-01-27 17:09               ` Sean Christopherson
  0 siblings, 0 replies; 82+ messages in thread
From: Sean Christopherson @ 2025-01-27 17:09 UTC (permalink / raw)
  To: Dave Hansen; +Cc: kvm, linux-kernel, x86

Ping, I'm guessing this fell through the holiday cracks.

On Fri, Dec 20, 2024, Sean Christopherson wrote:
> Switching topics, dropped everyone else except the list.
> 
> On Fri, Dec 20, 2024, Sean Christopherson wrote:
> >  arch/x86/kvm/x86.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4320647bd78a..9d5cece9260b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1186,7 +1186,7 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> >  	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
> >  	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> >  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
> > -		write_pkru(vcpu->arch.pkru);
> > +		wrpkru(vcpu->arch.pkru);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
> >  
> > @@ -1200,7 +1200,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
> >  	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
> >  		vcpu->arch.pkru = rdpkru();
> >  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
> > -			write_pkru(vcpu->arch.host_pkru);
> > +			wrpkru(vcpu->arch.host_pkru);
> >  	}
> >  
> >  	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> > 
> > base-commit: 13e98294d7cec978e31138d16824f50556a62d17
> > -- 
> 
> I tried to test this by running the mm/protection_keys selftest in a VM, but it
> gives what are effectively false passes on x86-64 due to the selftest picking up
> the generic syscall numbers, e.g. 289 for SYS_pkey_alloc, instead of the x86-64
> numbers.
> 
> I was able to get the test to run by hacking tools/testing/selftests/mm/pkey-x86.h
> to shove in the right numbers, but I can't imagine that's the intended behavior.
> 
> If I omit the #undefs from pkey-x86.h, it shows that the test is grabbing the
> definitions from the generic usr/include/asm-generic/unistd.h header.
> 
> Am I doing something stupid?
> 
> Regardless of whether this is PEBKAC or working as intended, on x86, the test
> should ideally assert that "ospke" support in /proc/cpuinfo is consistent with
> the result of sys_pkey_alloc(), e.g. so that an failure to allocate a pkey on a
> system that work is reported as an error, not a pass.
> 
> --
> diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h
> index ac91777c8917..ccc3552e6b77 100644
> --- a/tools/testing/selftests/mm/pkey-x86.h
> +++ b/tools/testing/selftests/mm/pkey-x86.h
> @@ -3,6 +3,10 @@
>  #ifndef _PKEYS_X86_H
>  #define _PKEYS_X86_H
>  
> +#define __NR_pkey_mprotect     329
> +#define __NR_pkey_alloc                330
> +#define __NR_pkey_free         331
> +
>  #ifdef __i386__
>  
>  #define REG_IP_IDX             REG_EIP
> --
> 
> Yields:
> 
> $ ARCH=x86_64 make protection_keys_64
> gcc -Wall -I /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../..  -isystem /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../../usr/include -isystem /home/sean/go/src/kernel.org/linux/tools/testing/selftests/../../../tools/include/uapi -no-pie -D_GNU_SOURCE=  -m64 -mxsave  protection_keys.c vm_util.c thp_settings.c -lrt -lpthread -lm -lrt -ldl -o /home/sean/go/src/kernel.org/linux/tools/testing/selftests/mm/protection_keys_64
> In file included from pkey-helpers.h:102:0,
>                  from protection_keys.c:49:
> pkey-x86.h:6:0: warning: "__NR_pkey_mprotect" redefined
>  #define __NR_pkey_mprotect 329
>  
> In file included from protection_keys.c:45:0:
> /home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:693:0: note: this is the location of the previous definition
>  #define __NR_pkey_mprotect 288
>  
> In file included from pkey-helpers.h:102:0,
>                  from protection_keys.c:49:
> pkey-x86.h:7:0: warning: "__NR_pkey_alloc" redefined
>  #define __NR_pkey_alloc  330
>  
> In file included from protection_keys.c:45:0:
> /home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:695:0: note: this is the location of the previous definition
>  #define __NR_pkey_alloc 289
>  
> In file included from pkey-helpers.h:102:0,
>                  from protection_keys.c:49:
> pkey-x86.h:8:0: warning: "__NR_pkey_free" redefined
>  #define __NR_pkey_free  331
>  
> In file included from protection_keys.c:45:0:
> /home/sean/go/src/kernel.org/linux/usr/include/asm-generic/unistd.h:697:0: note: this is the location of the previous definition
>  #define __NR_pkey_free 290
>  
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2025-01-27 17:09 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-21 20:14 [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Adrian Hunter
2024-11-21 20:14 ` [PATCH RFC 1/7] x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest Adrian Hunter
2024-11-22 11:10   ` Adrian Hunter
2024-11-22 16:33     ` Dave Hansen
2024-11-25 13:40       ` Adrian Hunter
2024-11-28 11:13         ` Adrian Hunter
2024-12-04 15:58           ` Adrian Hunter
2024-12-11 18:43             ` Adrian Hunter
2024-12-13 15:45               ` Adrian Hunter
2024-12-13 16:16               ` Dave Hansen
2024-12-13 16:30                 ` Adrian Hunter
2024-12-13 16:44                   ` Dave Hansen
2024-11-22 16:26   ` Dave Hansen
2024-11-22 17:29     ` Edgecombe, Rick P
2024-11-25 13:43       ` Adrian Hunter
2024-11-21 20:14 ` [PATCH 2/7] KVM: TDX: Implement TDX vcpu enter/exit path Adrian Hunter
2024-11-22  5:23   ` Xiaoyao Li
2024-11-22  5:56     ` Binbin Wu
2024-11-22 14:33       ` Adrian Hunter
2024-11-28  5:56         ` Yan Zhao
2024-11-28  6:26           ` Adrian Hunter
2024-11-21 20:14 ` [PATCH 3/7] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) Adrian Hunter
2024-11-25 14:12   ` Nikolay Borisov
2024-11-26 16:15     ` Adrian Hunter
2024-11-21 20:14 ` [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
2024-11-22  5:49   ` Chao Gao
2024-11-25 11:10     ` Adrian Hunter
2024-11-26  2:20       ` Chao Gao
2024-11-28  6:50         ` Adrian Hunter
2024-12-02  2:52           ` Chao Gao
2024-12-02  6:36             ` Adrian Hunter
2024-12-17 16:09       ` Sean Christopherson
2024-12-20 15:22         ` Adrian Hunter
2024-12-20 16:22           ` Sean Christopherson
2024-12-20 21:24             ` PKEY syscall number for selftest? (was: [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD) Sean Christopherson
2025-01-27 17:09               ` Sean Christopherson
2025-01-03 18:16             ` [PATCH 4/7] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
2025-01-09 19:11               ` Sean Christopherson
2025-01-10 14:50                 ` Adrian Hunter
2025-01-10 17:30                   ` Sean Christopherson
2025-01-14 20:04                     ` Adrian Hunter
2025-01-15  2:28                       ` Sean Christopherson
2025-01-13 19:28                 ` Adrian Hunter
2025-01-13 23:47                   ` Sean Christopherson
2024-11-25 11:34     ` Adrian Hunter
2024-11-21 20:14 ` [PATCH 5/7] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr Adrian Hunter
2024-11-21 20:14 ` [PATCH 6/7] KVM: TDX: restore user ret MSRs Adrian Hunter
2024-11-21 20:14 ` [PATCH 7/7] KVM: TDX: Add TSX_CTRL msr into uret_msrs list Adrian Hunter
2024-11-22  3:27   ` Chao Gao
2024-11-27 14:00     ` Sean Christopherson
2024-11-29 11:39       ` Adrian Hunter
2024-12-02 19:07         ` Sean Christopherson
2024-12-02 19:24           ` Edgecombe, Rick P
2024-12-03  0:34             ` Sean Christopherson
2024-12-03 17:34               ` Edgecombe, Rick P
2024-12-03 19:17                 ` Adrian Hunter
2024-12-04  1:25                   ` Chao Gao
2024-12-04  6:18                     ` Adrian Hunter
2024-12-04  6:37                       ` Chao Gao
2024-12-04  6:57                         ` Adrian Hunter
2024-12-04 11:13                           ` Chao Gao
2024-12-04 11:55                             ` Adrian Hunter
2024-12-04 15:33                               ` Xiaoyao Li
2024-12-04 23:51                                 ` Edgecombe, Rick P
2024-12-05 17:31                                 ` Adrian Hunter
2024-12-06  3:37                                   ` Xiaoyao Li
2024-12-06 14:40                                     ` Adrian Hunter
2024-12-09  2:46                                       ` Xiaoyao Li
2024-12-09  7:08                                         ` Adrian Hunter
2024-12-10  2:45                                           ` Xiaoyao Li
2024-12-04 23:40                               ` Edgecombe, Rick P
2024-11-25  1:25 ` [PATCH 0/7] KVM: TDX: TD vcpu enter/exit Binbin Wu
2024-11-25 15:19   ` Sean Christopherson
2024-11-25 19:50     ` Huang, Kai
2024-11-25 22:51       ` Sean Christopherson
2024-11-26  1:43         ` Huang, Kai
2024-11-26  1:44         ` Binbin Wu
2024-11-26  3:52           ` Huang, Kai
2024-11-26  5:29             ` Binbin Wu
2024-11-26  5:37               ` Huang, Kai
2024-11-26 21:41               ` Sean Christopherson
2024-12-10 18:23 ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).