[PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit
@ 2025-01-29  9:58 Adrian Hunter
  2025-01-29  9:58 ` [PATCH V2 01/12] x86/virt/tdx: Make tdh_vp_enter() noinstr Adrian Hunter
                   ` (11 more replies)
  0 siblings, 12 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

Hi

Changes in V2:

  Note, in addition to addressing replies to version 1, there are changes
  suggested by Sean here:

    https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com

  Patches are re-based on kvm-coco-queue plus:

    latest TDX MMU Part 2 series
    latest TDX vCPU/VM creation series

  x86/virt/tdx: Make tdh_vp_enter() noinstr
  KVM: x86: Allow the use of kvm_load_host_xsave_state() with
            guest_state_protected
  KVM: TDX: Set arch.has_protected_state to true
  KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct
    New patches

  KVM: TDX: Implement TDX vcpu enter/exit path
    Move VCPU_TD_STATE_INITIALIZED check to tdx_vcpu_pre_run() (Xiaoyao)
    Check TD_STATE_RUNNABLE also in tdx_vcpu_pre_run() (Yan)
    Add back 'noinstr' for tdx_vcpu_enter_exit() (Sean)
    Add WARN_ON_ONCE if force_immediate_exit (Sean)
    Add vp_enter_args to vcpu_tdx to store the input/output arguments for
    tdh_vp_enter().
    Don't copy arguments to/from vcpu->arch.regs[] unconditionally. (Sean)

  KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
    Use 1 variable named 'guest_state_loaded' to track host state
    save/restore (Sean)
    Rebased due to moving guest_state_loaded/msr_host_kernel_gs_base
    to struct vcpu_vt.

  KVM: TDX: restore host xsave state when exit from the guest TD
    Drop PT and CET feature flags (Chao)
    Use cpu_feature_enabled() instead of static_cpu_has() (Chao)
    Restore PKRU only if the host value differs from defined
    exit value (Chao)
    Use defined masks to separate XFAM bits into XCR0/XSS (Adrian)
    Use existing kvm_load_host_xsave_state() in place of
    tdx_restore_host_xsave_state() by defining guest CR4, XCR0, XSS
    and PKRU (Sean)
    Do not enter if vital guest state is invalid (Adrian)

  KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
  KVM: TDX: restore user ret MSRs
    No changes

  KVM: TDX: Disable support for TSX and WAITPKG
  KVM: TDX: Save and restore IA32_DEBUGCTL
    New patches

  KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
    Moved from TDX "the rest" to "TD vcpu enter/exit"

This patch series introduces callbacks to facilitate the entry of a TD VCPU
and the corresponding save/restore of host state.

The "outstanding things still to do" from V1 are now all done.  Refer V1
cover letter:

  https://lore.kernel.org/20241121201448.36170-1-adrian.hunter@intel.com

So the patches are ready for hand off to Paolo.

Overview

A TD VCPU is entered via the SEAMCALL TDH.VP.ENTER. The TDX Module manages
the save/restore of guest state and, in conjunction with the SEAMCALL
interface, handles certain aspects of host state. However, there are
specific elements of the host state that require additional attention, as
detailed in the Intel TDX ABI documentation for TDH.VP.ENTER.

TDX is quite different from VMX in this regard.  For VMX, the host VMM is
heavily involved in restoring, managing and saving guest CPU state, whereas
for TDX this is handled by the TDX Module.  In that way, the TDX Module can
protect the confidentiality and integrity of TD CPU state.

The TDX Module does not save/restore all host CPU state because the host
VMM can do it more efficiently and selectively.  CPU state referred to
below is host CPU state.  Often values are already held in memory so no
explicit save is needed, and restoration may not be needed if the kernel
is not using a feature.

Key Details

Argument Passing: Similar to other SEAMCALLs, TDH.VP.ENTER passes
arguments through General Purpose Registers (GPRs). For the special case
of the TD guest invoking TDG.VP.VMCALL, nearly any GPR can be used,
as well as XMM0 to XMM15. Notably, RBP is not used, and Linux mandates the
TDX Module feature NO_RBP_MOD, which is enforced elsewhere. Additionally,
XMM registers are not required for the existing Guest Hypervisor
Communication Interface and are handled by existing KVM code should they be
modified by the guest.

Debug Register Handling: After TDH.VP.ENTER returns, registers DR0, DR1,
DR2, DR3, DR6, and DR7 are set to their architectural INIT values. Existing
KVM code already handles the restoration of host values as needed, refer
vcpu_enter_guest() which calls hw_breakpoint_restore().

MSR Restoration: Certain Model-Specific Registers (MSRs) need to be
restored post TDH.VP.ENTER. The Intel TDX ABI documentation provides a
detailed list in the msr_preservation.json file. Most MSRs do not require
restoration if the guest is not utilizing the corresponding feature. The
following features are currently assumed to be unsupported, and their MSRs
are not restored:
  PERFMON            (TD ATTRIBUTES[63])
  LBRs               (XFAM[15])
  User Interrupts    (XFAM[14])
  Intel PT           (XFAM[8])
  CET                (XFAM[11-12])

Remaining host MSR/Register Handling:

MSR IA32_XFD is already restored by KVM, refer to kvm_put_guest_fpu().
The TDX Module sets MSR IA32_XFD_ERR to its RESET value (0) which is
fine for the kernel.

MSR IA32_DEBUGCTL is addressed in patch "KVM: TDX: Save and restore
IA32_DEBUGCTL".

MSR IA32_UARCH_MISC_CTL is not utilized by the kernel, so it is fine if
the TDX Module sets it to it's RESET value.

MSR IA32_KERNEL_GS_BASE is addressed in patch "KVM: TDX: vcpu_run:
save/restore host state (host kernel gs)".

MSRs IA32_XSS and XCRO are handled in patch "KVM: TDX: restore host xsave
state when exiting from the guest TD".

MSRs IA32_STAR, IA32_LSTAR, IA32_FMASK, and IA32_TSC_AUX are handled in
patch "KVM: TDX: restore user ret MSRs".

MSR IA32_TSX_CTRL and MSR IA32_UMWAIT_CONTROL are addressed in patch
"KVM: TDX: Disable support for TSX and WAITPKG"

Protecting Host CPU State

These patches completely protect host CPU state.  The host CPU state
that can be affected is determined by the TD configuration, especially
XFAM, TD attributes and CPUID.

Should any vital guest information be found to conflict with TD
configuration, VCPU entry is blocked by checks in
tdx_guest_state_is_invalid().

Base

This series is based on kvm-coco-queue plus:

      latest TDX MMU Part 2 series
      latest TDX vCPU/VM creation series

Due to "KVM: VMX: Move common fields of struct", subsequent patches require
changes to use new struct vcpu_vt, refer to the full KVM branch below.

It requires TDX module 1.5.06.00.0744 [1], or later. This is due to removal
of the workarounds for the lack of the NO_RBP_MOD feature required by the
kernel. Now NO_RBP_MOD is enabled (in VM/vCPU creation patches), and this
particular version of the TDX module has a required NO_RBP_MOD related bug
fix.
A working edk2 commit is 95d8a1c ("UnitTestFrameworkPkg: Use TianoCore
mirror of subhook submodule").

Testing

The series has been tested as part of the development branch for the TDX
base series. The testing consisted of TDX kvm-unit-tests and booting a
Linux TD, and TDX enhanced KVM selftests.

The full KVM branch is here:

  https://github.com/intel/tdx/tree/tdx_kvm_dev-2025-01-28.2

Matching QEMU: TBD

[1] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06

Adrian Hunter (4):
      x86/virt/tdx: Make tdh_vp_enter() noinstr
      KVM: TDX: Set arch.has_protected_state to true
      KVM: TDX: Disable support for TSX and WAITPKG
      KVM: TDX: Save and restore IA32_DEBUGCTL

Binbin Wu (1):
      KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct

Chao Gao (1):
      KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr

Isaku Yamahata (5):
      KVM: TDX: Implement TDX vcpu enter/exit path
      KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
      KVM: TDX: restore host xsave state when exit from the guest TD
      KVM: TDX: restore user ret MSRs
      KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior

Sean Christopherson (1):
      KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected

 arch/x86/include/asm/kvm_host.h  |  12 +-
 arch/x86/kvm/svm/svm.c           |   7 +-
 arch/x86/kvm/vmx/common.h        |  68 +++++++++++
 arch/x86/kvm/vmx/main.c          |  48 +++++++-
 arch/x86/kvm/vmx/nested.c        |  10 +-
 arch/x86/kvm/vmx/posted_intr.c   |  18 +--
 arch/x86/kvm/vmx/tdx.c           | 258 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.h           |  19 +--
 arch/x86/kvm/vmx/vmx.c           |  99 +++++++--------
 arch/x86/kvm/vmx/vmx.h           | 104 ++++++----------
 arch/x86/kvm/vmx/x86_ops.h       |  11 ++
 arch/x86/kvm/x86.c               |  46 +++++--
 arch/x86/virt/vmx/tdx/seamcall.S |   3 +
 arch/x86/virt/vmx/tdx/tdx.c      |   2 +-
 14 files changed, 535 insertions(+), 170 deletions(-)

Regards
Adrian

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 01/12] x86/virt/tdx: Make tdh_vp_enter() noinstr
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-02-16 18:26   ` Paolo Bonzini
  2025-01-29  9:58 ` [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected Adrian Hunter
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang,
	dave.hansen, x86

Make tdh_vp_enter() noinstr because KVM requires VM entry to be noinstr
for 2 reasons:
 1. The use of context tracking via guest_state_enter_irqoff() and
    guest_state_exit_irqoff()
 2. The need to avoid IRET between VM-exit and NMI handling in order to
    avoid prematurely releasing NMI inhibit.

Consequently make __seamcall_saved_ret() noinstr also. Currently
tdh_vp_enter() is the only caller of __seamcall_saved_ret().

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v2:
 - New patch
---
 arch/x86/virt/vmx/tdx/seamcall.S | 3 +++
 arch/x86/virt/vmx/tdx/tdx.c      | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
index 5b1f2286aea9..6854c52c374b 100644
--- a/arch/x86/virt/vmx/tdx/seamcall.S
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -41,6 +41,9 @@ SYM_FUNC_START(__seamcall_ret)
 	TDX_MODULE_CALL host=1 ret=1
 SYM_FUNC_END(__seamcall_ret)
 
+/* KVM requires non-instrumentable __seamcall_saved_ret() for TDH.VP.ENTER */
+.section .noinstr.text, "ax"
+
 /*
  * __seamcall_saved_ret() - Host-side interface functions to SEAM software
  * (the P-SEAMLDR or the TDX module), with saving output registers to the
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 4a010e65276d..1515c467dd86 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1511,7 +1511,7 @@ static void tdx_clflush_page(struct page *page)
 	clflush_cache_range(page_to_virt(page), PAGE_SIZE);
 }
 
-u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *args)
+noinstr u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *args)
 {
 	args->rcx = tdx_tdvpr_pa(td);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 01/12] x86/virt/tdx: Make tdh_vp_enter() noinstr
  2025-01-29  9:58 ` [PATCH V2 01/12] x86/virt/tdx: Make tdh_vp_enter() noinstr Adrian Hunter
@ 2025-02-16 18:26   ` Paolo Bonzini
  2025-02-27 14:13     ` Adrian Hunter
  0 siblings, 1 reply; 39+ messages in thread
From: Paolo Bonzini @ 2025-02-16 18:26 UTC (permalink / raw)
  To: Adrian Hunter, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, yan.y.zhao, chao.gao, weijiang.yang, dave.hansen,
	x86

On 1/29/25 10:58, Adrian Hunter wrote:
> Make tdh_vp_enter() noinstr because KVM requires VM entry to be noinstr
> for 2 reasons:
>   1. The use of context tracking via guest_state_enter_irqoff() and
>      guest_state_exit_irqoff()
>   2. The need to avoid IRET between VM-exit and NMI handling in order to
>      avoid prematurely releasing NMI inhibit.
> 
> Consequently make __seamcall_saved_ret() noinstr also. Currently
> tdh_vp_enter() is the only caller of __seamcall_saved_ret().
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>

This can be squashed into "x86/virt/tdx: Add SEAMCALL wrapper to 
enter/exit TDX guest"; I did that in kvm-coco-queue.

Paolo

> ---
> TD vcpu enter/exit v2:
>   - New patch
> ---
>   arch/x86/virt/vmx/tdx/seamcall.S | 3 +++
>   arch/x86/virt/vmx/tdx/tdx.c      | 2 +-
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> index 5b1f2286aea9..6854c52c374b 100644
> --- a/arch/x86/virt/vmx/tdx/seamcall.S
> +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> @@ -41,6 +41,9 @@ SYM_FUNC_START(__seamcall_ret)
>   	TDX_MODULE_CALL host=1 ret=1
>   SYM_FUNC_END(__seamcall_ret)
>   
> +/* KVM requires non-instrumentable __seamcall_saved_ret() for TDH.VP.ENTER */
> +.section .noinstr.text, "ax"
> +
>   /*
>    * __seamcall_saved_ret() - Host-side interface functions to SEAM software
>    * (the P-SEAMLDR or the TDX module), with saving output registers to the
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 4a010e65276d..1515c467dd86 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1511,7 +1511,7 @@ static void tdx_clflush_page(struct page *page)
>   	clflush_cache_range(page_to_virt(page), PAGE_SIZE);
>   }
>   
> -u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *args)
> +noinstr u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *args)
>   {
>   	args->rcx = tdx_tdvpr_pa(td);
>   


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 01/12] x86/virt/tdx: Make tdh_vp_enter() noinstr
  2025-02-16 18:26   ` Paolo Bonzini
@ 2025-02-27 14:13     ` Adrian Hunter
  0 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-02-27 14:13 UTC (permalink / raw)
  To: Paolo Bonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, nik.borisov,
	linux-kernel, yan.y.zhao, chao.gao, weijiang.yang, dave.hansen,
	x86

On 16/02/25 20:26, Paolo Bonzini wrote:
> On 1/29/25 10:58, Adrian Hunter wrote:
>> Make tdh_vp_enter() noinstr because KVM requires VM entry to be noinstr
>> for 2 reasons:
>>   1. The use of context tracking via guest_state_enter_irqoff() and
>>      guest_state_exit_irqoff()
>>   2. The need to avoid IRET between VM-exit and NMI handling in order to
>>      avoid prematurely releasing NMI inhibit.
>>
>> Consequently make __seamcall_saved_ret() noinstr also. Currently
>> tdh_vp_enter() is the only caller of __seamcall_saved_ret().
>>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> 
> This can be squashed into "x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest"; I did that in kvm-coco-queue.

We have re-based on kvm-coco-queue so we in-sync on this.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
  2025-01-29  9:58 ` [PATCH V2 01/12] x86/virt/tdx: Make tdh_vp_enter() noinstr Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-02-20 10:50   ` Xiaoyao Li
  2025-01-29  9:58 ` [PATCH V2 03/12] KVM: TDX: Set arch.has_protected_state to true Adrian Hunter
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

From: Sean Christopherson <seanjc@google.com>

Allow the use of kvm_load_host_xsave_state() with
vcpu->arch.guest_state_protected == true. This will allow TDX to reuse
kvm_load_host_xsave_state() instead of creating its own version.

For consistency, amend kvm_load_guest_xsave_state() also.

Ensure that guest state that kvm_load_host_xsave_state() depends upon,
such as MSR_IA32_XSS, cannot be changed by user space, if
guest_state_protected.

[Adrian: wrote commit message]

Link: https://lore.kernel.org/r/Z2GiQS_RmYeHU09L@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v2:
 - New patch
---
 arch/x86/kvm/svm/svm.c |  7 +++++--
 arch/x86/kvm/x86.c     | 18 +++++++++++-------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7640a84e554a..b4bcfe15ad5e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4253,7 +4253,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 		svm_set_dr6(svm, DR6_ACTIVE_LOW);
 
 	clgi();
-	kvm_load_guest_xsave_state(vcpu);
+
+	if (!vcpu->arch.guest_state_protected)
+		kvm_load_guest_xsave_state(vcpu);
 
 	kvm_wait_lapic_expire(vcpu);
 
@@ -4282,7 +4284,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
 		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
 
-	kvm_load_host_xsave_state(vcpu);
+	if (!vcpu->arch.guest_state_protected)
+		kvm_load_host_xsave_state(vcpu);
 	stgi();
 
 	/* Any pending NMI will happen here */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbb6b7f40b3a..5cf9f023fd4b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1169,11 +1169,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
 
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->arch.guest_state_protected)
-		return;
+	WARN_ON_ONCE(vcpu->arch.guest_state_protected);
 
 	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
-
 		if (vcpu->arch.xcr0 != kvm_host.xcr0)
 			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 
@@ -1192,13 +1190,11 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
 
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->arch.guest_state_protected)
-		return;
-
 	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
 	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
 	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
-		vcpu->arch.pkru = rdpkru();
+		if (!vcpu->arch.guest_state_protected)
+			vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
 			wrpkru(vcpu->arch.host_pkru);
 	}
@@ -3916,6 +3912,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
 			return 1;
+
+		if (vcpu->arch.guest_state_protected)
+			return 1;
+
 		/*
 		 * KVM supports exposing PT to the guest, but does not support
 		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
@@ -4375,6 +4375,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
 			return 1;
+
+		if (vcpu->arch.guest_state_protected)
+			return 1;
+
 		msr_info->data = vcpu->arch.ia32_xss;
 		break;
 	case MSR_K7_CLK_CTL:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-01-29  9:58 ` [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected Adrian Hunter
@ 2025-02-20 10:50   ` Xiaoyao Li
  2025-02-24 11:38     ` Adrian Hunter
  2025-03-06 18:04     ` Paolo Bonzini
  0 siblings, 2 replies; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-20 10:50 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 1/29/2025 5:58 PM, Adrian Hunter wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Allow the use of kvm_load_host_xsave_state() with
> vcpu->arch.guest_state_protected == true. This will allow TDX to reuse
> kvm_load_host_xsave_state() instead of creating its own version.
> 
> For consistency, amend kvm_load_guest_xsave_state() also.
> 
> Ensure that guest state that kvm_load_host_xsave_state() depends upon,
> such as MSR_IA32_XSS, cannot be changed by user space, if
> guest_state_protected.
> 
> [Adrian: wrote commit message]
> 
> Link: https://lore.kernel.org/r/Z2GiQS_RmYeHU09L@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
> TD vcpu enter/exit v2:
>   - New patch
> ---
>   arch/x86/kvm/svm/svm.c |  7 +++++--
>   arch/x86/kvm/x86.c     | 18 +++++++++++-------
>   2 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a..b4bcfe15ad5e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4253,7 +4253,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>   		svm_set_dr6(svm, DR6_ACTIVE_LOW);
>   
>   	clgi();
> -	kvm_load_guest_xsave_state(vcpu);
> +
> +	if (!vcpu->arch.guest_state_protected)
> +		kvm_load_guest_xsave_state(vcpu);
>   
>   	kvm_wait_lapic_expire(vcpu);
>   
> @@ -4282,7 +4284,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>   	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
>   		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
>   
> -	kvm_load_host_xsave_state(vcpu);
> +	if (!vcpu->arch.guest_state_protected)
> +		kvm_load_host_xsave_state(vcpu);
>   	stgi();
>   
>   	/* Any pending NMI will happen here */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bbb6b7f40b3a..5cf9f023fd4b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1169,11 +1169,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
>   
>   void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>   {
> -	if (vcpu->arch.guest_state_protected)
> -		return;
> +	WARN_ON_ONCE(vcpu->arch.guest_state_protected);
>   
>   	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> -
>   		if (vcpu->arch.xcr0 != kvm_host.xcr0)
>   			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>   
> @@ -1192,13 +1190,11 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>   
>   void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>   {
> -	if (vcpu->arch.guest_state_protected)
> -		return;
> -
>   	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
>   	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>   	     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
> -		vcpu->arch.pkru = rdpkru();
> +		if (!vcpu->arch.guest_state_protected)
> +			vcpu->arch.pkru = rdpkru();

this needs justification.

>   		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>   			wrpkru(vcpu->arch.host_pkru);
>   	}


> @@ -3916,6 +3912,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   		if (!msr_info->host_initiated &&
>   		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>   			return 1;
> +
> +		if (vcpu->arch.guest_state_protected)
> +			return 1;
> +

this and below change need to be a separate patch. So that we can 
discuss independently.

I see no reason to make MSR_IA32_XSS special than other MSRs. When 
guest_state_protected, most of the MSRs that aren't emulated by KVM are 
inaccessible by KVM.

>   		/*
>   		 * KVM supports exposing PT to the guest, but does not support
>   		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
> @@ -4375,6 +4375,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   		if (!msr_info->host_initiated &&
>   		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>   			return 1;
> +
> +		if (vcpu->arch.guest_state_protected)
> +			return 1;
> +
>   		msr_info->data = vcpu->arch.ia32_xss;
>   		break;
>   	case MSR_K7_CLK_CTL:


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-02-20 10:50   ` Xiaoyao Li
@ 2025-02-24 11:38     ` Adrian Hunter
  2025-02-25  5:56       ` Xiaoyao Li
  2025-03-06 18:04     ` Paolo Bonzini
  1 sibling, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-02-24 11:38 UTC (permalink / raw)
  To: Xiaoyao Li, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 20/02/25 12:50, Xiaoyao Li wrote:
> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> Allow the use of kvm_load_host_xsave_state() with
>> vcpu->arch.guest_state_protected == true. This will allow TDX to reuse
>> kvm_load_host_xsave_state() instead of creating its own version.
>>
>> For consistency, amend kvm_load_guest_xsave_state() also.
>>
>> Ensure that guest state that kvm_load_host_xsave_state() depends upon,
>> such as MSR_IA32_XSS, cannot be changed by user space, if
>> guest_state_protected.
>>
>> [Adrian: wrote commit message]
>>
>> Link: https://lore.kernel.org/r/Z2GiQS_RmYeHU09L@google.com
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>> TD vcpu enter/exit v2:
>>   - New patch
>> ---
>>   arch/x86/kvm/svm/svm.c |  7 +++++--
>>   arch/x86/kvm/x86.c     | 18 +++++++++++-------
>>   2 files changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 7640a84e554a..b4bcfe15ad5e 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -4253,7 +4253,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>>           svm_set_dr6(svm, DR6_ACTIVE_LOW);
>>         clgi();
>> -    kvm_load_guest_xsave_state(vcpu);
>> +
>> +    if (!vcpu->arch.guest_state_protected)
>> +        kvm_load_guest_xsave_state(vcpu);
>>         kvm_wait_lapic_expire(vcpu);
>>   @@ -4282,7 +4284,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>>       if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
>>           kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
>>   -    kvm_load_host_xsave_state(vcpu);
>> +    if (!vcpu->arch.guest_state_protected)
>> +        kvm_load_host_xsave_state(vcpu);
>>       stgi();
>>         /* Any pending NMI will happen here */
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index bbb6b7f40b3a..5cf9f023fd4b 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1169,11 +1169,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
>>     void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>   {
>> -    if (vcpu->arch.guest_state_protected)
>> -        return;
>> +    WARN_ON_ONCE(vcpu->arch.guest_state_protected);
>>         if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
>> -
>>           if (vcpu->arch.xcr0 != kvm_host.xcr0)
>>               xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>>   @@ -1192,13 +1190,11 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>     void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>   {
>> -    if (vcpu->arch.guest_state_protected)
>> -        return;
>> -
>>       if (cpu_feature_enabled(X86_FEATURE_PKU) &&
>>           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>>            kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
>> -        vcpu->arch.pkru = rdpkru();
>> +        if (!vcpu->arch.guest_state_protected)
>> +            vcpu->arch.pkru = rdpkru();
> 
> this needs justification.

It was proposed by Sean here:

	https://lore.kernel.org/all/Z2WZ091z8GmGjSbC@google.com/

which is part of the email thread referenced by the "Link:" tag above

> 
>>           if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>>               wrpkru(vcpu->arch.host_pkru);
>>       }
> 
> 
>> @@ -3916,6 +3912,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>           if (!msr_info->host_initiated &&
>>               !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>>               return 1;
>> +
>> +        if (vcpu->arch.guest_state_protected)
>> +            return 1;
>> +
> 
> this and below change need to be a separate patch. So that we can discuss independently.
> 
> I see no reason to make MSR_IA32_XSS special than other MSRs. When guest_state_protected, most of the MSRs that aren't emulated by KVM are inaccessible by KVM.

Yes, TDX will block access to MSR_IA32_XSS anyway because
tdx_has_emulated_msr() will return false for MSR_IA32_XSS.

However kvm_load_host_xsave_state() is not TDX-specific code and it
relies upon vcpu->arch.ia32_xss, so there is reason to block
access to it when vcpu->arch.guest_state_protected is true.

> 
>>           /*
>>            * KVM supports exposing PT to the guest, but does not support
>>            * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
>> @@ -4375,6 +4375,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>           if (!msr_info->host_initiated &&
>>               !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>>               return 1;
>> +
>> +        if (vcpu->arch.guest_state_protected)
>> +            return 1;
>> +
>>           msr_info->data = vcpu->arch.ia32_xss;
>>           break;
>>       case MSR_K7_CLK_CTL:
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-02-24 11:38     ` Adrian Hunter
@ 2025-02-25  5:56       ` Xiaoyao Li
  2025-02-27 14:14         ` Adrian Hunter
  0 siblings, 1 reply; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-25  5:56 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 2/24/2025 7:38 PM, Adrian Hunter wrote:
> On 20/02/25 12:50, Xiaoyao Li wrote:
>> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>>> From: Sean Christopherson <seanjc@google.com>
>>>
>>> Allow the use of kvm_load_host_xsave_state() with
>>> vcpu->arch.guest_state_protected == true. This will allow TDX to reuse
>>> kvm_load_host_xsave_state() instead of creating its own version.
>>>
>>> For consistency, amend kvm_load_guest_xsave_state() also.
>>>
>>> Ensure that guest state that kvm_load_host_xsave_state() depends upon,
>>> such as MSR_IA32_XSS, cannot be changed by user space, if
>>> guest_state_protected.
>>>
>>> [Adrian: wrote commit message]
>>>
>>> Link: https://lore.kernel.org/r/Z2GiQS_RmYeHU09L@google.com
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>>> ---
>>> TD vcpu enter/exit v2:
>>>    - New patch
>>> ---
>>>    arch/x86/kvm/svm/svm.c |  7 +++++--
>>>    arch/x86/kvm/x86.c     | 18 +++++++++++-------
>>>    2 files changed, 16 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index 7640a84e554a..b4bcfe15ad5e 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -4253,7 +4253,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>>>            svm_set_dr6(svm, DR6_ACTIVE_LOW);
>>>          clgi();
>>> -    kvm_load_guest_xsave_state(vcpu);
>>> +
>>> +    if (!vcpu->arch.guest_state_protected)
>>> +        kvm_load_guest_xsave_state(vcpu);
>>>          kvm_wait_lapic_expire(vcpu);
>>>    @@ -4282,7 +4284,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>>>        if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
>>>            kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
>>>    -    kvm_load_host_xsave_state(vcpu);
>>> +    if (!vcpu->arch.guest_state_protected)
>>> +        kvm_load_host_xsave_state(vcpu);
>>>        stgi();
>>>          /* Any pending NMI will happen here */
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index bbb6b7f40b3a..5cf9f023fd4b 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -1169,11 +1169,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
>>>      void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>>    {
>>> -    if (vcpu->arch.guest_state_protected)
>>> -        return;
>>> +    WARN_ON_ONCE(vcpu->arch.guest_state_protected);
>>>          if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
>>> -
>>>            if (vcpu->arch.xcr0 != kvm_host.xcr0)
>>>                xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>>>    @@ -1192,13 +1190,11 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>>      void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>>    {
>>> -    if (vcpu->arch.guest_state_protected)
>>> -        return;
>>> -
>>>        if (cpu_feature_enabled(X86_FEATURE_PKU) &&
>>>            ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>>>             kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
>>> -        vcpu->arch.pkru = rdpkru();
>>> +        if (!vcpu->arch.guest_state_protected)
>>> +            vcpu->arch.pkru = rdpkru();
>>
>> this needs justification.
> 
> It was proposed by Sean here:
> 
> 	https://lore.kernel.org/all/Z2WZ091z8GmGjSbC@google.com/
> 
> which is part of the email thread referenced by the "Link:" tag above

IMHO, this change needs to be put in patch 07, which is the better place 
to justify it.

>>
>>>            if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>>>                wrpkru(vcpu->arch.host_pkru);
>>>        }
>>
>>
>>> @@ -3916,6 +3912,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>            if (!msr_info->host_initiated &&
>>>                !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>>>                return 1;
>>> +
>>> +        if (vcpu->arch.guest_state_protected)
>>> +            return 1;
>>> +
>>
>> this and below change need to be a separate patch. So that we can discuss independently.
>>
>> I see no reason to make MSR_IA32_XSS special than other MSRs. When guest_state_protected, most of the MSRs that aren't emulated by KVM are inaccessible by KVM.
> 
> Yes, TDX will block access to MSR_IA32_XSS anyway because
> tdx_has_emulated_msr() will return false for MSR_IA32_XSS.
> 
> However kvm_load_host_xsave_state() is not TDX-specific code and it
> relies upon vcpu->arch.ia32_xss, so there is reason to block
> access to it when vcpu->arch.guest_state_protected is true.

It is TDX specific logic that TDX requires vcpu->arch.ia32_xss unchanged 
since TDX is going to utilize kvm_load_host_xsave_state() to restore 
host xsave state and relies on vcpu->arch.ia32_xss to be always the 
value of XFAM & XSS_MASK.

So please put this change into the TDX specific patch with the clear 
justfication.

>>
>>>            /*
>>>             * KVM supports exposing PT to the guest, but does not support
>>>             * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
>>> @@ -4375,6 +4375,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>            if (!msr_info->host_initiated &&
>>>                !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>>>                return 1;
>>> +
>>> +        if (vcpu->arch.guest_state_protected)
>>> +            return 1;
>>> +
>>>            msr_info->data = vcpu->arch.ia32_xss;
>>>            break;
>>>        case MSR_K7_CLK_CTL:
>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-02-25  5:56       ` Xiaoyao Li
@ 2025-02-27 14:14         ` Adrian Hunter
  0 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-02-27 14:14 UTC (permalink / raw)
  To: Xiaoyao Li, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 25/02/25 07:56, Xiaoyao Li wrote:
> On 2/24/2025 7:38 PM, Adrian Hunter wrote:
>> On 20/02/25 12:50, Xiaoyao Li wrote:
>>> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>>>> From: Sean Christopherson <seanjc@google.com>
>>>>
>>>> Allow the use of kvm_load_host_xsave_state() with
>>>> vcpu->arch.guest_state_protected == true. This will allow TDX to reuse
>>>> kvm_load_host_xsave_state() instead of creating its own version.
>>>>
>>>> For consistency, amend kvm_load_guest_xsave_state() also.
>>>>
>>>> Ensure that guest state that kvm_load_host_xsave_state() depends upon,
>>>> such as MSR_IA32_XSS, cannot be changed by user space, if
>>>> guest_state_protected.
>>>>
>>>> [Adrian: wrote commit message]
>>>>
>>>> Link: https://lore.kernel.org/r/Z2GiQS_RmYeHU09L@google.com
>>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>>>> ---
>>>> TD vcpu enter/exit v2:
>>>>    - New patch
>>>> ---
>>>>    arch/x86/kvm/svm/svm.c |  7 +++++--
>>>>    arch/x86/kvm/x86.c     | 18 +++++++++++-------
>>>>    2 files changed, 16 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>>> index 7640a84e554a..b4bcfe15ad5e 100644
>>>> --- a/arch/x86/kvm/svm/svm.c
>>>> +++ b/arch/x86/kvm/svm/svm.c
>>>> @@ -4253,7 +4253,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>>>>            svm_set_dr6(svm, DR6_ACTIVE_LOW);
>>>>          clgi();
>>>> -    kvm_load_guest_xsave_state(vcpu);
>>>> +
>>>> +    if (!vcpu->arch.guest_state_protected)
>>>> +        kvm_load_guest_xsave_state(vcpu);
>>>>          kvm_wait_lapic_expire(vcpu);
>>>>    @@ -4282,7 +4284,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>>>>        if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
>>>>            kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
>>>>    -    kvm_load_host_xsave_state(vcpu);
>>>> +    if (!vcpu->arch.guest_state_protected)
>>>> +        kvm_load_host_xsave_state(vcpu);
>>>>        stgi();
>>>>          /* Any pending NMI will happen here */
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index bbb6b7f40b3a..5cf9f023fd4b 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -1169,11 +1169,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
>>>>      void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>>>    {
>>>> -    if (vcpu->arch.guest_state_protected)
>>>> -        return;
>>>> +    WARN_ON_ONCE(vcpu->arch.guest_state_protected);
>>>>          if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
>>>> -
>>>>            if (vcpu->arch.xcr0 != kvm_host.xcr0)
>>>>                xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>>>>    @@ -1192,13 +1190,11 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>>>      void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>>>    {
>>>> -    if (vcpu->arch.guest_state_protected)
>>>> -        return;
>>>> -
>>>>        if (cpu_feature_enabled(X86_FEATURE_PKU) &&
>>>>            ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>>>>             kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
>>>> -        vcpu->arch.pkru = rdpkru();
>>>> +        if (!vcpu->arch.guest_state_protected)
>>>> +            vcpu->arch.pkru = rdpkru();
>>>
>>> this needs justification.
>>
>> It was proposed by Sean here:
>>
>>     https://lore.kernel.org/all/Z2WZ091z8GmGjSbC@google.com/
>>
>> which is part of the email thread referenced by the "Link:" tag above
> 
> IMHO, this change needs to be put in patch 07, which is the better place to justify it.
> 
>>>
>>>>            if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>>>>                wrpkru(vcpu->arch.host_pkru);
>>>>        }
>>>
>>>
>>>> @@ -3916,6 +3912,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>>            if (!msr_info->host_initiated &&
>>>>                !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>>>>                return 1;
>>>> +
>>>> +        if (vcpu->arch.guest_state_protected)
>>>> +            return 1;
>>>> +
>>>
>>> this and below change need to be a separate patch. So that we can discuss independently.
>>>
>>> I see no reason to make MSR_IA32_XSS special than other MSRs. When guest_state_protected, most of the MSRs that aren't emulated by KVM are inaccessible by KVM.
>>
>> Yes, TDX will block access to MSR_IA32_XSS anyway because
>> tdx_has_emulated_msr() will return false for MSR_IA32_XSS.
>>
>> However kvm_load_host_xsave_state() is not TDX-specific code and it
>> relies upon vcpu->arch.ia32_xss, so there is reason to block
>> access to it when vcpu->arch.guest_state_protected is true.
> 
> It is TDX specific logic that TDX requires vcpu->arch.ia32_xss unchanged since TDX is going to utilize kvm_load_host_xsave_state() to restore host xsave state and relies on vcpu->arch.ia32_xss to be always the value of XFAM & XSS_MASK.
> 
> So please put this change into the TDX specific patch with the clear justfication.

This patch set is owned by Paolo now, so it is up to him.

>>>
>>>>            /*
>>>>             * KVM supports exposing PT to the guest, but does not support
>>>>             * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
>>>> @@ -4375,6 +4375,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>>            if (!msr_info->host_initiated &&
>>>>                !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>>>>                return 1;
>>>> +
>>>> +        if (vcpu->arch.guest_state_protected)
>>>> +            return 1;
>>>> +
>>>>            msr_info->data = vcpu->arch.ia32_xss;
>>>>            break;
>>>>        case MSR_K7_CLK_CTL:
>>>
>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-02-20 10:50   ` Xiaoyao Li
  2025-02-24 11:38     ` Adrian Hunter
@ 2025-03-06 18:04     ` Paolo Bonzini
  2025-03-06 20:43       ` Sean Christopherson
  1 sibling, 1 reply; 39+ messages in thread
From: Paolo Bonzini @ 2025-03-06 18:04 UTC (permalink / raw)
  To: Xiaoyao Li, Adrian Hunter, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 2/20/25 11:50, Xiaoyao Li wrote:
> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> Allow the use of kvm_load_host_xsave_state() with
>> vcpu->arch.guest_state_protected == true. This will allow TDX to reuse
>> kvm_load_host_xsave_state() instead of creating its own version.
>>
>> For consistency, amend kvm_load_guest_xsave_state() also.
>>
>> Ensure that guest state that kvm_load_host_xsave_state() depends upon,
>> such as MSR_IA32_XSS, cannot be changed by user space, if
>> guest_state_protected.
>>
>> [Adrian: wrote commit message]
>>
>> Link: https://lore.kernel.org/r/Z2GiQS_RmYeHU09L@google.com
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>> TD vcpu enter/exit v2:
>>   - New patch
>> ---
>>   arch/x86/kvm/svm/svm.c |  7 +++++--
>>   arch/x86/kvm/x86.c     | 18 +++++++++++-------
>>   2 files changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 7640a84e554a..b4bcfe15ad5e 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -4253,7 +4253,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct 
>> kvm_vcpu *vcpu,
>>           svm_set_dr6(svm, DR6_ACTIVE_LOW);
>>       clgi();
>> -    kvm_load_guest_xsave_state(vcpu);
>> +
>> +    if (!vcpu->arch.guest_state_protected)
>> +        kvm_load_guest_xsave_state(vcpu);
>>       kvm_wait_lapic_expire(vcpu);
>> @@ -4282,7 +4284,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct 
>> kvm_vcpu *vcpu,
>>       if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
>>           kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
>> -    kvm_load_host_xsave_state(vcpu);
>> +    if (!vcpu->arch.guest_state_protected)
>> +        kvm_load_host_xsave_state(vcpu);
>>       stgi();
>>       /* Any pending NMI will happen here */
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index bbb6b7f40b3a..5cf9f023fd4b 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1169,11 +1169,9 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
>>   void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>   {
>> -    if (vcpu->arch.guest_state_protected)
>> -        return;
>> +    WARN_ON_ONCE(vcpu->arch.guest_state_protected);
>>       if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
>> -
>>           if (vcpu->arch.xcr0 != kvm_host.xcr0)
>>               xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>> @@ -1192,13 +1190,11 @@ EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>   void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>   {
>> -    if (vcpu->arch.guest_state_protected)
>> -        return;
>> -
>>       if (cpu_feature_enabled(X86_FEATURE_PKU) &&
>>           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>>            kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
>> -        vcpu->arch.pkru = rdpkru();
>> +        if (!vcpu->arch.guest_state_protected)
>> +            vcpu->arch.pkru = rdpkru();
> 
> this needs justification.
> 
>>           if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>>               wrpkru(vcpu->arch.host_pkru);
>>       }
> 
> 
>> @@ -3916,6 +3912,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, 
>> struct msr_data *msr_info)
>>           if (!msr_info->host_initiated &&
>>               !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>>               return 1;
>> +
>> +        if (vcpu->arch.guest_state_protected)
>> +            return 1;
>> +
> 
> this and below change need to be a separate patch. So that we can 
> discuss independently.
> 
> I see no reason to make MSR_IA32_XSS special than other MSRs. When 
> guest_state_protected, most of the MSRs that aren't emulated by KVM are 
> inaccessible by KVM.

I agree with Xiaoyao that this change is sensible but should be proposed 
separately for both SNP and TDX.

Allowing the use of kvm_load_host_xsave_state() is really ugly, 
especially since the corresponding code is so simple:

         if (cpu_feature_enabled(X86_FEATURE_PKU) && vcpu->arch.pkru != 0)
                         wrpkru(vcpu->arch.host_pkru);

	if (kvm_host.xcr0 != kvm_tdx->xfam & kvm_caps.supported_xcr0)
		xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);

	/*
	 * All TDX hosts support XSS; but even if they didn't, both
	 * arms of the comparison would be 0 and the wrmsrl would be
	 * skipped.
	 */
	if (kvm_host.xss != kvm_tdx->xfam & kvm_caps.supported_xss)
		wrmsrl(MSR_IA32_XSS, kvm_host.xss);

This is really all that should be in patch 7.  I'll test it and decide 
what to do.

Paolo

>>           /*
>>            * KVM supports exposing PT to the guest, but does not support
>>            * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
>> @@ -4375,6 +4375,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, 
>> struct msr_data *msr_info)
>>           if (!msr_info->host_initiated &&
>>               !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>>               return 1;
>> +
>> +        if (vcpu->arch.guest_state_protected)
>> +            return 1;
>> +
>>           msr_info->data = vcpu->arch.ia32_xss;
>>           break;
>>       case MSR_K7_CLK_CTL:
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-03-06 18:04     ` Paolo Bonzini
@ 2025-03-06 20:43       ` Sean Christopherson
  2025-03-06 22:34         ` Paolo Bonzini
  0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2025-03-06 20:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Xiaoyao Li, Adrian Hunter, kvm, rick.p.edgecombe, kai.huang,
	reinette.chatre, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel, yan.y.zhao, chao.gao,
	weijiang.yang

On Thu, Mar 06, 2025, Paolo Bonzini wrote:
> I agree with Xiaoyao that this change is sensible but should be proposed
> separately for both SNP and TDX.
> 
> Allowing the use of kvm_load_host_xsave_state() is really ugly, especially
> since the corresponding code is so simple:
> 
>         if (cpu_feature_enabled(X86_FEATURE_PKU) && vcpu->arch.pkru != 0)
>                         wrpkru(vcpu->arch.host_pkru);

It's clearly not "so simple", because this code is buggy.

The justification for using kvm_load_host_xsave_state() is that either KVM gets
the TDX state model correct and the existing flows Just Work, or we handle all
that state as one-offs and at best replicate concepts and flows, and at worst
have bugs that are unique to TDX, e.g. because we get the "simple" code wrong,
we miss flows that subtly consume state, etc.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-03-06 20:43       ` Sean Christopherson
@ 2025-03-06 22:34         ` Paolo Bonzini
  2025-03-07 23:04           ` Sean Christopherson
  0 siblings, 1 reply; 39+ messages in thread
From: Paolo Bonzini @ 2025-03-06 22:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Adrian Hunter, kvm, Rick Edgecombe, Huang, Kai,
	reinette.chatre, Tony Lindgren, Binbin Wu, David Matlack,
	Yamahata, Isaku, Nikolay Borisov, Kernel Mailing List, Linux,
	Yan Zhao, Gao, Chao, Yang, Weijiang

Il gio 6 mar 2025, 21:44 Sean Christopherson <seanjc@google.com> ha scritto:
> > Allowing the use of kvm_load_host_xsave_state() is really ugly, especially
> > since the corresponding code is so simple:
> >
> >         if (cpu_feature_enabled(X86_FEATURE_PKU) && vcpu->arch.pkru != 0)
> >                         wrpkru(vcpu->arch.host_pkru);
>
> It's clearly not "so simple", because this code is buggy.
>
> The justification for using kvm_load_host_xsave_state() is that either KVM gets
> the TDX state model correct and the existing flows Just Work, or we handle all
> that state as one-offs and at best replicate concepts and flows, and at worst
> have bugs that are unique to TDX, e.g. because we get the "simple" code wrong,
> we miss flows that subtly consume state, etc.

A typo doesn't change the fact that kvm_load_host_xsave_state is
optimized with knowledge of the guest CR0 and CR4; faking the values
so that the same field means both "exit value" and "guest value", just
so that the common code does the right thing for pkru/xcr0/xss, is
unmaintainable and conceptually just wrong.  And while the change for
XSS (and possibly other MSRs) is actually correct, it should be
justified for both SEV-ES/SNP and TDX rather than sneaked into the TDX
patches.

While there could be other flows that consume guest state, they're
just as bound to do the wrong thing if vcpu->arch is only guaranteed
to be somehow plausible (think anything that for whatever reason uses
cpu_role). There's no way the existing flows for
!guest_state_protected should run _at all_ when the register state is
not there. If they do, it's a bug and fixing them is the right thing
to do (it may feel like whack-a-mole but isn't). See for example the
KVM_CAP_SYNC_REGS calls that Adrian mentioned in the reply to 05/12,
and for which I'll send a patch too: I haven't checked if it's
possible, but I wonder if userspace could misuse sync_regs() to change
guest xcr0 and xss, and trick the TDX *host* into running with some
bits of xcr0 and xss unexpectedly cleared.

TDX provides well known values for the host for all three registers
that are being restored here, so there's no need to reuse code that is
meant to do something completely different.  I'm not singling out TDX,
the same would be true for SEV-ES/SNP swap type C.

Paolo

Paolo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-03-06 22:34         ` Paolo Bonzini
@ 2025-03-07 23:04           ` Sean Christopherson
  2025-03-10 19:08             ` Paolo Bonzini
  0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2025-03-07 23:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Xiaoyao Li, Adrian Hunter, kvm, Rick Edgecombe, Kai Huang,
	reinette.chatre, Tony Lindgren, Binbin Wu, David Matlack,
	Isaku Yamahata, Nikolay Borisov, linux-kernel, Yan Zhao, Chao Gao,
	Weijiang Yang

On Thu, Mar 06, 2025, Paolo Bonzini wrote:
> Il gio 6 mar 2025, 21:44 Sean Christopherson <seanjc@google.com> ha scritto:
> > > Allowing the use of kvm_load_host_xsave_state() is really ugly, especially
> > > since the corresponding code is so simple:
> > >
> > >         if (cpu_feature_enabled(X86_FEATURE_PKU) && vcpu->arch.pkru != 0)
> > >                         wrpkru(vcpu->arch.host_pkru);
> >
> > It's clearly not "so simple", because this code is buggy.
> >
> > The justification for using kvm_load_host_xsave_state() is that either KVM gets
> > the TDX state model correct and the existing flows Just Work, or we handle all
> > that state as one-offs and at best replicate concepts and flows, and at worst
> > have bugs that are unique to TDX, e.g. because we get the "simple" code wrong,
> > we miss flows that subtly consume state, etc.
> 
> A typo doesn't change the fact that kvm_load_host_xsave_state is
> optimized with knowledge of the guest CR0 and CR4; faking the values
> so that the same field means both "exit value" and "guest value",

I can't argue against that, but I still absolutely detest carrying dedicated code
for SEV and TDX state management.  It's bad enough that figuring out WTF actually
happens basically requires encyclopedic knowledge of massive specs.

I tried to figure out a way to share code, but everything I can come up with that
doesn't fake vCPU state makes the non-TDX code a mess.  :-(

> just so that the common code does the right thing for pkru/xcr0/xss,

FWIW, it's not just to that KVM does the right thing for those values, it's a
defense in depth mechanism so that *when*, not if, KVM screws up, the odds of the
bug being fatal to KVM and/or the guest are reduced.

> is > unmaintainable and conceptually just wrong. 

I don't necessarily disagree, but what we have today isn't maintainable either.
Without actual sanity check and safeguards in the low level helpers, we absolutely
are playing a game of whack-a-mole.

E.g. see commit 9b42d1e8e4fe ("KVM: x86: Play nice with protected guests in
complete_hypercall_exit()").

At a glance, kvm_hv_hypercall() is still broken, because is_protmode() will return
false incorrectly.

> And while the change for XSS (and possibly other MSRs) is actually correct,
> it should be justified for both SEV-ES/SNP and TDX rather than sneaked into
> the TDX patches.
> 
> While there could be other flows that consume guest state, they're
> just as bound to do the wrong thing if vcpu->arch is only guaranteed
> to be somehow plausible (think anything that for whatever reason uses
> cpu_role).

But the MMU code is *already* broken.  kvm_init_mmu() => vcpu_to_role_regs().  It
"works" because the fubar role is never truly consumed.  I'm sure there are more
examples.

> There's no way the existing flows for !guest_state_protected should run _at
> all_ when the register state is not there. If they do, it's a bug and fixing
> them is the right thing to do (it may feel like whack-a-mole but isn't)

Eh, it's still whack-a-mole, there just happen to be a finite number of moles :-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected
  2025-03-07 23:04           ` Sean Christopherson
@ 2025-03-10 19:08             ` Paolo Bonzini
  0 siblings, 0 replies; 39+ messages in thread
From: Paolo Bonzini @ 2025-03-10 19:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Adrian Hunter, kvm, Rick Edgecombe, Kai Huang,
	reinette.chatre, Tony Lindgren, Binbin Wu, David Matlack,
	Isaku Yamahata, Nikolay Borisov, linux-kernel, Yan Zhao, Chao Gao,
	Weijiang Yang

On Sat, Mar 8, 2025 at 12:04 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 06, 2025, Paolo Bonzini wrote:
> I still absolutely detest carrying dedicated code
> for SEV and TDX state management.  It's bad enough that figuring out WTF actually
> happens basically requires encyclopedic knowledge of massive specs.
>
> I tried to figure out a way to share code, but everything I can come up with that
> doesn't fake vCPU state makes the non-TDX code a mess.  :-(

The only thing worse is requiring encyclopedic knowledge of both the
specs and KVM. :)  And yeah, we do require some knowledge of parts of
KVM
that *shouldn't* matter for protected-state guests, but it shouldn't
be worse than needed.

There's different microcode/firmware for VMX/SVM/SEV-ES+/TDX, the
chance of sharing code is lower and lower as more stuff is added
there---as is the case
for SEV-ES/SNP and TDX. Which is why state management code for TDX is
anyway doing its own thing most of the time---there's no point in
sharing a little bit which is not even the hardest.

> > just so that the common code does the right thing for pkru/xcr0/xss,
>
> FWIW, it's not just to that KVM does the right thing for those values, it's a
> defense in depth mechanism so that *when*, not if, KVM screws up, the odds of the
> bug being fatal to KVM and/or the guest are reduced.

I would say the other way round is true too.  Not relying too much on
fake values in vcpu->arch can be more robust.

> Without actual sanity check and safeguards in the low level helpers, we absolutely
> are playing a game of whack-a-mole.
>
> E.g. see commit 9b42d1e8e4fe ("KVM: x86: Play nice with protected guests in
> complete_hypercall_exit()").
>
> At a glance, kvm_hv_hypercall() is still broken, because is_protmode() will return
> false incorrectly.

So the fixes are needed anyway and we're playing the game anyway. :(

> > And while the change for XSS (and possibly other MSRs) is actually correct,
> > it should be justified for both SEV-ES/SNP and TDX rather than sneaked into
> > the TDX patches.
> >
> > While there could be other flows that consume guest state, they're
> > just as bound to do the wrong thing if vcpu->arch is only guaranteed
> > to be somehow plausible (think anything that for whatever reason uses
> > cpu_role).
>
> But the MMU code is *already* broken.  kvm_init_mmu() => vcpu_to_role_regs().  It
> "works" because the fubar role is never truly consumed.  I'm sure there are more
> examples.

Yes, and there should be at least a WARN_ON_ONCE when it is accessed,
even if we don't completely cull the initialization of cpu_role...
Loading the XSAVE state isn't any different.

I'm okay with placing some values in cr0/cr4 or even xcr0/xss, but do
not wish to use them more than the absolute minimum necessary. And I
would rather not set more than the bare minimum needed in CR4... why
set CR4.PKE for example, if KVM anyway has no business using the guest
PKRU.

Paolo

> > There's no way the existing flows for !guest_state_protected should run _at
> > all_ when the register state is not there. If they do, it's a bug and fixing
> > them is the right thing to do (it may feel like whack-a-mole but isn't)
>
> Eh, it's still whack-a-mole, there just happen to be a finite number of moles :-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 03/12] KVM: TDX: Set arch.has_protected_state to true
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
  2025-01-29  9:58 ` [PATCH V2 01/12] x86/virt/tdx: Make tdh_vp_enter() noinstr Adrian Hunter
  2025-01-29  9:58 ` [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-02-20 12:35   ` Xiaoyao Li
  2025-01-29  9:58 ` [PATCH V2 04/12] KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct Adrian Hunter
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

TDX VMs have protected state. Accordingly, set arch.has_protected_state to
true.

This will cause the following IOCTL functions to return an error:

	kvm_arch_vcpu_ioctl()	case KVM_GET_SREGS2
	kvm_arch_vcpu_ioctl()	case KVM_SET_SREGS2
	kvm_arch_vcpu_ioctl_get_regs()
	kvm_arch_vcpu_ioctl_set_regs()
	kvm_arch_vcpu_ioctl_get_sregs()
	kvm_arch_vcpu_ioctl_set_sregs()
	kvm_vcpu_ioctl_x86_get_debugregs()
	kvm_vcpu_ioctl_x86_set_debugregs
	kvm_vcpu_ioctl_x86_get_xcrs()
	kvm_vcpu_ioctl_x86_set_xcrs()

In addition, the following will error for confidential FPU state:

	kvm_vcpu_ioctl_x86_get_xsave ()
	kvm_vcpu_ioctl_x86_get_xsave2()
	kvm_vcpu_ioctl_x86_set_xsave()
	kvm_arch_vcpu_ioctl_get_fpu()
	kvm_arch_vcpu_ioctl_set_fpu()

And finally, in accordance with commit 66155de93bcf ("KVM: x86: Disallow
read-only memslots for SEV-ES and SEV-SNP (and TDX)"), read-only
memslots will be disallowed.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v2:
 - New patch
---
 arch/x86/kvm/vmx/tdx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ea9498028212..a7ebdafdfd82 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -553,6 +553,7 @@ int tdx_vm_init(struct kvm *kvm)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
+	kvm->arch.has_protected_state = true;
 	kvm->arch.has_private_mem = true;
 
 	/*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 03/12] KVM: TDX: Set arch.has_protected_state to true
  2025-01-29  9:58 ` [PATCH V2 03/12] KVM: TDX: Set arch.has_protected_state to true Adrian Hunter
@ 2025-02-20 12:35   ` Xiaoyao Li
  2025-02-27 14:17     ` Adrian Hunter
  0 siblings, 1 reply; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-20 12:35 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 1/29/2025 5:58 PM, Adrian Hunter wrote:
> TDX VMs have protected state. Accordingly, set arch.has_protected_state to
> true.
> 
> This will cause the following IOCTL functions to return an error:
> 
> 	kvm_arch_vcpu_ioctl()	case KVM_GET_SREGS2
> 	kvm_arch_vcpu_ioctl()	case KVM_SET_SREGS2
> 	kvm_arch_vcpu_ioctl_get_regs()
> 	kvm_arch_vcpu_ioctl_set_regs()
> 	kvm_arch_vcpu_ioctl_get_sregs()
> 	kvm_arch_vcpu_ioctl_set_sregs()
> 	kvm_vcpu_ioctl_x86_get_debugregs()
> 	kvm_vcpu_ioctl_x86_set_debugregs
> 	kvm_vcpu_ioctl_x86_get_xcrs()
> 	kvm_vcpu_ioctl_x86_set_xcrs()
> 
> In addition, the following will error for confidential FPU state:
> 
> 	kvm_vcpu_ioctl_x86_get_xsave ()
> 	kvm_vcpu_ioctl_x86_get_xsave2()
> 	kvm_vcpu_ioctl_x86_set_xsave()
> 	kvm_arch_vcpu_ioctl_get_fpu()
> 	kvm_arch_vcpu_ioctl_set_fpu()
> 
> And finally, in accordance with commit 66155de93bcf ("KVM: x86: Disallow
> read-only memslots for SEV-ES and SEV-SNP (and TDX)"), read-only
> memslots will be disallowed.
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
> TD vcpu enter/exit v2:
>   - New patch
> ---
>   arch/x86/kvm/vmx/tdx.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ea9498028212..a7ebdafdfd82 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -553,6 +553,7 @@ int tdx_vm_init(struct kvm *kvm)
>   {
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>   
> +	kvm->arch.has_protected_state = true;

This can be squashed into the one that implements the tdx_vm_init();

>   	kvm->arch.has_private_mem = true;
>   
>   	/*


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 03/12] KVM: TDX: Set arch.has_protected_state to true
  2025-02-20 12:35   ` Xiaoyao Li
@ 2025-02-27 14:17     ` Adrian Hunter
  0 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-02-27 14:17 UTC (permalink / raw)
  To: Xiaoyao Li, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 20/02/25 14:35, Xiaoyao Li wrote:
> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>> TDX VMs have protected state. Accordingly, set arch.has_protected_state to
>> true.
>>
>> This will cause the following IOCTL functions to return an error:
>>
>>     kvm_arch_vcpu_ioctl()    case KVM_GET_SREGS2
>>     kvm_arch_vcpu_ioctl()    case KVM_SET_SREGS2
>>     kvm_arch_vcpu_ioctl_get_regs()
>>     kvm_arch_vcpu_ioctl_set_regs()
>>     kvm_arch_vcpu_ioctl_get_sregs()
>>     kvm_arch_vcpu_ioctl_set_sregs()
>>     kvm_vcpu_ioctl_x86_get_debugregs()
>>     kvm_vcpu_ioctl_x86_set_debugregs
>>     kvm_vcpu_ioctl_x86_get_xcrs()
>>     kvm_vcpu_ioctl_x86_set_xcrs()
>>
>> In addition, the following will error for confidential FPU state:
>>
>>     kvm_vcpu_ioctl_x86_get_xsave ()
>>     kvm_vcpu_ioctl_x86_get_xsave2()
>>     kvm_vcpu_ioctl_x86_set_xsave()
>>     kvm_arch_vcpu_ioctl_get_fpu()
>>     kvm_arch_vcpu_ioctl_set_fpu()
>>
>> And finally, in accordance with commit 66155de93bcf ("KVM: x86: Disallow
>> read-only memslots for SEV-ES and SEV-SNP (and TDX)"), read-only
>> memslots will be disallowed.
>>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>> TD vcpu enter/exit v2:
>>   - New patch
>> ---
>>   arch/x86/kvm/vmx/tdx.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index ea9498028212..a7ebdafdfd82 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -553,6 +553,7 @@ int tdx_vm_init(struct kvm *kvm)
>>   {
>>       struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>>   +    kvm->arch.has_protected_state = true;
> 
> This can be squashed into the one that implements the tdx_vm_init();

This has been done in kvm-coco-queue.
We have re-based on kvm-coco-queue so we in-sync on this.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 04/12] KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (2 preceding siblings ...)
  2025-01-29  9:58 ` [PATCH V2 03/12] KVM: TDX: Set arch.has_protected_state to true Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-01-29  9:58 ` [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path Adrian Hunter
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

From: Binbin Wu <binbin.wu@linux.intel.com>

Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct
vcpu_vt, to share the code between VMX/TDX as much as possible and to make
TDX exit handling more VMX like.

No functional change intended.

[Adrian: move code that depends on struct vcpu_vmx back to vmx.h]

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v2:
 - New patch
---
 arch/x86/kvm/vmx/common.h      |  68 +++++++++++++++++++++
 arch/x86/kvm/vmx/main.c        |   4 ++
 arch/x86/kvm/vmx/nested.c      |  10 ++--
 arch/x86/kvm/vmx/posted_intr.c |  18 +++---
 arch/x86/kvm/vmx/tdx.h         |  16 +----
 arch/x86/kvm/vmx/vmx.c         |  99 +++++++++++++++----------------
 arch/x86/kvm/vmx/vmx.h         | 104 +++++++++++++--------------------
 7 files changed, 178 insertions(+), 141 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 7a592467a044..9d4982694f06 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -6,6 +6,74 @@
 
 #include "mmu.h"
 
+union vmx_exit_reason {
+	struct {
+		u32	basic			: 16;
+		u32	reserved16		: 1;
+		u32	reserved17		: 1;
+		u32	reserved18		: 1;
+		u32	reserved19		: 1;
+		u32	reserved20		: 1;
+		u32	reserved21		: 1;
+		u32	reserved22		: 1;
+		u32	reserved23		: 1;
+		u32	reserved24		: 1;
+		u32	reserved25		: 1;
+		u32	bus_lock_detected	: 1;
+		u32	enclave_mode		: 1;
+		u32	smi_pending_mtf		: 1;
+		u32	smi_from_vmx_root	: 1;
+		u32	reserved30		: 1;
+		u32	failed_vmentry		: 1;
+	};
+	u32 full;
+};
+
+struct vcpu_vt {
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+
+	/* Used if this vCPU is waiting for PI notification wakeup. */
+	struct list_head pi_wakeup_list;
+
+	union vmx_exit_reason exit_reason;
+
+	unsigned long	exit_qualification;
+	u32		exit_intr_info;
+
+	/*
+	 * If true, guest state has been loaded into hardware, and host state
+	 * saved into vcpu_{vt,vmx,tdx}.  If false, host state is loaded into
+	 * hardware.
+	 */
+	bool		guest_state_loaded;
+
+#ifdef CONFIG_X86_64
+	u64		msr_host_kernel_gs_base;
+#endif
+
+	unsigned long	host_debugctlmsr;
+};
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+static __always_inline bool is_td(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static __always_inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+	return is_td(vcpu->kvm);
+}
+
+#else
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+
+#endif
+
 static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
 {
 	/* For TDX the direct mask is the shared mask. */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 0ff7394f8466..1cc1c06461f2 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -10,6 +10,10 @@
 #include "tdx.h"
 #include "tdx_arch.h"
 
+#ifdef CONFIG_INTEL_TDX_HOST
+static_assert(offsetof(struct vcpu_vmx, vt) == offsetof(struct vcpu_tdx, vt));
+#endif
+
 static void vt_disable_virtualization_cpu(void)
 {
 	/* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 8a7af02d466e..3add9f1073ff 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -275,7 +275,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
 {
 	struct vmcs_host_state *dest, *src;
 
-	if (unlikely(!vmx->guest_state_loaded))
+	if (unlikely(!vmx->vt.guest_state_loaded))
 		return;
 
 	src = &prev->host_state;
@@ -425,7 +425,7 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
 		 * tables also changed, but KVM should not treat EPT Misconfig
 		 * VM-Exits as writes.
 		 */
-		WARN_ON_ONCE(vmx->exit_reason.basic != EXIT_REASON_EPT_VIOLATION);
+		WARN_ON_ONCE(vmx->vt.exit_reason.basic != EXIT_REASON_EPT_VIOLATION);
 
 		/*
 		 * PML Full and EPT Violation VM-Exits both use bit 12 to report
@@ -4622,7 +4622,7 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 {
 	/* update exit information fields: */
 	vmcs12->vm_exit_reason = vm_exit_reason;
-	if (to_vmx(vcpu)->exit_reason.enclave_mode)
+	if (vmx_get_exit_reason(vcpu).enclave_mode)
 		vmcs12->vm_exit_reason |= VMX_EXIT_REASONS_SGX_ENCLAVE_MODE;
 	vmcs12->exit_qualification = exit_qualification;
 
@@ -6115,7 +6115,7 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
 	 * nested VM-Exit.  Pass the original exit reason, i.e. don't hardcode
 	 * EXIT_REASON_VMFUNC as the exit reason.
 	 */
-	nested_vmx_vmexit(vcpu, vmx->exit_reason.full,
+	nested_vmx_vmexit(vcpu, vmx->vt.exit_reason.full,
 			  vmx_get_intr_info(vcpu),
 			  vmx_get_exit_qual(vcpu));
 	return 1;
@@ -6560,7 +6560,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
 bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	union vmx_exit_reason exit_reason = vmx->exit_reason;
+	union vmx_exit_reason exit_reason = vmx->vt.exit_reason;
 	unsigned long exit_qual;
 	u32 exit_intr_info;
 
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index ec08fa3caf43..5696e0f9f924 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -33,7 +33,7 @@ static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
 
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
-	return &(to_vmx(vcpu)->pi_desc);
+	return &(to_vt(vcpu)->pi_desc);
 }
 
 static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
@@ -53,7 +53,7 @@ static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vcpu_vt *vt = to_vt(vcpu);
 	struct pi_desc old, new;
 	unsigned long flags;
 	unsigned int dest;
@@ -90,7 +90,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	 */
 	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
 		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
-		list_del(&vmx->pi_wakeup_list);
+		list_del(&vt->pi_wakeup_list);
 		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 	}
 
@@ -146,14 +146,14 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
 static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vcpu_vt *vt = to_vt(vcpu);
 	struct pi_desc old, new;
 	unsigned long flags;
 
 	local_irq_save(flags);
 
 	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
-	list_add_tail(&vmx->pi_wakeup_list,
+	list_add_tail(&vt->pi_wakeup_list,
 		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
 	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 
@@ -220,13 +220,13 @@ void pi_wakeup_handler(void)
 	int cpu = smp_processor_id();
 	struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
 	raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
-	struct vcpu_vmx *vmx;
+	struct vcpu_vt *vt;
 
 	raw_spin_lock(spinlock);
-	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
+	list_for_each_entry(vt, wakeup_list, pi_wakeup_list) {
 
-		if (pi_test_on(&vmx->pi_desc))
-			kvm_vcpu_wake_up(&vmx->vcpu);
+		if (pi_test_on(&vt->pi_desc))
+			kvm_vcpu_wake_up(vt_to_vcpu(vt));
 	}
 	raw_spin_unlock(spinlock);
 }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 3904479892f3..ba880dae547f 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -6,6 +6,8 @@
 #include "tdx_errno.h"
 
 #ifdef CONFIG_INTEL_TDX_HOST
+#include "common.h"
+
 int tdx_bringup(void);
 void tdx_cleanup(void);
 
@@ -43,6 +45,7 @@ enum vcpu_tdx_state {
 
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
+	struct vcpu_vt vt;
 
 	struct tdx_vp vp;
 
@@ -55,16 +58,6 @@ void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
 void tdh_vp_wr_failed(struct vcpu_tdx *tdx, char *uclass, char *op, u32 field,
 		      u64 val, u64 err);
 
-static inline bool is_td(struct kvm *kvm)
-{
-	return kvm->arch.vm_type == KVM_X86_TDX_VM;
-}
-
-static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
-{
-	return is_td(vcpu->kvm);
-}
-
 static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
 {
 	u64 err, data;
@@ -174,9 +167,6 @@ struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 };
 
-static inline bool is_td(struct kvm *kvm) { return false; }
-static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
-
 #endif
 
 #endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e22df2b1e887..5475abb11533 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1282,6 +1282,7 @@ void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
 void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vcpu_vt *vt = to_vt(vcpu);
 	struct vmcs_host_state *host_state;
 #ifdef CONFIG_X86_64
 	int cpu = raw_smp_processor_id();
@@ -1310,7 +1311,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	if (vmx->nested.need_vmcs12_to_shadow_sync)
 		nested_sync_vmcs12_to_shadow(vcpu);
 
-	if (vmx->guest_state_loaded)
+	if (vt->guest_state_loaded)
 		return;
 
 	host_state = &vmx->loaded_vmcs->host_state;
@@ -1331,12 +1332,12 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 		fs_sel = current->thread.fsindex;
 		gs_sel = current->thread.gsindex;
 		fs_base = current->thread.fsbase;
-		vmx->msr_host_kernel_gs_base = current->thread.gsbase;
+		vt->msr_host_kernel_gs_base = current->thread.gsbase;
 	} else {
 		savesegment(fs, fs_sel);
 		savesegment(gs, gs_sel);
 		fs_base = read_msr(MSR_FS_BASE);
-		vmx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+		vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
 	}
 
 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
@@ -1348,14 +1349,14 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 #endif
 
 	vmx_set_host_fs_gs(host_state, fs_sel, gs_sel, fs_base, gs_base);
-	vmx->guest_state_loaded = true;
+	vt->guest_state_loaded = true;
 }
 
 static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
 {
 	struct vmcs_host_state *host_state;
 
-	if (!vmx->guest_state_loaded)
+	if (!vmx->vt.guest_state_loaded)
 		return;
 
 	host_state = &vmx->loaded_vmcs->host_state;
@@ -1383,10 +1384,10 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
 #endif
 	invalidate_tss_limit();
 #ifdef CONFIG_X86_64
-	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
+	wrmsrl(MSR_KERNEL_GS_BASE, vmx->vt.msr_host_kernel_gs_base);
 #endif
 	load_fixmap_gdt(raw_smp_processor_id());
-	vmx->guest_state_loaded = false;
+	vmx->vt.guest_state_loaded = false;
 	vmx->guest_uret_msrs_loaded = false;
 }
 
@@ -1394,7 +1395,7 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
 static u64 vmx_read_guest_kernel_gs_base(struct vcpu_vmx *vmx)
 {
 	preempt_disable();
-	if (vmx->guest_state_loaded)
+	if (vmx->vt.guest_state_loaded)
 		rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
 	preempt_enable();
 	return vmx->msr_guest_kernel_gs_base;
@@ -1403,7 +1404,7 @@ static u64 vmx_read_guest_kernel_gs_base(struct vcpu_vmx *vmx)
 static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
 {
 	preempt_disable();
-	if (vmx->guest_state_loaded)
+	if (vmx->vt.guest_state_loaded)
 		wrmsrl(MSR_KERNEL_GS_BASE, data);
 	preempt_enable();
 	vmx->msr_guest_kernel_gs_base = data;
@@ -1524,7 +1525,7 @@ void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	vmx_vcpu_pi_load(vcpu, cpu);
 
-	vmx->host_debugctlmsr = get_debugctlmsr();
+	vmx->vt.host_debugctlmsr = get_debugctlmsr();
 }
 
 void vmx_vcpu_put(struct kvm_vcpu *vcpu)
@@ -1703,7 +1704,7 @@ int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
 	 * so that guest userspace can't DoS the guest simply by triggering
 	 * emulation (enclaves are CPL3 only).
 	 */
-	if (to_vmx(vcpu)->exit_reason.enclave_mode) {
+	if (vmx_get_exit_reason(vcpu).enclave_mode) {
 		kvm_queue_exception(vcpu, UD_VECTOR);
 		return X86EMUL_PROPAGATE_FAULT;
 	}
@@ -1718,7 +1719,7 @@ int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
 
 static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
-	union vmx_exit_reason exit_reason = to_vmx(vcpu)->exit_reason;
+	union vmx_exit_reason exit_reason = vmx_get_exit_reason(vcpu);
 	unsigned long rip, orig_rip;
 	u32 instr_len;
 
@@ -4277,7 +4278,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
  */
 static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 {
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vcpu_vt *vt = to_vt(vcpu);
 	int r;
 
 	r = vmx_deliver_nested_posted_interrupt(vcpu, vector);
@@ -4288,11 +4289,11 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	if (!vcpu->arch.apic->apicv_active)
 		return -1;
 
-	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
+	if (pi_test_and_set_pir(vector, &vt->pi_desc))
 		return 0;
 
 	/* If a previous notification has sent the IPI, nothing to do.  */
-	if (pi_test_and_set_on(&vmx->pi_desc))
+	if (pi_test_and_set_on(&vt->pi_desc))
 		return 0;
 
 	/*
@@ -4768,7 +4769,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		vmcs_write16(GUEST_INTR_STATUS, 0);
 
 		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
-		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->vt.pi_desc)));
 	}
 
 	if (vmx_can_use_ipiv(&vmx->vcpu)) {
@@ -4881,8 +4882,8 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	 * Enforce invariant: pi_desc.nv is always either POSTED_INTR_VECTOR
 	 * or POSTED_INTR_WAKEUP_VECTOR.
 	 */
-	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
-	__pi_set_sn(&vmx->pi_desc);
+	vmx->vt.pi_desc.nv = POSTED_INTR_VECTOR;
+	__pi_set_sn(&vmx->vt.pi_desc);
 }
 
 void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -6062,7 +6063,7 @@ static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
 	 * VM-Exits. Unconditionally set the flag here and leave the handling to
 	 * vmx_handle_exit().
 	 */
-	to_vmx(vcpu)->exit_reason.bus_lock_detected = true;
+	to_vt(vcpu)->exit_reason.bus_lock_detected = true;
 	return 1;
 }
 
@@ -6160,9 +6161,9 @@ void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	*reason = vmx->exit_reason.full;
+	*reason = vmx->vt.exit_reason.full;
 	*info1 = vmx_get_exit_qual(vcpu);
-	if (!(vmx->exit_reason.failed_vmentry)) {
+	if (!(vmx->vt.exit_reason.failed_vmentry)) {
 		*info2 = vmx->idt_vectoring_info;
 		*intr_info = vmx_get_intr_info(vcpu);
 		if (is_exception_with_error_code(*intr_info))
@@ -6458,7 +6459,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	union vmx_exit_reason exit_reason = vmx->exit_reason;
+	union vmx_exit_reason exit_reason = vmx_get_exit_reason(vcpu);
 	u32 vectoring_info = vmx->idt_vectoring_info;
 	u16 exit_handler_index;
 
@@ -6624,7 +6625,7 @@ int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	 * Exit to user space when bus lock detected to inform that there is
 	 * a bus lock in guest.
 	 */
-	if (to_vmx(vcpu)->exit_reason.bus_lock_detected) {
+	if (vmx_get_exit_reason(vcpu).bus_lock_detected) {
 		if (ret > 0)
 			vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
 
@@ -6903,22 +6904,22 @@ static void vmx_set_rvi(int vector)
 
 int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vcpu_vt *vt = to_vt(vcpu);
 	int max_irr;
 	bool got_posted_interrupt;
 
 	if (KVM_BUG_ON(!enable_apicv, vcpu->kvm))
 		return -EIO;
 
-	if (pi_test_on(&vmx->pi_desc)) {
-		pi_clear_on(&vmx->pi_desc);
+	if (pi_test_on(&vt->pi_desc)) {
+		pi_clear_on(&vt->pi_desc);
 		/*
 		 * IOMMU can write to PID.ON, so the barrier matters even on UP.
 		 * But on x86 this is just a compiler barrier anyway.
 		 */
 		smp_mb__after_atomic();
 		got_posted_interrupt =
-			kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
+			kvm_apic_update_irr(vcpu, vt->pi_desc.pir, &max_irr);
 	} else {
 		max_irr = kvm_lapic_find_highest_irr(vcpu);
 		got_posted_interrupt = false;
@@ -6960,10 +6961,10 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 
 void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
 {
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vcpu_vt *vt = to_vt(vcpu);
 
-	pi_clear_on(&vmx->pi_desc);
-	memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
+	pi_clear_on(&vt->pi_desc);
+	memset(vt->pi_desc.pir, 0, sizeof(vt->pi_desc.pir));
 }
 
 void vmx_do_interrupt_irqoff(unsigned long entry);
@@ -7028,9 +7029,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 	if (vmx->emulation_required)
 		return;
 
-	if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
+	if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXTERNAL_INTERRUPT)
 		handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
-	else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
+	else if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXCEPTION_NMI)
 		handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
 }
 
@@ -7261,10 +7262,10 @@ static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu,
 	 * the fastpath even, all other exits must use the slow path.
 	 */
 	if (is_guest_mode(vcpu) &&
-	    to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_PREEMPTION_TIMER)
+	    vmx_get_exit_reason(vcpu).basic != EXIT_REASON_PREEMPTION_TIMER)
 		return EXIT_FASTPATH_NONE;
 
-	switch (to_vmx(vcpu)->exit_reason.basic) {
+	switch (vmx_get_exit_reason(vcpu).basic) {
 	case EXIT_REASON_MSR_WRITE:
 		return handle_fastpath_set_msr_irqoff(vcpu);
 	case EXIT_REASON_PREEMPTION_TIMER:
@@ -7311,15 +7312,15 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	vmx_enable_fb_clear(vmx);
 
 	if (unlikely(vmx->fail)) {
-		vmx->exit_reason.full = 0xdead;
+		vmx->vt.exit_reason.full = 0xdead;
 		goto out;
 	}
 
-	vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON);
-	if (likely(!vmx->exit_reason.failed_vmentry))
+	vmx->vt.exit_reason.full = vmcs_read32(VM_EXIT_REASON);
+	if (likely(!vmx_get_exit_reason(vcpu).failed_vmentry))
 		vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
-	if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
+	if ((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXCEPTION_NMI &&
 	    is_nmi(vmx_get_intr_info(vcpu))) {
 		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
 		if (cpu_feature_enabled(X86_FEATURE_FRED))
@@ -7351,12 +7352,12 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	if (unlikely(vmx->emulation_required)) {
 		vmx->fail = 0;
 
-		vmx->exit_reason.full = EXIT_REASON_INVALID_STATE;
-		vmx->exit_reason.failed_vmentry = 1;
+		vmx->vt.exit_reason.full = EXIT_REASON_INVALID_STATE;
+		vmx->vt.exit_reason.failed_vmentry = 1;
 		kvm_register_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_1);
-		vmx->exit_qualification = ENTRY_FAIL_DEFAULT;
+		vmx->vt.exit_qualification = ENTRY_FAIL_DEFAULT;
 		kvm_register_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_2);
-		vmx->exit_intr_info = 0;
+		vmx->vt.exit_intr_info = 0;
 		return EXIT_FASTPATH_NONE;
 	}
 
@@ -7437,8 +7438,8 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	}
 
 	/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
-	if (vmx->host_debugctlmsr)
-		update_debugctlmsr(vmx->host_debugctlmsr);
+	if (vmx->vt.host_debugctlmsr)
+		update_debugctlmsr(vmx->vt.host_debugctlmsr);
 
 #ifndef CONFIG_X86_64
 	/*
@@ -7463,7 +7464,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 		 * checking.
 		 */
 		if (vmx->nested.nested_run_pending &&
-		    !vmx->exit_reason.failed_vmentry)
+		    !vmx_get_exit_reason(vcpu).failed_vmentry)
 			++vcpu->stat.nested_run;
 
 		vmx->nested.nested_run_pending = 0;
@@ -7472,12 +7473,12 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	if (unlikely(vmx->fail))
 		return EXIT_FASTPATH_NONE;
 
-	if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY))
+	if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
 		kvm_machine_check();
 
 	trace_kvm_exit(vcpu, KVM_ISA_VMX);
 
-	if (unlikely(vmx->exit_reason.failed_vmentry))
+	if (unlikely(vmx_get_exit_reason(vcpu).failed_vmentry))
 		return EXIT_FASTPATH_NONE;
 
 	vmx->loaded_vmcs->launched = 1;
@@ -7509,7 +7510,7 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 	BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
 	vmx = to_vmx(vcpu);
 
-	INIT_LIST_HEAD(&vmx->pi_wakeup_list);
+	INIT_LIST_HEAD(&vmx->vt.pi_wakeup_list);
 
 	err = -ENOMEM;
 
@@ -7607,7 +7608,7 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 
 	if (vmx_can_use_ipiv(vcpu))
 		WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id],
-			   __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
+			   __pa(&vmx->vt.pi_desc) | PID_TABLE_ENTRY_VALID);
 
 	return 0;
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index a58b940f0634..e635199901e2 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -17,6 +17,7 @@
 #include "../cpuid.h"
 #include "run_flags.h"
 #include "../mmu.h"
+#include "common.h"
 
 #define X2APIC_MSR(r) (APIC_BASE_MSR + ((r) >> 4))
 
@@ -68,29 +69,6 @@ struct pt_desc {
 	struct pt_ctx guest;
 };
 
-union vmx_exit_reason {
-	struct {
-		u32	basic			: 16;
-		u32	reserved16		: 1;
-		u32	reserved17		: 1;
-		u32	reserved18		: 1;
-		u32	reserved19		: 1;
-		u32	reserved20		: 1;
-		u32	reserved21		: 1;
-		u32	reserved22		: 1;
-		u32	reserved23		: 1;
-		u32	reserved24		: 1;
-		u32	reserved25		: 1;
-		u32	bus_lock_detected	: 1;
-		u32	enclave_mode		: 1;
-		u32	smi_pending_mtf		: 1;
-		u32	smi_from_vmx_root	: 1;
-		u32	reserved30		: 1;
-		u32	failed_vmentry		: 1;
-	};
-	u32 full;
-};
-
 /*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
@@ -231,20 +209,10 @@ struct nested_vmx {
 
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
+	struct vcpu_vt	      vt;
 	u8                    fail;
 	u8		      x2apic_msr_bitmap_mode;
 
-	/*
-	 * If true, host state has been stored in vmx->loaded_vmcs for
-	 * the CPU registers that only need to be switched when transitioning
-	 * to/from the kernel, and the registers have been loaded with guest
-	 * values.  If false, host state is loaded in the CPU registers
-	 * and vmx->loaded_vmcs->host_state is invalid.
-	 */
-	bool		      guest_state_loaded;
-
-	unsigned long         exit_qualification;
-	u32                   exit_intr_info;
 	u32                   idt_vectoring_info;
 	ulong                 rflags;
 
@@ -257,7 +225,6 @@ struct vcpu_vmx {
 	struct vmx_uret_msr   guest_uret_msrs[MAX_NR_USER_RETURN_MSRS];
 	bool                  guest_uret_msrs_loaded;
 #ifdef CONFIG_X86_64
-	u64		      msr_host_kernel_gs_base;
 	u64		      msr_guest_kernel_gs_base;
 #endif
 
@@ -298,14 +265,6 @@ struct vcpu_vmx {
 	int vpid;
 	bool emulation_required;
 
-	union vmx_exit_reason exit_reason;
-
-	/* Posted interrupt descriptor */
-	struct pi_desc pi_desc;
-
-	/* Used if this vCPU is waiting for PI notification wakeup. */
-	struct list_head pi_wakeup_list;
-
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
 
@@ -323,8 +282,6 @@ struct vcpu_vmx {
 	/* apic deadline value in host tsc */
 	u64 hv_deadline_tsc;
 
-	unsigned long host_debugctlmsr;
-
 	/*
 	 * Only bits masked by msr_ia32_feature_control_valid_bits can be set in
 	 * msr_ia32_feature_control. FEAT_CTL_LOCKED is always included
@@ -361,6 +318,43 @@ struct kvm_vmx {
 	u64 *pid_table;
 };
 
+static __always_inline struct vcpu_vt *to_vt(struct kvm_vcpu *vcpu)
+{
+	return &(container_of(vcpu, struct vcpu_vmx, vcpu)->vt);
+}
+
+static __always_inline struct kvm_vcpu *vt_to_vcpu(struct vcpu_vt *vt)
+{
+	return &(container_of(vt, struct vcpu_vmx, vt)->vcpu);
+}
+
+static __always_inline union vmx_exit_reason vmx_get_exit_reason(struct kvm_vcpu *vcpu)
+{
+	return to_vt(vcpu)->exit_reason;
+}
+
+static __always_inline unsigned long vmx_get_exit_qual(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vt *vt = to_vt(vcpu);
+
+	if (!kvm_register_test_and_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_1) &&
+	    !WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		vt->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	return vt->exit_qualification;
+}
+
+static __always_inline u32 vmx_get_intr_info(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vt *vt = to_vt(vcpu);
+
+	if (!kvm_register_test_and_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_2) &&
+	    !WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		vt->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+
+	return vt->exit_intr_info;
+}
+
 void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
 			struct loaded_vmcs *buddy);
 int allocate_vpid(void);
@@ -651,26 +645,6 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
 int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
 void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu);
 
-static __always_inline unsigned long vmx_get_exit_qual(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-
-	if (!kvm_register_test_and_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_1))
-		vmx->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
-
-	return vmx->exit_qualification;
-}
-
-static __always_inline u32 vmx_get_intr_info(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-
-	if (!kvm_register_test_and_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_2))
-		vmx->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
-
-	return vmx->exit_intr_info;
-}
-
 struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu, gfp_t flags);
 void free_vmcs(struct vmcs *vmcs);
 int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (3 preceding siblings ...)
  2025-01-29  9:58 ` [PATCH V2 04/12] KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-02-20 13:16   ` Xiaoyao Li
  2025-01-29  9:58 ` [PATCH V2 06/12] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) Adrian Hunter
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement callbacks to enter/exit a TDX VCPU by calling tdh_vp_enter().
Ensure the TDX VCPU is in a correct state to run.

Do not pass arguments from/to vcpu->arch.regs[] unconditionally. Instead,
marshall state to/from the appropriate x86 registers only when needed,
i.e., to handle some TDVMCALL sub-leaves following KVM's ABI to leverage
the existing code.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v2:
- Move VCPU_TD_STATE_INITIALIZED check to tdx_vcpu_pre_run() (Xiaoyao)
- Check TD_STATE_RUNNABLE also in tdx_vcpu_pre_run() (Yan)
- Add back 'noinstr' for tdx_vcpu_enter_exit() (Sean)
- Add WARN_ON_ONCE if force_immediate_exit (Sean)
- Add vp_enter_args to vcpu_tdx to store the input/output arguments for
  tdh_vp_enter().
- Don't copy arguments to/from vcpu->arch.regs[] unconditionally. (Sean)

TD vcpu enter/exit v1:
- Make argument of tdx_vcpu_enter_exit() struct kvm_vcpu.
- Update for the wrapper functions for SEAMCALLs. (Sean)
- Remove noinstr (Sean)
- Add a missing comma, clarify sched_in part, and update changelog to
  match code by dropping the PMU related paragraph (Binbin)
  https://lore.kernel.org/lkml/c0029d4d-3dee-4f11-a929-d64d2651bfb3@linux.intel.com/
- Remove the union tdx_exit_reason. (Sean)
  https://lore.kernel.org/kvm/ZfSExlemFMKjBtZb@google.com/
- Remove the code of special handling of vcpu->kvm->vm_bugged (Rick)
  https://lore.kernel.org/kvm/20240318234010.GD1645738@ls.amr.corp.intel.com/
- For !tdx->initialized case, set tdx->vp_enter_ret to TDX_SW_ERROR to avoid
  collision with EXIT_REASON_EXCEPTION_NMI.

v19:
- Removed export_symbol_gpl(host_xcr0) to the patch that uses it

Changes v15 -> v16:
- use __seamcall_saved_ret()
- As struct tdx_module_args doesn't match with vcpu.arch.regs, copy regs
  before/after calling __seamcall_saved_ret().
---
 arch/x86/kvm/vmx/main.c    | 20 ++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 47 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     |  3 +++
 arch/x86/kvm/vmx/x86_ops.h |  7 ++++++
 4 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 1cc1c06461f2..301c1a26606f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -133,6 +133,22 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vmx_vcpu_load(vcpu, cpu);
 }
 
+static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_pre_run(vcpu);
+
+	return vmx_vcpu_pre_run(vcpu);
+}
+
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_run(vcpu, force_immediate_exit);
+
+	return vmx_vcpu_run(vcpu, force_immediate_exit);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu)) {
@@ -272,8 +288,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.flush_tlb_gva = vt_flush_tlb_gva,
 	.flush_tlb_guest = vt_flush_tlb_guest,
 
-	.vcpu_pre_run = vmx_vcpu_pre_run,
-	.vcpu_run = vmx_vcpu_run,
+	.vcpu_pre_run = vt_vcpu_pre_run,
+	.vcpu_run = vt_vcpu_run,
 	.handle_exit = vmx_handle_exit,
 	.skip_emulated_instruction = vmx_skip_emulated_instruction,
 	.update_emulated_instruction = vmx_update_emulated_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a7ebdafdfd82..95420ffd0022 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -11,6 +11,8 @@
 #include "vmx.h"
 #include "mmu/spte.h"
 #include "common.h"
+#include <trace/events/kvm.h>
+#include "trace.h"
 
 #pragma GCC poison to_vmx
 
@@ -673,6 +675,51 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 	tdx->state = VCPU_TD_STATE_UNINITIALIZED;
 }
 
+int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
+{
+	if (unlikely(to_tdx(vcpu)->state != VCPU_TD_STATE_INITIALIZED ||
+		     to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
+		return -EINVAL;
+
+	return 1;
+}
+
+static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	guest_state_enter_irqoff();
+
+	tdx->vp_enter_ret = tdh_vp_enter(&tdx->vp, &tdx->vp_enter_args);
+
+	guest_state_exit_irqoff();
+}
+
+#define TDX_REGS_UNSUPPORTED_SET	(BIT(VCPU_EXREG_RFLAGS) |	\
+					 BIT(VCPU_EXREG_SEGMENTS))
+
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+{
+	/*
+	 * force_immediate_exit requires vCPU entering for events injection with
+	 * an immediately exit followed. But The TDX module doesn't guarantee
+	 * entry, it's already possible for KVM to _think_ it completely entry
+	 * to the guest without actually having done so.
+	 * Since KVM never needs to force an immediate exit for TDX, and can't
+	 * do direct injection, just warn on force_immediate_exit.
+	 */
+	WARN_ON_ONCE(force_immediate_exit);
+
+	trace_kvm_entry(vcpu, force_immediate_exit);
+
+	tdx_vcpu_enter_exit(vcpu);
+
+	vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
+
+	trace_kvm_exit(vcpu, KVM_ISA_VMX);
+
+	return EXIT_FASTPATH_NONE;
+}
 
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ba880dae547f..8339bbf0fdd4 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -46,11 +46,14 @@ enum vcpu_tdx_state {
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 	struct vcpu_vt vt;
+	struct tdx_module_args vp_enter_args;
 
 	struct tdx_vp vp;
 
 	struct list_head cpu_list;
 
+	u64 vp_enter_ret;
+
 	enum vcpu_tdx_state state;
 };
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ff6370787926..83aac44b779b 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -131,6 +131,8 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
@@ -158,6 +160,11 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
+static inline int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
+static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+{
+	return EXIT_FASTPATH_NONE;
+}
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path
  2025-01-29  9:58 ` [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path Adrian Hunter
@ 2025-02-20 13:16   ` Xiaoyao Li
  2025-02-24 12:27     ` Adrian Hunter
  0 siblings, 1 reply; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-20 13:16 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 1/29/2025 5:58 PM, Adrian Hunter wrote:
> +#define TDX_REGS_UNSUPPORTED_SET	(BIT(VCPU_EXREG_RFLAGS) |	\
> +					 BIT(VCPU_EXREG_SEGMENTS))
> +
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> +{
> +	/*
> +	 * force_immediate_exit requires vCPU entering for events injection with
> +	 * an immediately exit followed. But The TDX module doesn't guarantee
> +	 * entry, it's already possible for KVM to_think_ it completely entry
> +	 * to the guest without actually having done so.
> +	 * Since KVM never needs to force an immediate exit for TDX, and can't
> +	 * do direct injection, just warn on force_immediate_exit.
> +	 */
> +	WARN_ON_ONCE(force_immediate_exit);
> +
> +	trace_kvm_entry(vcpu, force_immediate_exit);
> +
> +	tdx_vcpu_enter_exit(vcpu);
> +
> +	vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;

I don't understand this. Why only clear RFLAGS and SEGMENTS?

When creating the vcpu, vcpu->arch.regs_avail = ~0 in 
kvm_arch_vcpu_create().

now it only clears RFLAGS and SEGMENTS for TDX vcpu, which leaves other 
bits set. But I don't see any code that syncs the guest value of into 
vcpu->arch.regs[reg].

> +	trace_kvm_exit(vcpu, KVM_ISA_VMX);
> +
> +	return EXIT_FASTPATH_NONE;
> +}


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path
  2025-02-20 13:16   ` Xiaoyao Li
@ 2025-02-24 12:27     ` Adrian Hunter
  2025-02-25  6:15       ` Xiaoyao Li
  0 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-02-24 12:27 UTC (permalink / raw)
  To: Xiaoyao Li, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 20/02/25 15:16, Xiaoyao Li wrote:
> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>> +#define TDX_REGS_UNSUPPORTED_SET    (BIT(VCPU_EXREG_RFLAGS) |    \
>> +                     BIT(VCPU_EXREG_SEGMENTS))
>> +
>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>> +{
>> +    /*
>> +     * force_immediate_exit requires vCPU entering for events injection with
>> +     * an immediately exit followed. But The TDX module doesn't guarantee
>> +     * entry, it's already possible for KVM to_think_ it completely entry
>> +     * to the guest without actually having done so.
>> +     * Since KVM never needs to force an immediate exit for TDX, and can't
>> +     * do direct injection, just warn on force_immediate_exit.
>> +     */
>> +    WARN_ON_ONCE(force_immediate_exit);
>> +
>> +    trace_kvm_entry(vcpu, force_immediate_exit);
>> +
>> +    tdx_vcpu_enter_exit(vcpu);
>> +
>> +    vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
> 
> I don't understand this. Why only clear RFLAGS and SEGMENTS?
> 
> When creating the vcpu, vcpu->arch.regs_avail = ~0 in kvm_arch_vcpu_create().
> 
> now it only clears RFLAGS and SEGMENTS for TDX vcpu, which leaves other bits set. But I don't see any code that syncs the guest value of into vcpu->arch.regs[reg].

TDX guest registers are generally not known but
values are placed into vcpu->arch.regs when needed
to work with common code.

We used to use ~VMX_REGS_LAZY_LOAD_SET and tdx_cache_reg()
which has since been removed.

tdx_cache_reg() did not support RFLAGS, SEGMENTS,
EXIT_INFO_1/EXIT_INFO_2 but EXIT_INFO_1/EXIT_INFO_2 became
needed, so that just left RFLAGS, SEGMENTS.

> 
>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
>> +
>> +    return EXIT_FASTPATH_NONE;
>> +}
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path
  2025-02-24 12:27     ` Adrian Hunter
@ 2025-02-25  6:15       ` Xiaoyao Li
  2025-02-27 18:37         ` Adrian Hunter
  0 siblings, 1 reply; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-25  6:15 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 2/24/2025 8:27 PM, Adrian Hunter wrote:
> On 20/02/25 15:16, Xiaoyao Li wrote:
>> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>>> +#define TDX_REGS_UNSUPPORTED_SET    (BIT(VCPU_EXREG_RFLAGS) |    \
>>> +                     BIT(VCPU_EXREG_SEGMENTS))
>>> +
>>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>> +{
>>> +    /*
>>> +     * force_immediate_exit requires vCPU entering for events injection with
>>> +     * an immediately exit followed. But The TDX module doesn't guarantee
>>> +     * entry, it's already possible for KVM to_think_ it completely entry
>>> +     * to the guest without actually having done so.
>>> +     * Since KVM never needs to force an immediate exit for TDX, and can't
>>> +     * do direct injection, just warn on force_immediate_exit.
>>> +     */
>>> +    WARN_ON_ONCE(force_immediate_exit);
>>> +
>>> +    trace_kvm_entry(vcpu, force_immediate_exit);
>>> +
>>> +    tdx_vcpu_enter_exit(vcpu);
>>> +
>>> +    vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
>>
>> I don't understand this. Why only clear RFLAGS and SEGMENTS?
>>
>> When creating the vcpu, vcpu->arch.regs_avail = ~0 in kvm_arch_vcpu_create().
>>
>> now it only clears RFLAGS and SEGMENTS for TDX vcpu, which leaves other bits set. But I don't see any code that syncs the guest value of into vcpu->arch.regs[reg].
> 
> TDX guest registers are generally not known but
> values are placed into vcpu->arch.regs when needed
> to work with common code.
> 
> We used to use ~VMX_REGS_LAZY_LOAD_SET and tdx_cache_reg()
> which has since been removed.
> 
> tdx_cache_reg() did not support RFLAGS, SEGMENTS,
> EXIT_INFO_1/EXIT_INFO_2 but EXIT_INFO_1/EXIT_INFO_2 became
> needed, so that just left RFLAGS, SEGMENTS.

Quote what Sean said [1]

   “I'm also not convinced letting KVM read garbage for RIP, RSP, CR3, or
   PDPTRs is at all reasonable.  CR3 and PDPTRs should be unreachable,
   and I gotta imagine the same holds true for RSP.  Allow reads/writes
   to RIP is fine, in that it probably simplifies the overall code.”

We need to justify why to let KVM read "garbage" of VCPU_REGS_RIP,
VCPU_EXREG_PDPTR, VCPU_EXREG_CR0, VCPU_EXREG_CR3, VCPU_EXREG_CR4,
VCPU_EXREG_EXIT_INFO_1, and VCPU_EXREG_EXIT_INFO_2 are neeed.

The changelog justify nothing for it.

btw, how EXIT_INFO_1/EXIT_INFO_2 became needed? It seems I cannot find 
any TDX code use them.

[1] https://lore.kernel.org/all/Z2GiQS_RmYeHU09L@google.com/

>>
>>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
>>> +
>>> +    return EXIT_FASTPATH_NONE;
>>> +}
>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path
  2025-02-25  6:15       ` Xiaoyao Li
@ 2025-02-27 18:37         ` Adrian Hunter
  2025-03-06 18:19           ` Paolo Bonzini
  0 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-02-27 18:37 UTC (permalink / raw)
  To: Xiaoyao Li, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 25/02/25 08:15, Xiaoyao Li wrote:
> On 2/24/2025 8:27 PM, Adrian Hunter wrote:
>> On 20/02/25 15:16, Xiaoyao Li wrote:
>>> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>>>> +#define TDX_REGS_UNSUPPORTED_SET    (BIT(VCPU_EXREG_RFLAGS) |    \
>>>> +                     BIT(VCPU_EXREG_SEGMENTS))
>>>> +
>>>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>>> +{
>>>> +    /*
>>>> +     * force_immediate_exit requires vCPU entering for events injection with
>>>> +     * an immediately exit followed. But The TDX module doesn't guarantee
>>>> +     * entry, it's already possible for KVM to_think_ it completely entry
>>>> +     * to the guest without actually having done so.
>>>> +     * Since KVM never needs to force an immediate exit for TDX, and can't
>>>> +     * do direct injection, just warn on force_immediate_exit.
>>>> +     */
>>>> +    WARN_ON_ONCE(force_immediate_exit);
>>>> +
>>>> +    trace_kvm_entry(vcpu, force_immediate_exit);
>>>> +
>>>> +    tdx_vcpu_enter_exit(vcpu);
>>>> +
>>>> +    vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
>>>
>>> I don't understand this. Why only clear RFLAGS and SEGMENTS?
>>>
>>> When creating the vcpu, vcpu->arch.regs_avail = ~0 in kvm_arch_vcpu_create().
>>>
>>> now it only clears RFLAGS and SEGMENTS for TDX vcpu, which leaves other bits set. But I don't see any code that syncs the guest value of into vcpu->arch.regs[reg].
>>
>> TDX guest registers are generally not known but
>> values are placed into vcpu->arch.regs when needed
>> to work with common code.
>>
>> We used to use ~VMX_REGS_LAZY_LOAD_SET and tdx_cache_reg()
>> which has since been removed.
>>
>> tdx_cache_reg() did not support RFLAGS, SEGMENTS,
>> EXIT_INFO_1/EXIT_INFO_2 but EXIT_INFO_1/EXIT_INFO_2 became
>> needed, so that just left RFLAGS, SEGMENTS.
> 
> Quote what Sean said [1]
> 
>   “I'm also not convinced letting KVM read garbage for RIP, RSP, CR3, or
>   PDPTRs is at all reasonable.  CR3 and PDPTRs should be unreachable,
>   and I gotta imagine the same holds true for RSP.  Allow reads/writes
>   to RIP is fine, in that it probably simplifies the overall code.”
> 
> We need to justify why to let KVM read "garbage" of VCPU_REGS_RIP,
> VCPU_EXREG_PDPTR, VCPU_EXREG_CR0, VCPU_EXREG_CR3, VCPU_EXREG_CR4,
> VCPU_EXREG_EXIT_INFO_1, and VCPU_EXREG_EXIT_INFO_2 are neeed.
> 
> The changelog justify nothing for it.

Could add VCPU_REGS_RIP, VCPU_REGS_RSP, VCPU_EXREG_CR3, VCPU_EXREG_PDPTR.
But not VCPU_EXREG_CR0 nor VCPU_EXREG_CR4 since we started using them.

> 
> btw, how EXIT_INFO_1/EXIT_INFO_2 became needed? It seems I cannot find any TDX code use them.

vmx_get_exit_qual() / vmx_get_intr_info() are now used by TDX.

> 
> [1] https://lore.kernel.org/all/Z2GiQS_RmYeHU09L@google.com/
> 
>>>
>>>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
>>>> +
>>>> +    return EXIT_FASTPATH_NONE;
>>>> +}
>>>
>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path
  2025-02-27 18:37         ` Adrian Hunter
@ 2025-03-06 18:19           ` Paolo Bonzini
  2025-03-06 19:13             ` Adrian Hunter
  0 siblings, 1 reply; 39+ messages in thread
From: Paolo Bonzini @ 2025-03-06 18:19 UTC (permalink / raw)
  To: Adrian Hunter, Xiaoyao Li, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 2/27/25 19:37, Adrian Hunter wrote:
> On 25/02/25 08:15, Xiaoyao Li wrote:
>> On 2/24/2025 8:27 PM, Adrian Hunter wrote:
>>> On 20/02/25 15:16, Xiaoyao Li wrote:
>>>> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>>>>> +#define TDX_REGS_UNSUPPORTED_SET    (BIT(VCPU_EXREG_RFLAGS) |    \
>>>>> +                     BIT(VCPU_EXREG_SEGMENTS))
>>>>> +
>>>>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>>>> +{
>>>>> +    /*
>>>>> +     * force_immediate_exit requires vCPU entering for events injection with
>>>>> +     * an immediately exit followed. But The TDX module doesn't guarantee
>>>>> +     * entry, it's already possible for KVM to_think_ it completely entry
>>>>> +     * to the guest without actually having done so.
>>>>> +     * Since KVM never needs to force an immediate exit for TDX, and can't
>>>>> +     * do direct injection, just warn on force_immediate_exit.
>>>>> +     */
>>>>> +    WARN_ON_ONCE(force_immediate_exit);
>>>>> +
>>>>> +    trace_kvm_entry(vcpu, force_immediate_exit);
>>>>> +
>>>>> +    tdx_vcpu_enter_exit(vcpu);
>>>>> +
>>>>> +    vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
>>>>
>>>> I don't understand this. Why only clear RFLAGS and SEGMENTS?
>>>>
>>>> When creating the vcpu, vcpu->arch.regs_avail = ~0 in kvm_arch_vcpu_create().
>>>>
>>>> now it only clears RFLAGS and SEGMENTS for TDX vcpu, which leaves other bits set. But I don't see any code that syncs the guest value of into vcpu->arch.regs[reg].
>>>
>>> TDX guest registers are generally not known but
>>> values are placed into vcpu->arch.regs when needed
>>> to work with common code.
>>>
>>> We used to use ~VMX_REGS_LAZY_LOAD_SET and tdx_cache_reg()
>>> which has since been removed.
>>>
>>> tdx_cache_reg() did not support RFLAGS, SEGMENTS,
>>> EXIT_INFO_1/EXIT_INFO_2 but EXIT_INFO_1/EXIT_INFO_2 became
>>> needed, so that just left RFLAGS, SEGMENTS.
>>
>> Quote what Sean said [1]
>>
>>    “I'm also not convinced letting KVM read garbage for RIP, RSP, CR3, or
>>    PDPTRs is at all reasonable.  CR3 and PDPTRs should be unreachable,
>>    and I gotta imagine the same holds true for RSP.  Allow reads/writes
>>    to RIP is fine, in that it probably simplifies the overall code.”
>>
>> We need to justify why to let KVM read "garbage" of VCPU_REGS_RIP,
>> VCPU_EXREG_PDPTR, VCPU_EXREG_CR0, VCPU_EXREG_CR3, VCPU_EXREG_CR4,
>> VCPU_EXREG_EXIT_INFO_1, and VCPU_EXREG_EXIT_INFO_2 are neeed.
>>
>> The changelog justify nothing for it.
> 
> Could add VCPU_REGS_RIP, VCPU_REGS_RSP, VCPU_EXREG_CR3, VCPU_EXREG_PDPTR.
> But not VCPU_EXREG_CR0 nor VCPU_EXREG_CR4 since we started using them.

Hi Adrian,

how is CR0 used? And CR4 is only used other than for loading the XSAVE 
state, I think?

I will change this to a list of specific available registers instead of 
using "&= ~", and it would be even better if CR0/CR4 are not on the list.

Paolo

>> btw, how EXIT_INFO_1/EXIT_INFO_2 became needed? It seems I cannot find any TDX code use them.
> 
> vmx_get_exit_qual() / vmx_get_intr_info() are now used by TDX.
> 
>>
>> [1] https://lore.kernel.org/all/Z2GiQS_RmYeHU09L@google.com/
>>
>>>>
>>>>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
>>>>> +
>>>>> +    return EXIT_FASTPATH_NONE;
>>>>> +}
>>>>
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path
  2025-03-06 18:19           ` Paolo Bonzini
@ 2025-03-06 19:13             ` Adrian Hunter
  0 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-03-06 19:13 UTC (permalink / raw)
  To: Paolo Bonzini, Xiaoyao Li, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 6/03/25 20:19, Paolo Bonzini wrote:
> On 2/27/25 19:37, Adrian Hunter wrote:
>> On 25/02/25 08:15, Xiaoyao Li wrote:
>>> On 2/24/2025 8:27 PM, Adrian Hunter wrote:
>>>> On 20/02/25 15:16, Xiaoyao Li wrote:
>>>>> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>>>>>> +#define TDX_REGS_UNSUPPORTED_SET    (BIT(VCPU_EXREG_RFLAGS) |    \
>>>>>> +                     BIT(VCPU_EXREG_SEGMENTS))
>>>>>> +
>>>>>> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>>>>> +{
>>>>>> +    /*
>>>>>> +     * force_immediate_exit requires vCPU entering for events injection with
>>>>>> +     * an immediately exit followed. But The TDX module doesn't guarantee
>>>>>> +     * entry, it's already possible for KVM to_think_ it completely entry
>>>>>> +     * to the guest without actually having done so.
>>>>>> +     * Since KVM never needs to force an immediate exit for TDX, and can't
>>>>>> +     * do direct injection, just warn on force_immediate_exit.
>>>>>> +     */
>>>>>> +    WARN_ON_ONCE(force_immediate_exit);
>>>>>> +
>>>>>> +    trace_kvm_entry(vcpu, force_immediate_exit);
>>>>>> +
>>>>>> +    tdx_vcpu_enter_exit(vcpu);
>>>>>> +
>>>>>> +    vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
>>>>>
>>>>> I don't understand this. Why only clear RFLAGS and SEGMENTS?
>>>>>
>>>>> When creating the vcpu, vcpu->arch.regs_avail = ~0 in kvm_arch_vcpu_create().
>>>>>
>>>>> now it only clears RFLAGS and SEGMENTS for TDX vcpu, which leaves other bits set. But I don't see any code that syncs the guest value of into vcpu->arch.regs[reg].
>>>>
>>>> TDX guest registers are generally not known but
>>>> values are placed into vcpu->arch.regs when needed
>>>> to work with common code.
>>>>
>>>> We used to use ~VMX_REGS_LAZY_LOAD_SET and tdx_cache_reg()
>>>> which has since been removed.
>>>>
>>>> tdx_cache_reg() did not support RFLAGS, SEGMENTS,
>>>> EXIT_INFO_1/EXIT_INFO_2 but EXIT_INFO_1/EXIT_INFO_2 became
>>>> needed, so that just left RFLAGS, SEGMENTS.
>>>
>>> Quote what Sean said [1]
>>>
>>>    “I'm also not convinced letting KVM read garbage for RIP, RSP, CR3, or
>>>    PDPTRs is at all reasonable.  CR3 and PDPTRs should be unreachable,
>>>    and I gotta imagine the same holds true for RSP.  Allow reads/writes
>>>    to RIP is fine, in that it probably simplifies the overall code.”
>>>
>>> We need to justify why to let KVM read "garbage" of VCPU_REGS_RIP,
>>> VCPU_EXREG_PDPTR, VCPU_EXREG_CR0, VCPU_EXREG_CR3, VCPU_EXREG_CR4,
>>> VCPU_EXREG_EXIT_INFO_1, and VCPU_EXREG_EXIT_INFO_2 are neeed.
>>>
>>> The changelog justify nothing for it.
>>
>> Could add VCPU_REGS_RIP, VCPU_REGS_RSP, VCPU_EXREG_CR3, VCPU_EXREG_PDPTR.
>> But not VCPU_EXREG_CR0 nor VCPU_EXREG_CR4 since we started using them.
> 
> Hi Adrian,
> 
> how is CR0 used? And CR4 is only used other than for loading the XSAVE state, I think?

I meant it is used in the sense that patch "[PATCH V2 07/12] KVM: TDX: 
restore host xsave state when exit from the guest TD" provides a value for it.

But it looks like it might be accessible via:
	store_regs()
		__get_sregs()
			__get_sregs_common()

Sean wanted a maximal CR0 value consistent with the CR4.

CR4 is also being used in kvm_update_cpuid_runtime().

> 
> I will change this to a list of specific available registers instead of using "&= ~", and it would be even better if CR0/CR4 are not on the list.
> 
> Paolo
> 
>>> btw, how EXIT_INFO_1/EXIT_INFO_2 became needed? It seems I cannot find any TDX code use them.
>>
>> vmx_get_exit_qual() / vmx_get_intr_info() are now used by TDX.
>>
>>>
>>> [1] https://lore.kernel.org/all/Z2GiQS_RmYeHU09L@google.com/
>>>
>>>>>
>>>>>> +    trace_kvm_exit(vcpu, KVM_ISA_VMX);
>>>>>> +
>>>>>> +    return EXIT_FASTPATH_NONE;
>>>>>> +}
>>>>>
>>>>
>>>
>>
>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 06/12] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (4 preceding siblings ...)
  2025-01-29  9:58 ` [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-01-29  9:58 ` [PATCH V2 07/12] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

On entering/exiting TDX vcpu, preserved or clobbered CPU state is different
from the VMX case. Add TDX hooks to save/restore host/guest CPU state.
Save/restore kernel GS base MSR.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TD vcpu enter/exit v2:
 - Use 1 variable named 'guest_state_loaded' to track host state
   save/restore (Sean)
 - Rebased due to moving guest_state_loaded/msr_host_kernel_gs_base
   to struct vcpu_vt.

TD vcpu enter/exit v1:
 - Clarify comment (Binbin)
 - Use lower case preserved and add the for VMX in log (Tony)
 - Fix bisectability issue with includes (Kai)
---
 arch/x86/kvm/vmx/main.c    | 24 +++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 40 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 3 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 301c1a26606f..341aa537ca72 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -133,6 +133,26 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vmx_vcpu_load(vcpu, cpu);
 }
 
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_prepare_switch_to_guest(vcpu);
+		return;
+	}
+
+	vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_put(vcpu);
+		return;
+	}
+
+	vmx_vcpu_put(vcpu);
+}
+
 static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -253,9 +273,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_free = vt_vcpu_free,
 	.vcpu_reset = vt_vcpu_reset,
 
-	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+	.prepare_switch_to_guest = vt_prepare_switch_to_guest,
 	.vcpu_load = vt_vcpu_load,
-	.vcpu_put = vmx_vcpu_put,
+	.vcpu_put = vt_vcpu_put,
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
 	.get_feature_msr = vmx_get_feature_msr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 95420ffd0022..3f3d61935a58 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,7 @@
 #include <linux/cleanup.h>
 #include <linux/cpu.h>
 #include <asm/cpufeature.h>
+#include <linux/mmu_context.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
 #include "mmu.h"
@@ -11,6 +12,7 @@
 #include "vmx.h"
 #include "mmu/spte.h"
 #include "common.h"
+#include "posted_intr.h"
 #include <trace/events/kvm.h>
 #include "trace.h"
 
@@ -642,6 +644,44 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	local_irq_enable();
 }
 
+/*
+ * Compared to vmx_prepare_switch_to_guest(), there is not much to do
+ * as SEAMCALL/SEAMRET calls take care of most of save and restore.
+ */
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vt *vt = to_vt(vcpu);
+
+	if (vt->guest_state_loaded)
+		return;
+
+	if (likely(is_64bit_mm(current->mm)))
+		vt->msr_host_kernel_gs_base = current->thread.gsbase;
+	else
+		vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+	vt->guest_state_loaded = true;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vt *vt = to_vt(vcpu);
+
+	if (!vt->guest_state_loaded)
+		return;
+
+	++vcpu->stat.host_state_reload;
+	wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
+
+	vt->guest_state_loaded = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	vmx_vcpu_pi_put(vcpu);
+	tdx_prepare_switch_to_host(vcpu);
+}
+
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 83aac44b779b..f856eac8f1e8 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -133,6 +133,8 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
@@ -165,6 +167,8 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediat
 {
 	return EXIT_FASTPATH_NONE;
 }
+static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V2 07/12] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (5 preceding siblings ...)
  2025-01-29  9:58 ` [PATCH V2 06/12] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-02-25  6:43   ` Xiaoyao Li
  2025-01-29  9:58 ` [PATCH V2 08/12] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr Adrian Hunter
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

On exiting from the guest TD, xsave state is clobbered.  Restore xsave
state on TD exit.

Set up guest state so that existing kvm_load_host_xsave_state() can be
used. Do not allow VCPU entry if guest state conflicts with the TD's
configuration.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v2:
 - Drop PT and CET feature flags (Chao)
 - Use cpu_feature_enabled() instead of static_cpu_has() (Chao)
 - Restore PKRU only if the host value differs from defined
   exit value (Chao)
 - Use defined masks to separate XFAM bits into XCR0/XSS (Adrian)
 - Use existing kvm_load_host_xsave_state() in place of
   tdx_restore_host_xsave_state() by defining guest CR4, XCR0, XSS
   and PKRU (Sean)
 - Do not enter if vital guest state is invalid (Adrian)

TD vcpu enter/exit v1:
 - Remove noinstr on tdx_vcpu_enter_exit() (Sean)
 - Switch to kvm_host struct for xcr0 and xss

v19:
 - Add EXPORT_SYMBOL_GPL(host_xcr0)

v15 -> v16:
 - Added CET flag mask
---
 arch/x86/kvm/vmx/tdx.c | 72 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3f3d61935a58..e4355553569a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -83,16 +83,21 @@ static u64 tdx_get_supported_attrs(const struct tdx_sys_info_td_conf *td_conf)
 	return val;
 }
 
+/*
+ * Before returning from TDH.VP.ENTER, the TDX Module assigns:
+ *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9, 18:17)
+ *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
+ */
+#define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9) | GENMASK(18, 17))
+#define TDX_XFAM_XSS_MASK	(BIT(8) | GENMASK(16, 10))
+#define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
+
 static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
 {
 	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
 
-	/*
-	 * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
-	 * and, CET support.
-	 */
-	val |= XFEATURE_MASK_PT | XFEATURE_MASK_CET_USER |
-	       XFEATURE_MASK_CET_KERNEL;
+	/* Ensure features are in the masks */
+	val &= TDX_XFAM_MASK;
 
 	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
 		return 0;
@@ -724,6 +729,19 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	return vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK) ||
+	       vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK) ||
+	       vcpu->arch.pkru ||
+	       (cpu_feature_enabled(X86_FEATURE_XSAVE) &&
+		!kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) ||
+	       (cpu_feature_enabled(X86_FEATURE_XSAVES) &&
+		!guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES));
+}
+
 static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -740,6 +758,8 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
 	/*
 	 * force_immediate_exit requires vCPU entering for events injection with
 	 * an immediately exit followed. But The TDX module doesn't guarantee
@@ -750,10 +770,22 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	 */
 	WARN_ON_ONCE(force_immediate_exit);
 
+	if (WARN_ON_ONCE(tdx_guest_state_is_invalid(vcpu))) {
+		/*
+		 * Invalid exit_reason becomes KVM_EXIT_INTERNAL_ERROR, refer
+		 * tdx_handle_exit().
+		 */
+		tdx->vt.exit_reason.full = -1u;
+		tdx->vp_enter_ret = -1u;
+		return EXIT_FASTPATH_NONE;
+	}
+
 	trace_kvm_entry(vcpu, force_immediate_exit);
 
 	tdx_vcpu_enter_exit(vcpu);
 
+	kvm_load_host_xsave_state(vcpu);
+
 	vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
 
 	trace_kvm_exit(vcpu, KVM_ISA_VMX);
@@ -1878,9 +1910,23 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 	return r;
 }
 
+static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
+{
+	u64 cr0 = ~CR0_RESERVED_BITS;
+
+	if (cr4 & X86_CR4_CET)
+		cr0 |= X86_CR0_WP;
+
+	cr0 |= X86_CR0_PE | X86_CR0_NE;
+	cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
+
+	return cr0;
+}
+
 static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 {
 	u64 apic_base;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int ret;
 
@@ -1903,6 +1949,20 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 	if (ret)
 		return ret;
 
+	vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;
+	vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
+	/*
+	 * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
+	 * maximal values supported by the guest, and zeroes PKRU, so from
+	 * KVM's perspective, those are the guest's values at all times.
+	 */
+	vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
+	vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
+	vcpu->arch.pkru = 0;
+
+	/* TODO: freeze vCPU model before kvm_update_cpuid_runtime() */
+	kvm_update_cpuid_runtime(vcpu);
+
 	tdx->state = VCPU_TD_STATE_INITIALIZED;
 
 	return 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 07/12] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-01-29  9:58 ` [PATCH V2 07/12] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
@ 2025-02-25  6:43   ` Xiaoyao Li
  2025-02-27 14:29     ` Adrian Hunter
  0 siblings, 1 reply; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-25  6:43 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 1/29/2025 5:58 PM, Adrian Hunter wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> On exiting from the guest TD, xsave state is clobbered.  Restore xsave
> state on TD exit.
> 
> Set up guest state so that existing kvm_load_host_xsave_state() can be
> used. Do not allow VCPU entry if guest state conflicts with the TD's
> configuration.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
> TD vcpu enter/exit v2:
>   - Drop PT and CET feature flags (Chao)
>   - Use cpu_feature_enabled() instead of static_cpu_has() (Chao)
>   - Restore PKRU only if the host value differs from defined
>     exit value (Chao)
>   - Use defined masks to separate XFAM bits into XCR0/XSS (Adrian)
>   - Use existing kvm_load_host_xsave_state() in place of
>     tdx_restore_host_xsave_state() by defining guest CR4, XCR0, XSS
>     and PKRU (Sean)
>   - Do not enter if vital guest state is invalid (Adrian)
> 
> TD vcpu enter/exit v1:
>   - Remove noinstr on tdx_vcpu_enter_exit() (Sean)
>   - Switch to kvm_host struct for xcr0 and xss
> 
> v19:
>   - Add EXPORT_SYMBOL_GPL(host_xcr0)
> 
> v15 -> v16:
>   - Added CET flag mask
> ---
>   arch/x86/kvm/vmx/tdx.c | 72 ++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 66 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3f3d61935a58..e4355553569a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -83,16 +83,21 @@ static u64 tdx_get_supported_attrs(const struct tdx_sys_info_td_conf *td_conf)
>   	return val;
>   }
>   
> +/*
> + * Before returning from TDH.VP.ENTER, the TDX Module assigns:
> + *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9, 18:17)
> + *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
> + */
> +#define TDX_XFAM_XCR0_MASK	(GENMASK(7, 0) | BIT(9) | GENMASK(18, 17))
> +#define TDX_XFAM_XSS_MASK	(BIT(8) | GENMASK(16, 10))
> +#define TDX_XFAM_MASK		(TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)

No need to define TDX-specific mask for XFAM. kernel defines 
XFEATURE_MASK_* and you can define XFEATURE_XCR0_MASK and 
XFEATURE_XSS_MASK with XFEATURE_MASK_*.

>   static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
>   {
>   	u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
>   
> -	/*
> -	 * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
> -	 * and, CET support.
> -	 */
> -	val |= XFEATURE_MASK_PT | XFEATURE_MASK_CET_USER |
> -	       XFEATURE_MASK_CET_KERNEL;
> +	/* Ensure features are in the masks */
> +	val &= TDX_XFAM_MASK;

It's unncessary.

kvm_caps.supported_xcr0 | kvm_caps.supported_xss must be the subset of 
TDX_XFAM_MASK;

>   	if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
>   		return 0;
> @@ -724,6 +729,19 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
>   	return 1;
>   }
>   
> +static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +
> +	return vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK) ||
> +	       vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK) ||
> +	       vcpu->arch.pkru ||
> +	       (cpu_feature_enabled(X86_FEATURE_XSAVE) &&
> +		!kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) ||
> +	       (cpu_feature_enabled(X86_FEATURE_XSAVES) &&
> +		!guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES));

guest_cpu_cap_has() is better to put into the path when userspace 
configures the vcpu model. So that KVM can return error to userspace 
earlier before running the vcpu.

 > +}> +
>   static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -740,6 +758,8 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>   
>   fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>   {
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
>   	/*
>   	 * force_immediate_exit requires vCPU entering for events injection with
>   	 * an immediately exit followed. But The TDX module doesn't guarantee
> @@ -750,10 +770,22 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>   	 */
>   	WARN_ON_ONCE(force_immediate_exit);
>   
> +	if (WARN_ON_ONCE(tdx_guest_state_is_invalid(vcpu))) {
> +		/*
> +		 * Invalid exit_reason becomes KVM_EXIT_INTERNAL_ERROR, refer
> +		 * tdx_handle_exit().
> +		 */
> +		tdx->vt.exit_reason.full = -1u;
> +		tdx->vp_enter_ret = -1u;
> +		return EXIT_FASTPATH_NONE;
> +	}
> +
>   	trace_kvm_entry(vcpu, force_immediate_exit);
>   
>   	tdx_vcpu_enter_exit(vcpu);
>   
> +	kvm_load_host_xsave_state(vcpu);
> +
>   	vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
>   
>   	trace_kvm_exit(vcpu, KVM_ISA_VMX);
> @@ -1878,9 +1910,23 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>   	return r;
>   }
>   
> +static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
> +{
> +	u64 cr0 = ~CR0_RESERVED_BITS;
> +
> +	if (cr4 & X86_CR4_CET)
> +		cr0 |= X86_CR0_WP;
> +
> +	cr0 |= X86_CR0_PE | X86_CR0_NE;
> +	cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
> +
> +	return cr0;
> +}
> +
>   static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>   {
>   	u64 apic_base;
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
>   	int ret;
>   
> @@ -1903,6 +1949,20 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>   	if (ret)
>   		return ret;
>   
> +	vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;

when userspace doesn't configure XFEATURE_MASK_PKRU in XFAM, it seems 
kvm_load_host_xsave_state() will skip restore host's PKRU value.

Besides, vcpu->arch.cr4_guest_rsvd_bits depends on KVM_SET_CPUID*, we 
need enfore the dependency that tdx_vcpu_init() needs to be called after 
vcpu model is configured.

> +	vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
> +	/*
> +	 * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
> +	 * maximal values supported by the guest, and zeroes PKRU, so from
> +	 * KVM's perspective, those are the guest's values at all times.
> +	 */

It's better to also call out that this is only to make 
kvm_load_host_xsave_state() work for TDX. They don't represent the real 
guest value.

> +	vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
> +	vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
> +	vcpu->arch.pkru = 0;
> +
> +	/* TODO: freeze vCPU model before kvm_update_cpuid_runtime() */
> +	kvm_update_cpuid_runtime(vcpu);
> +
>   	tdx->state = VCPU_TD_STATE_INITIALIZED;
>   
>   	return 0;


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 07/12] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-02-25  6:43   ` Xiaoyao Li
@ 2025-02-27 14:29     ` Adrian Hunter
  2025-02-28  1:58       ` Xiaoyao Li
  0 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-02-27 14:29 UTC (permalink / raw)
  To: Xiaoyao Li, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 25/02/25 08:43, Xiaoyao Li wrote:
> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> On exiting from the guest TD, xsave state is clobbered.  Restore xsave
>> state on TD exit.
>>
>> Set up guest state so that existing kvm_load_host_xsave_state() can be
>> used. Do not allow VCPU entry if guest state conflicts with the TD's
>> configuration.
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>> TD vcpu enter/exit v2:
>>   - Drop PT and CET feature flags (Chao)
>>   - Use cpu_feature_enabled() instead of static_cpu_has() (Chao)
>>   - Restore PKRU only if the host value differs from defined
>>     exit value (Chao)
>>   - Use defined masks to separate XFAM bits into XCR0/XSS (Adrian)
>>   - Use existing kvm_load_host_xsave_state() in place of
>>     tdx_restore_host_xsave_state() by defining guest CR4, XCR0, XSS
>>     and PKRU (Sean)
>>   - Do not enter if vital guest state is invalid (Adrian)
>>
>> TD vcpu enter/exit v1:
>>   - Remove noinstr on tdx_vcpu_enter_exit() (Sean)
>>   - Switch to kvm_host struct for xcr0 and xss
>>
>> v19:
>>   - Add EXPORT_SYMBOL_GPL(host_xcr0)
>>
>> v15 -> v16:
>>   - Added CET flag mask
>> ---
>>   arch/x86/kvm/vmx/tdx.c | 72 ++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 66 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 3f3d61935a58..e4355553569a 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -83,16 +83,21 @@ static u64 tdx_get_supported_attrs(const struct tdx_sys_info_td_conf *td_conf)
>>       return val;
>>   }
>>   +/*
>> + * Before returning from TDH.VP.ENTER, the TDX Module assigns:
>> + *   XCR0 to the TD’s user-mode feature bits of XFAM (bits 7:0, 9, 18:17)
>> + *   IA32_XSS to the TD's supervisor-mode feature bits of XFAM (bits 8, 16:10)
>> + */
>> +#define TDX_XFAM_XCR0_MASK    (GENMASK(7, 0) | BIT(9) | GENMASK(18, 17))
>> +#define TDX_XFAM_XSS_MASK    (BIT(8) | GENMASK(16, 10))
>> +#define TDX_XFAM_MASK        (TDX_XFAM_XCR0_MASK | TDX_XFAM_XSS_MASK)
> 
> No need to define TDX-specific mask for XFAM. kernel defines XFEATURE_MASK_* and you can define XFEATURE_XCR0_MASK and XFEATURE_XSS_MASK with XFEATURE_MASK_*.

Curently, those masks are being added only when the (host) kernel uses them.

> 
>>   static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf)
>>   {
>>       u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss;
>>   -    /*
>> -     * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
>> -     * and, CET support.
>> -     */
>> -    val |= XFEATURE_MASK_PT | XFEATURE_MASK_CET_USER |
>> -           XFEATURE_MASK_CET_KERNEL;
>> +    /* Ensure features are in the masks */
>> +    val &= TDX_XFAM_MASK;
> 
> It's unncessary.
> 
> kvm_caps.supported_xcr0 | kvm_caps.supported_xss must be the subset of TDX_XFAM_MASK;

The code has changed in kvm-coco-queue.

> 
>>       if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1)
>>           return 0;
>> @@ -724,6 +729,19 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
>>       return 1;
>>   }
>>   +static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
>> +{
>> +    struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> +
>> +    return vcpu->arch.xcr0 != (kvm_tdx->xfam & TDX_XFAM_XCR0_MASK) ||
>> +           vcpu->arch.ia32_xss != (kvm_tdx->xfam & TDX_XFAM_XSS_MASK) ||
>> +           vcpu->arch.pkru ||
>> +           (cpu_feature_enabled(X86_FEATURE_XSAVE) &&
>> +        !kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) ||
>> +           (cpu_feature_enabled(X86_FEATURE_XSAVES) &&
>> +        !guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES));
> 
> guest_cpu_cap_has() is better to put into the path when userspace configures the vcpu model. So that KVM can return error to userspace earlier before running the vcpu.

The purpose of the function is to protect host state from being
restored incorrectly, which is why it is called before tdx_vcpu_enter_exit().
i.e. it is protecting against unexpected changes to guest state information
that do not match TD configuration.  That is largely because the KVM code base
is quite complex and the consequences of messing up host state are dire.
guest_cpu_cap_has() is very quick, so there is not really any reason
not to use it here.

> 
>> +}> +
>>   static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>>   {
>>       struct vcpu_tdx *tdx = to_tdx(vcpu);
>> @@ -740,6 +758,8 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
>>     fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>   {
>> +    struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>>       /*
>>        * force_immediate_exit requires vCPU entering for events injection with
>>        * an immediately exit followed. But The TDX module doesn't guarantee
>> @@ -750,10 +770,22 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>        */
>>       WARN_ON_ONCE(force_immediate_exit);
>>   +    if (WARN_ON_ONCE(tdx_guest_state_is_invalid(vcpu))) {
>> +        /*
>> +         * Invalid exit_reason becomes KVM_EXIT_INTERNAL_ERROR, refer
>> +         * tdx_handle_exit().
>> +         */
>> +        tdx->vt.exit_reason.full = -1u;
>> +        tdx->vp_enter_ret = -1u;
>> +        return EXIT_FASTPATH_NONE;
>> +    }
>> +
>>       trace_kvm_entry(vcpu, force_immediate_exit);
>>         tdx_vcpu_enter_exit(vcpu);
>>   +    kvm_load_host_xsave_state(vcpu);
>> +
>>       vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
>>         trace_kvm_exit(vcpu, KVM_ISA_VMX);
>> @@ -1878,9 +1910,23 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>>       return r;
>>   }
>>   +static u64 tdx_guest_cr0(struct kvm_vcpu *vcpu, u64 cr4)
>> +{
>> +    u64 cr0 = ~CR0_RESERVED_BITS;
>> +
>> +    if (cr4 & X86_CR4_CET)
>> +        cr0 |= X86_CR0_WP;
>> +
>> +    cr0 |= X86_CR0_PE | X86_CR0_NE;
>> +    cr0 &= ~(X86_CR0_NW | X86_CR0_CD);
>> +
>> +    return cr0;
>> +}
>> +
>>   static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>>   {
>>       u64 apic_base;
>> +    struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>>       struct vcpu_tdx *tdx = to_tdx(vcpu);
>>       int ret;
>>   @@ -1903,6 +1949,20 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>>       if (ret)
>>           return ret;
>>   +    vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;
> 
> when userspace doesn't configure XFEATURE_MASK_PKRU in XFAM, it seems kvm_load_host_xsave_state() will skip restore host's PKRU value.

That's correct.

> 
> Besides, vcpu->arch.cr4_guest_rsvd_bits depends on KVM_SET_CPUID*, we need enfore the dependency that tdx_vcpu_init() needs to be called after vcpu model is configured.

Sean had some code to freeze the CPU model, hence the subsequent TODO.
refer https://lore.kernel.org/all/Z4FZKOzXIdhLOlU8@google.com/
Anything that matters will be caught by tdx_guest_state_is_invalid().

> 
>> +    vcpu->arch.cr0 = tdx_guest_cr0(vcpu, vcpu->arch.cr4);
>> +    /*
>> +     * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the
>> +     * maximal values supported by the guest, and zeroes PKRU, so from
>> +     * KVM's perspective, those are the guest's values at all times.
>> +     */
> 
> It's better to also call out that this is only to make kvm_load_host_xsave_state() work for TDX. They don't represent the real guest value.

It is to provide a reasonable value not just for
kvm_load_host_xsave_state().

Note Sean wrote the comment.

This patch set is owned by Paolo now, so it is up to him.

> 
>> +    vcpu->arch.ia32_xss = kvm_tdx->xfam & TDX_XFAM_XSS_MASK;
>> +    vcpu->arch.xcr0 = kvm_tdx->xfam & TDX_XFAM_XCR0_MASK;
>> +    vcpu->arch.pkru = 0;
>> +
>> +    /* TODO: freeze vCPU model before kvm_update_cpuid_runtime() */
>> +    kvm_update_cpuid_runtime(vcpu);
>> +
>>       tdx->state = VCPU_TD_STATE_INITIALIZED;
>>         return 0;
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 07/12] KVM: TDX: restore host xsave state when exit from the guest TD
  2025-02-27 14:29     ` Adrian Hunter
@ 2025-02-28  1:58       ` Xiaoyao Li
  0 siblings, 0 replies; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-28  1:58 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 2/27/2025 10:29 PM, Adrian Hunter wrote:
>>>    +    vcpu->arch.cr4 = ~vcpu->arch.cr4_guest_rsvd_bits;
>> when userspace doesn't configure XFEATURE_MASK_PKRU in XFAM, it seems kvm_load_host_xsave_state() will skip restore host's PKRU value.
> That's correct.

FYI, in this case, it's safe and functional though 
kvm_load_host_xsave_state() skips restoring the host's PKRU value.
Because host's PKRU is not clobbered after TD exit when XFAM.PKU is 0. 
So no need to restore.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 08/12] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (6 preceding siblings ...)
  2025-01-29  9:58 ` [PATCH V2 07/12] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-02-25  7:00   ` Xiaoyao Li
  2025-01-29  9:58 ` [PATCH V2 09/12] KVM: TDX: restore user ret MSRs Adrian Hunter
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

From: Chao Gao <chao.gao@intel.com>

Several MSRs are constant and only used in userspace(ring 3).  But VMs may
have different values.  KVM uses kvm_set_user_return_msr() to switch to
guest's values and leverages user return notifier to restore them when the
kernel is to return to userspace.  To eliminate unnecessary wrmsr, KVM also
caches the value it wrote to an MSR last time.

TDX module unconditionally resets some of these MSRs to architectural INIT
state on TD exit.  It makes the cached values in kvm_user_return_msrs are
inconsistent with values in hardware.  This inconsistency needs to be
fixed.  Otherwise, it may mislead kvm_on_user_return() to skip restoring
some MSRs to the host's values.  kvm_set_user_return_msr() can help correct
this case, but it is not optimal as it always does a wrmsr.  So, introduce
a variation of kvm_set_user_return_msr() to update cached values and skip
that wrmsr.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TD vcpu enter/exit v2:
 - No changes

TD vcpu enter/exit v1:
 - Rename functions and remove useless comment (Binbin)
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6b686d62c735..e557a441fade 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2322,6 +2322,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 int kvm_add_user_return_msr(u32 msr);
 int kvm_find_user_return_msr(u32 msr);
 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
+void kvm_user_return_msr_update_cache(unsigned int index, u64 val);
 
 static inline bool kvm_is_supported_user_return_msr(u32 msr)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5cf9f023fd4b..15447fe7687c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -636,6 +636,15 @@ static void kvm_user_return_msr_cpu_online(void)
 	}
 }
 
+static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
+{
+	if (!msrs->registered) {
+		msrs->urn.on_user_return = kvm_on_user_return;
+		user_return_notifier_register(&msrs->urn);
+		msrs->registered = true;
+	}
+}
+
 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 {
 	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
@@ -649,15 +658,20 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 		return 1;
 
 	msrs->values[slot].curr = value;
-	if (!msrs->registered) {
-		msrs->urn.on_user_return = kvm_on_user_return;
-		user_return_notifier_register(&msrs->urn);
-		msrs->registered = true;
-	}
+	kvm_user_return_register_notifier(msrs);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
 
+void kvm_user_return_msr_update_cache(unsigned int slot, u64 value)
+{
+	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+	msrs->values[slot].curr = value;
+	kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(kvm_user_return_msr_update_cache);
+
 static void drop_user_return_notifiers(void)
 {
 	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 08/12] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
  2025-01-29  9:58 ` [PATCH V2 08/12] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr Adrian Hunter
@ 2025-02-25  7:00   ` Xiaoyao Li
  0 siblings, 0 replies; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-25  7:00 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 1/29/2025 5:58 PM, Adrian Hunter wrote:
> From: Chao Gao <chao.gao@intel.com>
> 
> Several MSRs are constant and only used in userspace(ring 3).  But VMs may
> have different values.  KVM uses kvm_set_user_return_msr() to switch to
> guest's values and leverages user return notifier to restore them when the
> kernel is to return to userspace.  To eliminate unnecessary wrmsr, KVM also
> caches the value it wrote to an MSR last time.
> 
> TDX module unconditionally resets some of these MSRs to architectural INIT
> state on TD exit.  It makes the cached values in kvm_user_return_msrs are
> inconsistent with values in hardware.  This inconsistency needs to be
> fixed.  Otherwise, it may mislead kvm_on_user_return() to skip restoring
> some MSRs to the host's values.  kvm_set_user_return_msr() can help correct
> this case, but it is not optimal as it always does a wrmsr.  So, introduce
> a variation of kvm_set_user_return_msr() to update cached values and skip
> that wrmsr.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

> ---
> TD vcpu enter/exit v2:
>   - No changes
> 
> TD vcpu enter/exit v1:
>   - Rename functions and remove useless comment (Binbin)
> ---
>   arch/x86/include/asm/kvm_host.h |  1 +
>   arch/x86/kvm/x86.c              | 24 +++++++++++++++++++-----
>   2 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6b686d62c735..e557a441fade 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2322,6 +2322,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
>   int kvm_add_user_return_msr(u32 msr);
>   int kvm_find_user_return_msr(u32 msr);
>   int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
> +void kvm_user_return_msr_update_cache(unsigned int index, u64 val);
>   
>   static inline bool kvm_is_supported_user_return_msr(u32 msr)
>   {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5cf9f023fd4b..15447fe7687c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -636,6 +636,15 @@ static void kvm_user_return_msr_cpu_online(void)
>   	}
>   }
>   
> +static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
> +{
> +	if (!msrs->registered) {
> +		msrs->urn.on_user_return = kvm_on_user_return;
> +		user_return_notifier_register(&msrs->urn);
> +		msrs->registered = true;
> +	}
> +}
> +
>   int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
>   {
>   	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
> @@ -649,15 +658,20 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
>   		return 1;
>   
>   	msrs->values[slot].curr = value;
> -	if (!msrs->registered) {
> -		msrs->urn.on_user_return = kvm_on_user_return;
> -		user_return_notifier_register(&msrs->urn);
> -		msrs->registered = true;
> -	}
> +	kvm_user_return_register_notifier(msrs);
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
>   
> +void kvm_user_return_msr_update_cache(unsigned int slot, u64 value)
> +{
> +	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
> +
> +	msrs->values[slot].curr = value;
> +	kvm_user_return_register_notifier(msrs);
> +}
> +EXPORT_SYMBOL_GPL(kvm_user_return_msr_update_cache);
> +
>   static void drop_user_return_notifiers(void)
>   {
>   	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 09/12] KVM: TDX: restore user ret MSRs
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (7 preceding siblings ...)
  2025-01-29  9:58 ` [PATCH V2 08/12] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-02-25  7:01   ` Xiaoyao Li
  2025-01-29  9:58 ` [PATCH V2 10/12] KVM: TDX: Disable support for TSX and WAITPKG Adrian Hunter
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Several user ret MSRs are clobbered on TD exit.  Restore those values on
TD exit and before returning to ring 3.

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TD vcpu enter/exit v2:
 - No changes

TD vcpu enter/exit v1:
 - Rename tdx_user_return_update_cache() ->
     tdx_user_return_msr_update_cache() (extrapolated from Binbin)
 - Adjust to rename in previous patches (Binbin)
 - Simplify comment (Tony)
 - Move code change in tdx_hardware_setup() to __tdx_bringup().
---
 arch/x86/kvm/vmx/tdx.c | 44 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e4355553569a..a0f5cdfd290b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -729,6 +729,28 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+struct tdx_uret_msr {
+	u32 msr;
+	unsigned int slot;
+	u64 defval;
+};
+
+static struct tdx_uret_msr tdx_uret_msrs[] = {
+	{.msr = MSR_SYSCALL_MASK, .defval = 0x20200 },
+	{.msr = MSR_STAR,},
+	{.msr = MSR_LSTAR,},
+	{.msr = MSR_TSC_AUX,},
+};
+
+static void tdx_user_return_msr_update_cache(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
+		kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot,
+						 tdx_uret_msrs[i].defval);
+}
+
 static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -784,6 +806,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 
 	tdx_vcpu_enter_exit(vcpu);
 
+	tdx_user_return_msr_update_cache();
+
 	kvm_load_host_xsave_state(vcpu);
 
 	vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
@@ -2245,7 +2269,25 @@ static bool __init kvm_can_support_tdx(void)
 static int __init __tdx_bringup(void)
 {
 	const struct tdx_sys_info_td_conf *td_conf;
-	int r;
+	int r, i;
+
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
+		/*
+		 * Check if MSRs (tdx_uret_msrs) can be saved/restored
+		 * before returning to user space.
+		 *
+		 * this_cpu_ptr(user_return_msrs)->registered isn't checked
+		 * because the registration is done at vcpu runtime by
+		 * tdx_user_return_msr_update_cache().
+		 */
+		tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
+		if (tdx_uret_msrs[i].slot == -1) {
+			/* If any MSR isn't supported, it is a KVM bug */
+			pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
+				tdx_uret_msrs[i].msr);
+			return -EIO;
+		}
+	}
 
 	/*
 	 * Enabling TDX requires enabling hardware virtualization first,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 09/12] KVM: TDX: restore user ret MSRs
  2025-01-29  9:58 ` [PATCH V2 09/12] KVM: TDX: restore user ret MSRs Adrian Hunter
@ 2025-02-25  7:01   ` Xiaoyao Li
  2025-02-27 14:19     ` Adrian Hunter
  0 siblings, 1 reply; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-25  7:01 UTC (permalink / raw)
  To: Adrian Hunter, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 1/29/2025 5:58 PM, Adrian Hunter wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Several user ret MSRs are clobbered on TD exit.  Restore those values on
> TD exit and before returning to ring 3.
> 
> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> TD vcpu enter/exit v2:
>   - No changes
> 
> TD vcpu enter/exit v1:
>   - Rename tdx_user_return_update_cache() ->
>       tdx_user_return_msr_update_cache() (extrapolated from Binbin)
>   - Adjust to rename in previous patches (Binbin)
>   - Simplify comment (Tony)
>   - Move code change in tdx_hardware_setup() to __tdx_bringup().
> ---
>   arch/x86/kvm/vmx/tdx.c | 44 +++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e4355553569a..a0f5cdfd290b 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -729,6 +729,28 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
>   	return 1;
>   }
>   
> +struct tdx_uret_msr {
> +	u32 msr;
> +	unsigned int slot;
> +	u64 defval;
> +};
> +
> +static struct tdx_uret_msr tdx_uret_msrs[] = {
> +	{.msr = MSR_SYSCALL_MASK, .defval = 0x20200 },
> +	{.msr = MSR_STAR,},
> +	{.msr = MSR_LSTAR,},
> +	{.msr = MSR_TSC_AUX,},
> +};
> +
> +static void tdx_user_return_msr_update_cache(void)
> +{
> +	int i;

I think it can be optimized to skip update cache if it the caches are 
updated already. No need to update the cache after every TD exit.

> +	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> +		kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot,
> +						 tdx_uret_msrs[i].defval);
> +}
> +
>   static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -784,6 +806,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>   
>   	tdx_vcpu_enter_exit(vcpu);
>   
> +	tdx_user_return_msr_update_cache();
> +
>   	kvm_load_host_xsave_state(vcpu);
>   
>   	vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
> @@ -2245,7 +2269,25 @@ static bool __init kvm_can_support_tdx(void)
>   static int __init __tdx_bringup(void)
>   {
>   	const struct tdx_sys_info_td_conf *td_conf;
> -	int r;
> +	int r, i;
> +
> +	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
> +		/*
> +		 * Check if MSRs (tdx_uret_msrs) can be saved/restored
> +		 * before returning to user space.
> +		 *
> +		 * this_cpu_ptr(user_return_msrs)->registered isn't checked
> +		 * because the registration is done at vcpu runtime by
> +		 * tdx_user_return_msr_update_cache().
> +		 */
> +		tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
> +		if (tdx_uret_msrs[i].slot == -1) {
> +			/* If any MSR isn't supported, it is a KVM bug */
> +			pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
> +				tdx_uret_msrs[i].msr);
> +			return -EIO;
> +		}
> +	}
>   
>   	/*
>   	 * Enabling TDX requires enabling hardware virtualization first,


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V2 09/12] KVM: TDX: restore user ret MSRs
  2025-02-25  7:01   ` Xiaoyao Li
@ 2025-02-27 14:19     ` Adrian Hunter
  0 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-02-27 14:19 UTC (permalink / raw)
  To: Xiaoyao Li, pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, nik.borisov, linux-kernel,
	yan.y.zhao, chao.gao, weijiang.yang

On 25/02/25 09:01, Xiaoyao Li wrote:
> On 1/29/2025 5:58 PM, Adrian Hunter wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Several user ret MSRs are clobbered on TD exit.  Restore those values on
>> TD exit and before returning to ring 3.
>>
>> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
>> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>> TD vcpu enter/exit v2:
>>   - No changes
>>
>> TD vcpu enter/exit v1:
>>   - Rename tdx_user_return_update_cache() ->
>>       tdx_user_return_msr_update_cache() (extrapolated from Binbin)
>>   - Adjust to rename in previous patches (Binbin)
>>   - Simplify comment (Tony)
>>   - Move code change in tdx_hardware_setup() to __tdx_bringup().
>> ---
>>   arch/x86/kvm/vmx/tdx.c | 44 +++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 43 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index e4355553569a..a0f5cdfd290b 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -729,6 +729,28 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
>>       return 1;
>>   }
>>   +struct tdx_uret_msr {
>> +    u32 msr;
>> +    unsigned int slot;
>> +    u64 defval;
>> +};
>> +
>> +static struct tdx_uret_msr tdx_uret_msrs[] = {
>> +    {.msr = MSR_SYSCALL_MASK, .defval = 0x20200 },
>> +    {.msr = MSR_STAR,},
>> +    {.msr = MSR_LSTAR,},
>> +    {.msr = MSR_TSC_AUX,},
>> +};
>> +
>> +static void tdx_user_return_msr_update_cache(void)
>> +{
>> +    int i;
> 
> I think it can be optimized to skip update cache if it the caches are updated already. No need to update the cache after every TD exit.

It might be ok to move tdx_user_return_msr_update_cache() to
tdx_prepare_switch_to_host() but tdx_user_return_msr_update_cache()
should be pretty quick, so probably not worth messing around with
it at this stage.

This patch set is owned by Paolo now, so it is up to him.

> 
>> +    for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
>> +        kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot,
>> +                         tdx_uret_msrs[i].defval);
>> +}
>> +
>>   static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
>>   {
>>       struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>> @@ -784,6 +806,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>         tdx_vcpu_enter_exit(vcpu);
>>   +    tdx_user_return_msr_update_cache();
>> +
>>       kvm_load_host_xsave_state(vcpu);
>>         vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
>> @@ -2245,7 +2269,25 @@ static bool __init kvm_can_support_tdx(void)
>>   static int __init __tdx_bringup(void)
>>   {
>>       const struct tdx_sys_info_td_conf *td_conf;
>> -    int r;
>> +    int r, i;
>> +
>> +    for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
>> +        /*
>> +         * Check if MSRs (tdx_uret_msrs) can be saved/restored
>> +         * before returning to user space.
>> +         *
>> +         * this_cpu_ptr(user_return_msrs)->registered isn't checked
>> +         * because the registration is done at vcpu runtime by
>> +         * tdx_user_return_msr_update_cache().
>> +         */
>> +        tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
>> +        if (tdx_uret_msrs[i].slot == -1) {
>> +            /* If any MSR isn't supported, it is a KVM bug */
>> +            pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
>> +                tdx_uret_msrs[i].msr);
>> +            return -EIO;
>> +        }
>> +    }
>>         /*
>>        * Enabling TDX requires enabling hardware virtualization first,
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 10/12] KVM: TDX: Disable support for TSX and WAITPKG
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (8 preceding siblings ...)
  2025-01-29  9:58 ` [PATCH V2 09/12] KVM: TDX: restore user ret MSRs Adrian Hunter
@ 2025-01-29  9:58 ` Adrian Hunter
  2025-01-29  9:59 ` [PATCH V2 11/12] KVM: TDX: Save and restore IA32_DEBUGCTL Adrian Hunter
  2025-01-29  9:59 ` [PATCH V2 12/12] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior Adrian Hunter
  11 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:58 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

Support for restoring IA32_TSX_CTRL MSR and IA32_UMWAIT_CONTROL MSR is not
yet implemented, so disable support for TSX and WAITPKG for now.  Clear the
associated CPUID bits returned by KVM_TDX_CAPABILITIES, and return an error
if those bits are set in KVM_TDX_INIT_VM.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v2:
 - New patch
---
 arch/x86/kvm/vmx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a0f5cdfd290b..70996af4be64 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -117,6 +117,44 @@ static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits)
 	return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16;
 }
 
+#define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM))
+
+static bool has_tsx(const struct kvm_cpuid_entry2 *entry)
+{
+	return entry->function == 7 && entry->index == 0 &&
+	       (entry->ebx & TDX_FEATURE_TSX);
+}
+
+static void clear_tsx(struct kvm_cpuid_entry2 *entry)
+{
+	entry->ebx &= ~TDX_FEATURE_TSX;
+}
+
+static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry)
+{
+	return entry->function == 7 && entry->index == 0 &&
+	       (entry->ecx & __feature_bit(X86_FEATURE_WAITPKG));
+}
+
+static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
+{
+	entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
+}
+
+static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
+{
+	if (has_tsx(entry))
+		clear_tsx(entry);
+
+	if (has_waitpkg(entry))
+		clear_waitpkg(entry);
+}
+
+static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
+{
+	return has_tsx(entry) || has_waitpkg(entry);
+}
+
 #define KVM_TDX_CPUID_NO_SUBLEAF	((__u32)-1)
 
 static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
@@ -140,6 +178,8 @@ static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char i
 	 */
 	if (entry->function == 0x80000008)
 		entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
+
+	tdx_clear_unsupported_cpuid(entry);
 }
 
 static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
@@ -1214,6 +1254,9 @@ static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
 		if (!entry)
 			continue;
 
+		if (tdx_unsupported_cpuid(entry))
+			return -EINVAL;
+
 		copy_cnt++;
 
 		value = &td_params->cpuid_values[i];
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V2 11/12] KVM: TDX: Save and restore IA32_DEBUGCTL
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (9 preceding siblings ...)
  2025-01-29  9:58 ` [PATCH V2 10/12] KVM: TDX: Disable support for TSX and WAITPKG Adrian Hunter
@ 2025-01-29  9:59 ` Adrian Hunter
  2025-01-29  9:59 ` [PATCH V2 12/12] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior Adrian Hunter
  11 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:59 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

Save the IA32_DEBUGCTL MSR before entering a TDX VCPU and restore it
afterwards.  The TDX Module preserves bits 1, 12, and 14, so if no
other bits are set, no restore is done.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
TD vcpu enter/exit v2:
 - New patch
 - Rebased due to moving host_debugctlmsr to struct vcpu_vt.
---
 arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 70996af4be64..0bce00415f42 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -705,6 +705,8 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	else
 		vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
 
+	vt->host_debugctlmsr = get_debugctlmsr();
+
 	vt->guest_state_loaded = true;
 }
 
@@ -818,9 +820,14 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 #define TDX_REGS_UNSUPPORTED_SET	(BIT(VCPU_EXREG_RFLAGS) |	\
 					 BIT(VCPU_EXREG_SEGMENTS))
 
+#define TDX_DEBUGCTL_PRESERVED (DEBUGCTLMSR_BTF | \
+				DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI | \
+				DEBUGCTLMSR_FREEZE_IN_SMM)
+
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct vcpu_vt *vt = to_vt(vcpu);
 
 	/*
 	 * force_immediate_exit requires vCPU entering for events injection with
@@ -846,6 +853,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 
 	tdx_vcpu_enter_exit(vcpu);
 
+	if (vt->host_debugctlmsr & ~TDX_DEBUGCTL_PRESERVED)
+		update_debugctlmsr(vt->host_debugctlmsr);
+
 	tdx_user_return_msr_update_cache();
 
 	kvm_load_host_xsave_state(vcpu);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V2 12/12] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
  2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
                   ` (10 preceding siblings ...)
  2025-01-29  9:59 ` [PATCH V2 11/12] KVM: TDX: Save and restore IA32_DEBUGCTL Adrian Hunter
@ 2025-01-29  9:59 ` Adrian Hunter
  11 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:59 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest
DRs.

TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter,
and resets DRs to architectural INIT state on TD exit.  Use the new
flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs.  KVM still needs to restore host DRs after TD
exit if there are active breakpoints in the host, which is covered by
the existing code.

MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set.  Add
a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH
are set.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
TD vcpu enter/exit v2:
- Moved from TDX "the rest" to "TD vcpu enter/exit"

TDX "the rest" v1:
- Update the comment about KVM_DEBUGREG_AUTO_SWITCH.
- Check explicitly KVM_DEBUGREG_AUTO_SWITCH is not set in switch_db_regs
  before restoring guest DRs, because KVM_DEBUGREG_BP_ENABLED could be set
  by userspace. (Paolo)
  https://lore.kernel.org/lkml/ea136ac6-53cf-cdc5-a741-acfb437819b1@redhat.com/
- Fix the issue that host DRs are not restored in v19 (Binbin)
  https://lore.kernel.org/kvm/20240413002026.GP3039520@ls.amr.corp.intel.com/
- Update the changelog a bit.
---
 arch/x86/include/asm/kvm_host.h | 11 +++++++++--
 arch/x86/kvm/vmx/tdx.c          |  1 +
 arch/x86/kvm/x86.c              |  4 +++-
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e557a441fade..bcfd89c28308 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -606,8 +606,15 @@ struct kvm_pmu {
 struct kvm_pmu_ops;
 
 enum {
-	KVM_DEBUGREG_BP_ENABLED = 1,
-	KVM_DEBUGREG_WONT_EXIT = 2,
+	KVM_DEBUGREG_BP_ENABLED		= BIT(0),
+	KVM_DEBUGREG_WONT_EXIT		= BIT(1),
+	/*
+	 * Guest debug registers (DR0-3, DR6 and DR7) are saved/restored by
+	 * hardware on exit from or enter to guest. KVM needn't switch them.
+	 * DR0-3, DR6 and DR7 are set to their architectural INIT value on VM
+	 * exit, host values need to be restored.
+	 */
+	KVM_DEBUGREG_AUTO_SWITCH	= BIT(2),
 };
 
 struct kvm_mtrr {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0bce00415f42..0863bdaf761a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -652,6 +652,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
 
+	vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
 	vcpu->arch.cr0_guest_owned_bits = -1ul;
 	vcpu->arch.cr4_guest_owned_bits = -1ul;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 15447fe7687c..b023283e7ed4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10977,7 +10977,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.guest_fpu.xfd_err)
 		wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
 
-	if (unlikely(vcpu->arch.switch_db_regs)) {
+	if (unlikely(vcpu->arch.switch_db_regs &&
+		     !(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);
 		set_debugreg(vcpu->arch.eff_db[1], 1);
@@ -11024,6 +11025,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 */
 	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
 		WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
+		WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
 		kvm_x86_call(sync_dirty_debug_regs)(vcpu);
 		kvm_update_dr0123(vcpu);
 		kvm_update_dr7(vcpu);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit
@ 2025-01-29  9:56 Adrian Hunter
  0 siblings, 0 replies; 39+ messages in thread
From: Adrian Hunter @ 2025-01-29  9:56 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel, yan.y.zhao, chao.gao, weijiang.yang

Hi

Changes in V2:

  Note, in addition to addressing replies to version 1, there are changes
  suggested by Sean here:

    https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com

  Patches are re-based on kvm-coco-queue plus:

    latest TDX MMU Part 2 series
    latest TDX vCPU/VM creation series

  x86/virt/tdx: Make tdh_vp_enter() noinstr
  KVM: x86: Allow the use of kvm_load_host_xsave_state() with
            guest_state_protected
  KVM: TDX: Set arch.has_protected_state to true
  KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct
    New patches

  KVM: TDX: Implement TDX vcpu enter/exit path
    Move VCPU_TD_STATE_INITIALIZED check to tdx_vcpu_pre_run() (Xiaoyao)
    Check TD_STATE_RUNNABLE also in tdx_vcpu_pre_run() (Yan)
    Add back 'noinstr' for tdx_vcpu_enter_exit() (Sean)
    Add WARN_ON_ONCE if force_immediate_exit (Sean)
    Add vp_enter_args to vcpu_tdx to store the input/output arguments for
    tdh_vp_enter().
    Don't copy arguments to/from vcpu->arch.regs[] unconditionally. (Sean)

  KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
    Use 1 variable named 'guest_state_loaded' to track host state
    save/restore (Sean)
    Rebased due to moving guest_state_loaded/msr_host_kernel_gs_base
    to struct vcpu_vt.

  KVM: TDX: restore host xsave state when exit from the guest TD
    Drop PT and CET feature flags (Chao)
    Use cpu_feature_enabled() instead of static_cpu_has() (Chao)
    Restore PKRU only if the host value differs from defined
    exit value (Chao)
    Use defined masks to separate XFAM bits into XCR0/XSS (Adrian)
    Use existing kvm_load_host_xsave_state() in place of
    tdx_restore_host_xsave_state() by defining guest CR4, XCR0, XSS
    and PKRU (Sean)
    Do not enter if vital guest state is invalid (Adrian)

  KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
  KVM: TDX: restore user ret MSRs
    No changes

  KVM: TDX: Disable support for TSX and WAITPKG
  KVM: TDX: Save and restore IA32_DEBUGCTL
    New patches

  KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
    Moved from TDX "the rest" to "TD vcpu enter/exit"

This patch series introduces callbacks to facilitate the entry of a TD VCPU
and the corresponding save/restore of host state.

The "outstanding things still to do" from V1 are now all done.  Refer V1
cover letter:

  https://lore.kernel.org/20241121201448.36170-1-adrian.hunter@intel.com

So the patches are ready for hand off to Paolo.

Overview

A TD VCPU is entered via the SEAMCALL TDH.VP.ENTER. The TDX Module manages
the save/restore of guest state and, in conjunction with the SEAMCALL
interface, handles certain aspects of host state. However, there are
specific elements of the host state that require additional attention, as
detailed in the Intel TDX ABI documentation for TDH.VP.ENTER.

TDX is quite different from VMX in this regard.  For VMX, the host VMM is
heavily involved in restoring, managing and saving guest CPU state, whereas
for TDX this is handled by the TDX Module.  In that way, the TDX Module can
protect the confidentiality and integrity of TD CPU state.

The TDX Module does not save/restore all host CPU state because the host
VMM can do it more efficiently and selectively.  CPU state referred to
below is host CPU state.  Often values are already held in memory so no
explicit save is needed, and restoration may not be needed if the kernel
is not using a feature.

Key Details

Argument Passing: Similar to other SEAMCALLs, TDH.VP.ENTER passes
arguments through General Purpose Registers (GPRs). For the special case
of the TD guest invoking TDG.VP.VMCALL, nearly any GPR can be used,
as well as XMM0 to XMM15. Notably, RBP is not used, and Linux mandates the
TDX Module feature NO_RBP_MOD, which is enforced elsewhere. Additionally,
XMM registers are not required for the existing Guest Hypervisor
Communication Interface and are handled by existing KVM code should they be
modified by the guest.

Debug Register Handling: After TDH.VP.ENTER returns, registers DR0, DR1,
DR2, DR3, DR6, and DR7 are set to their architectural INIT values. Existing
KVM code already handles the restoration of host values as needed, refer
vcpu_enter_guest() which calls hw_breakpoint_restore().

MSR Restoration: Certain Model-Specific Registers (MSRs) need to be
restored post TDH.VP.ENTER. The Intel TDX ABI documentation provides a
detailed list in the msr_preservation.json file. Most MSRs do not require
restoration if the guest is not utilizing the corresponding feature. The
following features are currently assumed to be unsupported, and their MSRs
are not restored:
  PERFMON            (TD ATTRIBUTES[63])
  LBRs               (XFAM[15])
  User Interrupts    (XFAM[14])
  Intel PT           (XFAM[8])
  CET                (XFAM[11-12])

Remaining host MSR/Register Handling:

MSR IA32_XFD is already restored by KVM, refer to kvm_put_guest_fpu().
The TDX Module sets MSR IA32_XFD_ERR to its RESET value (0) which is
fine for the kernel.

MSR IA32_DEBUGCTL is addressed in patch "KVM: TDX: Save and restore
IA32_DEBUGCTL".

MSR IA32_UARCH_MISC_CTL is not utilized by the kernel, so it is fine if
the TDX Module sets it to it's RESET value.

MSR IA32_KERNEL_GS_BASE is addressed in patch "KVM: TDX: vcpu_run:
save/restore host state (host kernel gs)".

MSRs IA32_XSS and XCRO are handled in patch "KVM: TDX: restore host xsave
state when exiting from the guest TD".

MSRs IA32_STAR, IA32_LSTAR, IA32_FMASK, and IA32_TSC_AUX are handled in
patch "KVM: TDX: restore user ret MSRs".

MSR IA32_TSX_CTRL and MSR IA32_UMWAIT_CONTROL are addressed in patch
"KVM: TDX: Disable support for TSX and WAITPKG"

Protecting Host CPU State

These patches completely protect host CPU state.  The host CPU state
that can be affected is determined by the TD configuration, especially
XFAM, TD attributes and CPUID.

Should any vital guest information be found to conflict with TD
configuration, VCPU entry is blocked by checks in
tdx_guest_state_is_invalid().

Base

This series is based on kvm-coco-queue plus:

      latest TDX MMU Part 2 series
      latest TDX vCPU/VM creation series

Due to "KVM: VMX: Move common fields of struct", subsequent patches require
changes to use new struct vcpu_vt, refer to the full KVM branch below.

It requires TDX module 1.5.06.00.0744 [1], or later. This is due to removal
of the workarounds for the lack of the NO_RBP_MOD feature required by the
kernel. Now NO_RBP_MOD is enabled (in VM/vCPU creation patches), and this
particular version of the TDX module has a required NO_RBP_MOD related bug
fix.
A working edk2 commit is 95d8a1c ("UnitTestFrameworkPkg: Use TianoCore
mirror of subhook submodule").

Testing

The series has been tested as part of the development branch for the TDX
base series. The testing consisted of TDX kvm-unit-tests and booting a
Linux TD, and TDX enhanced KVM selftests.

The full KVM branch is here:

  https://github.com/intel/tdx/tree/tdx_kvm_dev-2025-01-28.2

Matching QEMU: TBD

[1] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06

Adrian Hunter (4):
      x86/virt/tdx: Make tdh_vp_enter() noinstr
      KVM: TDX: Set arch.has_protected_state to true
      KVM: TDX: Disable support for TSX and WAITPKG
      KVM: TDX: Save and restore IA32_DEBUGCTL

Binbin Wu (1):
      KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct

Chao Gao (1):
      KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr

Isaku Yamahata (5):
      KVM: TDX: Implement TDX vcpu enter/exit path
      KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
      KVM: TDX: restore host xsave state when exit from the guest TD
      KVM: TDX: restore user ret MSRs
      KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior

Sean Christopherson (1):
      KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected

 arch/x86/include/asm/kvm_host.h  |  12 +-
 arch/x86/kvm/svm/svm.c           |   7 +-
 arch/x86/kvm/vmx/common.h        |  68 +++++++++++
 arch/x86/kvm/vmx/main.c          |  48 +++++++-
 arch/x86/kvm/vmx/nested.c        |  10 +-
 arch/x86/kvm/vmx/posted_intr.c   |  18 +--
 arch/x86/kvm/vmx/tdx.c           | 258 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.h           |  19 +--
 arch/x86/kvm/vmx/vmx.c           |  99 +++++++--------
 arch/x86/kvm/vmx/vmx.h           | 104 ++++++----------
 arch/x86/kvm/vmx/x86_ops.h       |  11 ++
 arch/x86/kvm/x86.c               |  46 +++++--
 arch/x86/virt/vmx/tdx/seamcall.S |   3 +
 arch/x86/virt/vmx/tdx/tdx.c      |   2 +-
 14 files changed, 535 insertions(+), 170 deletions(-)

Regards
Adrian

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-03-10 19:08 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-29  9:58 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter
2025-01-29  9:58 ` [PATCH V2 01/12] x86/virt/tdx: Make tdh_vp_enter() noinstr Adrian Hunter
2025-02-16 18:26   ` Paolo Bonzini
2025-02-27 14:13     ` Adrian Hunter
2025-01-29  9:58 ` [PATCH V2 02/12] KVM: x86: Allow the use of kvm_load_host_xsave_state() with guest_state_protected Adrian Hunter
2025-02-20 10:50   ` Xiaoyao Li
2025-02-24 11:38     ` Adrian Hunter
2025-02-25  5:56       ` Xiaoyao Li
2025-02-27 14:14         ` Adrian Hunter
2025-03-06 18:04     ` Paolo Bonzini
2025-03-06 20:43       ` Sean Christopherson
2025-03-06 22:34         ` Paolo Bonzini
2025-03-07 23:04           ` Sean Christopherson
2025-03-10 19:08             ` Paolo Bonzini
2025-01-29  9:58 ` [PATCH V2 03/12] KVM: TDX: Set arch.has_protected_state to true Adrian Hunter
2025-02-20 12:35   ` Xiaoyao Li
2025-02-27 14:17     ` Adrian Hunter
2025-01-29  9:58 ` [PATCH V2 04/12] KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct Adrian Hunter
2025-01-29  9:58 ` [PATCH V2 05/12] KVM: TDX: Implement TDX vcpu enter/exit path Adrian Hunter
2025-02-20 13:16   ` Xiaoyao Li
2025-02-24 12:27     ` Adrian Hunter
2025-02-25  6:15       ` Xiaoyao Li
2025-02-27 18:37         ` Adrian Hunter
2025-03-06 18:19           ` Paolo Bonzini
2025-03-06 19:13             ` Adrian Hunter
2025-01-29  9:58 ` [PATCH V2 06/12] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) Adrian Hunter
2025-01-29  9:58 ` [PATCH V2 07/12] KVM: TDX: restore host xsave state when exit from the guest TD Adrian Hunter
2025-02-25  6:43   ` Xiaoyao Li
2025-02-27 14:29     ` Adrian Hunter
2025-02-28  1:58       ` Xiaoyao Li
2025-01-29  9:58 ` [PATCH V2 08/12] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr Adrian Hunter
2025-02-25  7:00   ` Xiaoyao Li
2025-01-29  9:58 ` [PATCH V2 09/12] KVM: TDX: restore user ret MSRs Adrian Hunter
2025-02-25  7:01   ` Xiaoyao Li
2025-02-27 14:19     ` Adrian Hunter
2025-01-29  9:58 ` [PATCH V2 10/12] KVM: TDX: Disable support for TSX and WAITPKG Adrian Hunter
2025-01-29  9:59 ` [PATCH V2 11/12] KVM: TDX: Save and restore IA32_DEBUGCTL Adrian Hunter
2025-01-29  9:59 ` [PATCH V2 12/12] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior Adrian Hunter
  -- strict thread matches above, loose matches on Subject: below --
2025-01-29  9:56 [PATCH V2 00/12] KVM: TDX: TD vcpu enter/exit Adrian Hunter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox