* [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace
@ 2025-02-11 2:54 Binbin Wu
2025-02-11 2:54 ` [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs Binbin Wu
` (7 more replies)
0 siblings, 8 replies; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
Hi,
While a TD is executing, it can exit to the host VMM (KVM) for many
reasons, analogous to the exit reasons for VMX. Some of the exits will be
handled within KVM in later changes. This series handles just the TDX
exits that may be passed to userspace to handle, all of which arrive via
the TDCALL exit code. Although these userspace exits share the same TDX
exit code, they result in several different types of exits to userspace.
This patch set is one of several patch sets that are all needed to provide
the ability to run a functioning TD VM. With v2, we have addressed the
structural changes suggested by Sean [1] to make TDX exit handling more VMX
like. With the size of the change, it probably needs another round of
review before being ready for kvm-coco-queue.
Base of this series
===================
This series is based on kvm-coco-queue up to the end of MMU part 2, plus
one later section. Stack is:
- '55f78d925e07 ("KVM: TDX: Return -EBUSY when tdh_mem_page_add()
encounters TDX_OPERAND_BUSY")'.
- v2 "KVM: TDX: TD vcpu enter/exit" (There is one small log difference
between the v2 patches and the commits in kvm-coco-queue. No code
differences).
Notable changes since v1 [2]
============================
Record the VMX exit reason and exit qualification in struct vcpu_vt for
TDX, so that TDX/VMX can use the same helpers to get the exit reason and
exit qualification.
Get/set TDVMCALL inputs/outputs directly from/to vp_enter_args in struct
vcpu_tdx. Drop the helpers for reading/writing a0~a3.
Skip setting the return code when the value is TDVMCALL_STATUS_SUCCESS,
because r10 is always 0 for a standard TDVMCALL exit.
Morph the guest-requested exit reason (via TDVMCALL) to KVM's tracked exit
reason when possible, i.e., when the TDVMCALL leaf number is less than
0x10000.
- Morph the TDG.VP.VMCALL with KVM hypercall to EXIT_REASON_VMCALL.
- Morph PV port I/O hypercall to EXIT_REASON_IO_INSTRUCTION.
- Morph PV MMIO hypercall to EXIT_REASON_EPT_MISCONFIG.
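As a rough illustration, the morph described above can be sketched in C.
The exit-reason values mirror arch/x86/include/uapi/asm/vmx.h, but the
helper itself is a hypothetical stand-in, not the series' actual code:

```c
#include <stdint.h>

/* Values mirror the VMX exit reason numbers; EXIT_REASON_TDCALL (77)
 * is the new value added by this series. */
#define EXIT_REASON_VMCALL	18
#define EXIT_REASON_TDCALL	77

/*
 * Hypothetical sketch of the morph: a non-zero r10 marks a
 * vendor-specific (KVM) hypercall, which becomes EXIT_REASON_VMCALL;
 * a standard TDVMCALL leaf below 0x10000 is reported as that VMX exit
 * reason directly (e.g. 30 for port I/O, 49 for EPT misconfig);
 * anything else stays EXIT_REASON_TDCALL.
 */
static uint32_t morph_tdvmcall_exit_reason(uint64_t r10_exit_type,
					   uint64_t r11_leaf)
{
	if (r10_exit_type)
		return EXIT_REASON_VMCALL;
	if (r11_leaf < 0x10000)
		return (uint32_t)r11_leaf;
	return EXIT_REASON_TDCALL;
}
```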
TDX marshals the TDVMCALL ABI registers into KVM's x86 registers to match
the definition of the KVM hypercall ABI _before_
____kvm_emulate_hypercall() gets called. "KVM: x86: Have
____kvm_emulate_hypercall() read the GPRs" is added as a preparation
patch for this change.
Zero run->hypercall.ret in __tdx_map_gpa(), following the pattern of
Paolo's patch [3]; the feedback about adding a helper is still pending.
Combine the TDX_OPERAND_BUSY handling for TDX_OPERAND_ID_TD_EPOCH and
TDX_OPERAND_ID_SEPT. Use EXIT_FASTPATH_EXIT_HANDLED instead of
EXIT_FASTPATH_REENTER_GUEST for TDX_OPERAND_BUSY: when KVM requests
KVM_REQ_OUTSIDE_GUEST_MODE, which has both KVM_REQUEST_WAIT and
KVM_REQUEST_NO_ACTION set, it requires the target vCPUs to leave the
fastpath so that interrupts can be enabled and the IPIs delivered.
Returning EXIT_FASTPATH_EXIT_HANDLED instead of
EXIT_FASTPATH_REENTER_GUEST exits the fastpath; otherwise, the requester
may be blocked endlessly.
Remove the code for reading/writing APIC MMIO, since TDX guests support
x2APIC only.
TDX hypercalls
==============
The TDX module specification defines the TDG.VP.VMCALL API (TDVMCALL) for
guest TDs to make hypercalls to the VMM. When a guest TD issues a
TDVMCALL, it exits to the VMM with a new exit reason. Arguments from the
guest TD and return values from the VMM are passed through the guest
registers. The ABI details for guest TD hypercalls are specified in the
TDX Guest-Host Communication Interface (GHCI) specification [4].
There are two types of hypercalls defined in the GHCI specification:
- Standard TDVMCALLs: When the guest TD sets R10 to 0, the TDVMCALL
sub-function in R11 is one defined in the GHCI specification.
- Vendor-specific TDVMCALLs: When the guest TD sets R10 to a non-zero
value, the TDVMCALL is vendor-specific. For KVM hypercalls from guest
TDs, KVM uses R10 as the KVM hypercall number and R11-R14 as the 4
arguments, with the error code returned in R10.
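The register convention above can be summarized with a small sketch in C.
The struct and helper names are illustrative, not from the series:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative view of the TDVMCALL register convention. */
struct tdvmcall_regs {
	uint64_t r10;		/* 0: standard; non-zero: vendor-specific */
	uint64_t r11;		/* standard: sub-function leaf */
	uint64_t r12, r13, r14;	/* further arguments */
};

/* R10 == 0 selects a GHCI-defined (standard) sub-function in R11. */
static bool tdvmcall_is_standard(const struct tdvmcall_regs *regs)
{
	return regs->r10 == 0;
}

/* For KVM's vendor-specific ABI, R10 doubles as the hypercall number
 * on input and carries the error code on return. */
static uint64_t tdvmcall_kvm_hc_nr(const struct tdvmcall_regs *regs)
{
	return regs->r10;
}
```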
KVM hypercalls
--------------
This series supports KVM hypercalls from guest TDs following the
vendor-specific interface described above. The KVM_HC_MAP_GPA_RANGE
hypercall will need to exit to userspace for handling.
Standard TDVMCALLs exiting to userspace
---------------------------------------
To support basic functionality of TDX, this series includes the following
standard TDVMCALL sub-functions, which reuse existing exit reasons when
they need to exit to userspace for handling:
- TDG.VP.VMCALL<MapGPA> reuses exit reason KVM_EXIT_HYPERCALL with the
hypercall number KVM_HC_MAP_GPA_RANGE.
- TDG.VP.VMCALL<ReportFatalError> reuses exit reason KVM_EXIT_SYSTEM_EVENT
with a new event type KVM_SYSTEM_EVENT_TDX_FATAL.
- TDG.VP.VMCALL<Instruction.IO> reuses exit reason KVM_EXIT_IO.
- TDG.VP.VMCALL<#VE.RequestMMIO> reuses exit reason KVM_EXIT_MMIO.
Support for TD attestation is currently excluded from the TDX basic
enabling, so handling for TDG.VP.VMCALL<SetupEventNotifyInterrupt> and
TDG.VP.VMCALL<GetQuote> is not included in this patch series.
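From userspace's point of view, the reuse above means a VMM's existing
exit dispatch largely covers TDs. A hypothetical sketch follows; the
exit-reason values mirror the KVM_EXIT_* numbers in include/uapi/linux/kvm.h,
but the mapping helper is illustrative only:

```c
#include <stdint.h>

/* Values mirror the KVM_EXIT_* numbers in <linux/kvm.h>. */
enum {
	KVM_EXIT_IO		= 2,
	KVM_EXIT_MMIO		= 6,
	KVM_EXIT_HYPERCALL	= 9,
	KVM_EXIT_SYSTEM_EVENT	= 24,
};

/* Hypothetical helper naming which TDVMCALL sub-function a given
 * userspace exit reason services for a TD, per the list above. */
static const char *td_userspace_exit_source(uint32_t exit_reason)
{
	switch (exit_reason) {
	case KVM_EXIT_HYPERCALL:	return "MapGPA";	   /* KVM_HC_MAP_GPA_RANGE */
	case KVM_EXIT_SYSTEM_EVENT:	return "ReportFatalError"; /* KVM_SYSTEM_EVENT_TDX_FATAL */
	case KVM_EXIT_IO:		return "Instruction.IO";
	case KVM_EXIT_MMIO:		return "#VE.RequestMMIO";
	default:			return "unhandled";
	}
}
```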
Repos
=====
Due to "KVM: VMX: Move common fields of struct" in "TDX vcpu enter/exit" v2
[5], subsequent patches require changes to use the new struct vcpu_vt;
refer to the full KVM branch below.
It requires TDX module 1.5.06.00.0744 [6] or later, as mentioned in [5].
A working edk2 commit is 95d8a1c ("UnitTestFrameworkPkg: Use TianoCore
mirror of subhook submodule").
The full KVM branch is here:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2025-02-10
A matching QEMU is here:
https://github.com/intel-staging/qemu-tdx/tree/tdx-qemu-upstream-v7
Testing
=======
The series has been tested as part of the development branch for the TDX
base series. Testing consisted of TDX kvm-unit-tests, booting a Linux TD,
and TDX-enhanced KVM selftests.
[1] https://lore.kernel.org/kvm/Z1suNzg2Or743a7e@google.com
[2] https://lore.kernel.org/kvm/20241201035358.2193078-1-binbin.wu@linux.intel.com
[3] https://lore.kernel.org/kvm/20241213194137.315304-1-pbonzini@redhat.com
[4] https://cdrdv2.intel.com/v1/dl/getContent/726792
[5] https://lore.kernel.org/kvm/20250129095902.16391-1-adrian.hunter@intel.com
[6] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06
Binbin Wu (3):
KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs
KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
Isaku Yamahata (4):
KVM: TDX: Add a place holder to handle TDX VM exit
KVM: TDX: Add a place holder for handler of TDX hypercalls
(TDG.VP.VMCALL)
KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL
KVM: TDX: Handle TDX PV port I/O hypercall
Sean Christopherson (1):
KVM: TDX: Handle TDX PV MMIO hypercall
Documentation/virt/kvm/api.rst | 9 +
arch/x86/include/asm/shared/tdx.h | 1 +
arch/x86/include/asm/tdx.h | 1 +
arch/x86/include/uapi/asm/vmx.h | 4 +-
arch/x86/kvm/vmx/main.c | 38 ++-
arch/x86/kvm/vmx/tdx.c | 532 +++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 5 +
arch/x86/kvm/vmx/tdx_errno.h | 3 +
arch/x86/kvm/vmx/x86_ops.h | 8 +
arch/x86/kvm/x86.c | 16 +-
arch/x86/kvm/x86.h | 26 +-
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 1 +
13 files changed, 616 insertions(+), 29 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
@ 2025-02-11 2:54 ` Binbin Wu
2025-02-11 5:05 ` Huang, Kai
2025-02-11 10:23 ` Xiaoyao Li
2025-02-11 2:54 ` [PATCH v2 2/8] KVM: TDX: Add a place holder to handle TDX VM exit Binbin Wu
` (6 subsequent siblings)
7 siblings, 2 replies; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
in via the macro.
When emulating KVM hypercalls via TDVMCALL, TDX will marshal the TDVMCALL
ABI registers into KVM's x86 registers to match the definition of the KVM
hypercall ABI _before_ ____kvm_emulate_hypercall() gets called. Therefore,
____kvm_emulate_hypercall() can read the registers internally based on the
KVM hypercall ABI, and those registers can be removed from the
__kvm_emulate_hypercall() macro.
Also, op_64_bit can be determined inside ____kvm_emulate_hypercall();
remove it from the __kvm_emulate_hypercall() macro as well.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
arch/x86/kvm/x86.c | 15 ++++++++-------
arch/x86/kvm/x86.h | 26 +++++++++-----------------
2 files changed, 17 insertions(+), 24 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6ace11303f90..29f33f7c9da9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10022,13 +10022,16 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}
-int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
- unsigned long a0, unsigned long a1,
- unsigned long a2, unsigned long a3,
- int op_64_bit, int cpl,
+int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, int cpl,
int (*complete_hypercall)(struct kvm_vcpu *))
{
unsigned long ret;
+ unsigned long nr = kvm_rax_read(vcpu);
+ unsigned long a0 = kvm_rbx_read(vcpu);
+ unsigned long a1 = kvm_rcx_read(vcpu);
+ unsigned long a2 = kvm_rdx_read(vcpu);
+ unsigned long a3 = kvm_rsi_read(vcpu);
+ int op_64_bit = is_64_bit_hypercall(vcpu);
++vcpu->stat.hypercalls;
@@ -10131,9 +10134,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
if (kvm_hv_hypercall_enabled(vcpu))
return kvm_hv_hypercall(vcpu);
- return __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
- is_64_bit_hypercall(vcpu),
- kvm_x86_call(get_cpl)(vcpu),
+ return __kvm_emulate_hypercall(vcpu, kvm_x86_call(get_cpl)(vcpu),
complete_hypercall_exit);
}
EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 91e50a513100..8b27f70c6321 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -621,25 +621,17 @@ static inline bool user_exit_on_hypercall(struct kvm *kvm, unsigned long hc_nr)
return kvm->arch.hypercall_exit_enabled & BIT(hc_nr);
}
-int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
- unsigned long a0, unsigned long a1,
- unsigned long a2, unsigned long a3,
- int op_64_bit, int cpl,
+int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, int cpl,
int (*complete_hypercall)(struct kvm_vcpu *));
-#define __kvm_emulate_hypercall(_vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, complete_hypercall) \
-({ \
- int __ret; \
- \
- __ret = ____kvm_emulate_hypercall(_vcpu, \
- kvm_##nr##_read(_vcpu), kvm_##a0##_read(_vcpu), \
- kvm_##a1##_read(_vcpu), kvm_##a2##_read(_vcpu), \
- kvm_##a3##_read(_vcpu), op_64_bit, cpl, \
- complete_hypercall); \
- \
- if (__ret > 0) \
- __ret = complete_hypercall(_vcpu); \
- __ret; \
+#define __kvm_emulate_hypercall(_vcpu, cpl, complete_hypercall) \
+({ \
+ int __ret; \
+ __ret = ____kvm_emulate_hypercall(_vcpu, cpl, complete_hypercall); \
+ \
+ if (__ret > 0) \
+ __ret = complete_hypercall(_vcpu); \
+ __ret; \
})
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
--
2.46.0
* [PATCH v2 2/8] KVM: TDX: Add a place holder to handle TDX VM exit
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
2025-02-11 2:54 ` [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs Binbin Wu
@ 2025-02-11 2:54 ` Binbin Wu
2025-02-11 2:54 ` [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL) Binbin Wu
` (5 subsequent siblings)
7 siblings, 0 replies; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Introduce the wiring for handling TDX VM exits by implementing the
callbacks .get_exit_info(), .get_entry_info(), and .handle_exit().
Additionally, add error handling during the TDX VM exit flow, and add a
placeholder to handle various exit reasons.
Store VMX exit reason and exit qualification in struct vcpu_vt for TDX,
so that TDX/VMX can use the same helpers to get exit reason and exit
qualification. Store extended exit qualification and exit GPA info in
struct vcpu_tdx because they are used by TDX code only.
Contention Handling: The TDH.VP.ENTER operation may contend with TDH.MEM.*
operations due to secure EPT or TD EPOCH. If contention occurs, the
return value will have TDX_OPERAND_BUSY set, prompting the vCPU to
attempt re-entry into the guest with EXIT_FASTPATH_EXIT_HANDLED,
not EXIT_FASTPATH_REENTER_GUEST, so that interrupts pending during
IN_GUEST_MODE are guaranteed to be delivered. Otherwise, the requester of
KVM_REQ_OUTSIDE_GUEST_MODE may be blocked endlessly.
Error Handling:
- TDX_SW_ERROR: This includes #UD caused by SEAMCALL instruction if the
CPU isn't in VMX operation, #GP caused by SEAMCALL instruction when TDX
isn't enabled by the BIOS, and TDX_SEAMCALL_VMFAILINVALID when SEAM
firmware is not loaded or disabled.
- TDX_ERROR: This indicates some check failed in the TDX module, preventing
the vCPU from running.
- Failed VM Entry: Exit to userspace with KVM_EXIT_FAIL_ENTRY. Handle it
separately before handling TDX_NON_RECOVERABLE because when off-TD debug
is not enabled, TDX_NON_RECOVERABLE is set.
- TDX_NON_RECOVERABLE: Set by the TDX module when the error is
non-recoverable, indicating that the TDX guest is dead or the vCPU is
disabled.
A special case is triple fault, which also sets TDX_NON_RECOVERABLE but
exits to userspace with KVM_EXIT_SHUTDOWN, aligning with the VMX case.
- Any unhandled VM exit reason will also return to userspace with
KVM_EXIT_INTERNAL_ERROR.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
---
Hypercalls exit to userspace v2:
- Record vmx exit reason and exit_qualification in struct vcpu_vt for TDX,
so that TDX/VMX can use the same helpers to get exit reason and exit
qualification. (Sean)
- Handle failed vmentry separately by KVM_EXIT_FAIL_ENTRY. (Xiaoyao)
- Remove the print of hkid & set_hkid_to_hpa() for TDX_ERROR or
TDX_NON_RECOVERABLE case. (Xiaoyao)
- Handle EXIT_REASON_TRIPLE_FAULT in switch case, and drop the helper
tdx_handle_triple_fault(), open code it. (Sean)
- intr_info should be 0 for the case VMX exit reason is invalid in
tdx_get_exit_info(). (Chao)
- Combine TDX_OPERAND_BUSY for TDX_OPERAND_ID_TD_EPOCH and TDX_OPERAND_ID_SEPT,
use EXIT_FASTPATH_EXIT_HANDLED instead of EXIT_FASTPATH_REENTER_GUEST. Updated
comments.
- Use helper tdx_operand_busy().
- Add vt_get_entry_info() to implement .get_entry_info() for TDX.
Hypercalls exit to userspace v1:
- Dropped Paolo's Reviewed-by since the change is not subtle.
- Mention addition of .get_exit_info() handler in changelog. (Binbin)
- tdh_sept_seamcall() -> tdx_seamcall_sept() in comments. (Binbin)
- Do not open code TDX_ERROR_SEPT_BUSY. (Binbin)
- "TDH.VP.ENTRY" -> "TDH.VP.ENTER". (Binbin)
- Remove the use of union tdx_exit_reason. (Sean)
https://lore.kernel.org/kvm/ZfSExlemFMKjBtZb@google.com/
- Add tdx_check_exit_reason() to check a VMX exit reason against the
status code of TDH.VP.ENTER.
- Move the handling of TDX_ERROR_SEPT_BUSY and (TDX_OPERAND_BUSY |
TDX_OPERAND_ID_TD_EPOCH) into fast path, and add a helper function
tdx_exit_handlers_fastpath().
- Remove the warning on TDX_SW_ERROR in fastpath, but return without
further handling.
- Call kvm_machine_check() for EXIT_REASON_MCE_DURING_VMENTRY, align
with VMX case.
- On failed_vmentry in fast path, return without further handling.
- Exit to userspace for #UD and #GP.
- Fix whitespace in tdx_get_exit_info()
- Add a comment in tdx_handle_exit() to describe failed_vmentry case
is handled by TDX_NON_RECOVERABLE handling.
- Move the code of handling NMI, exception and external interrupts out
of the patch, i.e., the NMI handling in tdx_vcpu_enter_exit() and the
wiring of .handle_exit_irqoff() are removed.
- Drop the check for VCPU_TD_STATE_INITIALIZED in tdx_handle_exit()
because it has been checked in tdx_vcpu_pre_run().
- Update changelog.
---
arch/x86/include/asm/tdx.h | 1 +
arch/x86/kvm/vmx/main.c | 38 +++++++++-
arch/x86/kvm/vmx/tdx.c | 141 ++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 2 +
arch/x86/kvm/vmx/tdx_errno.h | 3 +
arch/x86/kvm/vmx/x86_ops.h | 8 ++
6 files changed, 189 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8a47a69c148e..897db9392d7d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -18,6 +18,7 @@
* TDX module.
*/
#define TDX_ERROR _BITUL(63)
+#define TDX_NON_RECOVERABLE _BITUL(62)
#define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
#define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 341aa537ca72..7f1318c44040 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -169,6 +169,15 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
return vmx_vcpu_run(vcpu, force_immediate_exit);
}
+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_handle_exit(vcpu, fastpath);
+
+ return vmx_handle_exit(vcpu, fastpath);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -216,6 +225,29 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}
+static void vt_get_entry_info(struct kvm_vcpu *vcpu, u32 *intr_info, u32 *error_code)
+{
+ *intr_info = 0;
+ *error_code = 0;
+
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_get_entry_info(vcpu, intr_info, error_code);
+}
+
+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
+ error_code);
+ return;
+ }
+
+ vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -310,7 +342,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_pre_run = vt_vcpu_pre_run,
.vcpu_run = vt_vcpu_run,
- .handle_exit = vmx_handle_exit,
+ .handle_exit = vt_handle_exit,
.skip_emulated_instruction = vmx_skip_emulated_instruction,
.update_emulated_instruction = vmx_update_emulated_instruction,
.set_interrupt_shadow = vmx_set_interrupt_shadow,
@@ -344,8 +376,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_identity_map_addr = vmx_set_identity_map_addr,
.get_mt_mask = vmx_get_mt_mask,
- .get_exit_info = vmx_get_exit_info,
- .get_entry_info = vmx_get_entry_info,
+ .get_exit_info = vt_get_exit_info,
+ .get_entry_info = vt_get_entry_info,
.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0863bdaf761a..cb64675e6ad9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -807,17 +807,70 @@ static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
!guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES));
}
+static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ switch (tdx->vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) {
+ case TDX_SUCCESS:
+ case TDX_NON_RECOVERABLE_VCPU:
+ case TDX_NON_RECOVERABLE_TD:
+ case TDX_NON_RECOVERABLE_TD_NON_ACCESSIBLE:
+ case TDX_NON_RECOVERABLE_TD_WRONG_APIC_MODE:
+ break;
+ default:
+ return -1u;
+ }
+
+ return tdx->vp_enter_ret;
+}
+
static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct vcpu_vt *vt = to_vt(vcpu);
guest_state_enter_irqoff();
tdx->vp_enter_ret = tdh_vp_enter(&tdx->vp, &tdx->vp_enter_args);
+ vt->exit_reason.full = tdx_to_vmx_exit_reason(vcpu);
+
+ vt->exit_qualification = tdx->vp_enter_args.rcx;
+ tdx->ext_exit_qualification = tdx->vp_enter_args.rdx;
+ tdx->exit_gpa = tdx->vp_enter_args.r8;
+ vt->exit_intr_info = tdx->vp_enter_args.r9;
+
guest_state_exit_irqoff();
}
+static bool tdx_failed_vmentry(struct kvm_vcpu *vcpu)
+{
+ return vmx_get_exit_reason(vcpu).failed_vmentry &&
+ vmx_get_exit_reason(vcpu).full != -1u;
+}
+
+static fastpath_t tdx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
+{
+ u64 vp_enter_ret = to_tdx(vcpu)->vp_enter_ret;
+
+ /*
+ * TDX_OPERAND_BUSY could be returned for SEPT due to 0-step mitigation
+ * or for TD EPOCH due to contention with TDH.MEM.TRACK on TDH.VP.ENTER.
+ *
+ * When KVM requests KVM_REQ_OUTSIDE_GUEST_MODE, which has both
+ * KVM_REQUEST_WAIT and KVM_REQUEST_NO_ACTION set, it requires target
+ * vCPUs leaving fastpath so that interrupt can be enabled to ensure the
+ * IPIs can be delivered. Return EXIT_FASTPATH_EXIT_HANDLED instead of
+ * EXIT_FASTPATH_REENTER_GUEST to exit fastpath, otherwise, the
+ * requester may be blocked endlessly.
+ */
+ if (unlikely(tdx_operand_busy(vp_enter_ret)))
+ return EXIT_FASTPATH_EXIT_HANDLED;
+
+ return EXIT_FASTPATH_NONE;
+}
+
#define TDX_REGS_UNSUPPORTED_SET (BIT(VCPU_EXREG_RFLAGS) | \
BIT(VCPU_EXREG_SEGMENTS))
@@ -863,9 +916,18 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
vcpu->arch.regs_avail &= ~TDX_REGS_UNSUPPORTED_SET;
+ if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
+ return EXIT_FASTPATH_NONE;
+
+ if (unlikely(vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
+ kvm_machine_check();
+
trace_kvm_exit(vcpu, KVM_ISA_VMX);
- return EXIT_FASTPATH_NONE;
+ if (unlikely(tdx_failed_vmentry(vcpu)))
+ return EXIT_FASTPATH_NONE;
+
+ return tdx_exit_handlers_fastpath(vcpu);
}
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
@@ -1155,6 +1217,83 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
return tdx_sept_drop_private_spte(kvm, gfn, level, pfn_to_page(pfn));
}
+int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u64 vp_enter_ret = tdx->vp_enter_ret;
+ union vmx_exit_reason exit_reason = vmx_get_exit_reason(vcpu);
+
+ if (fastpath != EXIT_FASTPATH_NONE)
+ return 1;
+
+ /*
+ * Handle TDX SW errors, including TDX_SEAMCALL_UD, TDX_SEAMCALL_GP and
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+ if (unlikely((vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR)) {
+ KVM_BUG_ON(!kvm_rebooting, vcpu->kvm);
+ goto unhandled_exit;
+ }
+
+ if (unlikely(tdx_failed_vmentry(vcpu))) {
+ /*
+ * If the guest state is protected, that means off-TD debug is
+ * not enabled, TDX_NON_RECOVERABLE must be set.
+ */
+ WARN_ON_ONCE(vcpu->arch.guest_state_protected &&
+ !(vp_enter_ret & TDX_NON_RECOVERABLE));
+ vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+ vcpu->run->fail_entry.hardware_entry_failure_reason = exit_reason.full;
+ vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
+ return 0;
+ }
+
+ if (unlikely(vp_enter_ret & (TDX_ERROR | TDX_NON_RECOVERABLE)) &&
+ exit_reason.basic != EXIT_REASON_TRIPLE_FAULT) {
+ kvm_pr_unimpl("TD vp_enter_ret 0x%llx\n", vp_enter_ret);
+ goto unhandled_exit;
+ }
+
+ WARN_ON_ONCE(exit_reason.basic != EXIT_REASON_TRIPLE_FAULT &&
+ (vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) != TDX_SUCCESS);
+
+ switch (exit_reason.basic) {
+ case EXIT_REASON_TRIPLE_FAULT:
+ vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+ vcpu->mmio_needed = 0;
+ return 0;
+ default:
+ break;
+ }
+
+unhandled_exit:
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
+ vcpu->run->internal.ndata = 2;
+ vcpu->run->internal.data[0] = vp_enter_ret;
+ vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
+ return 0;
+}
+
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ *reason = tdx->vt.exit_reason.full;
+ if (*reason != -1u) {
+ *info1 = vmx_get_exit_qual(vcpu);
+ *info2 = tdx->ext_exit_qualification;
+ *intr_info = vmx_get_intr_info(vcpu);
+ } else {
+ *info1 = 0;
+ *info2 = 0;
+ *intr_info = 0;
+ }
+
+ *error_code = 0;
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 8339bbf0fdd4..0e3522e423cc 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -46,6 +46,8 @@ enum vcpu_tdx_state {
struct vcpu_tdx {
struct kvm_vcpu vcpu;
struct vcpu_vt vt;
+ u64 ext_exit_qualification;
+ gpa_t exit_gpa;
struct tdx_module_args vp_enter_args;
struct tdx_vp vp;
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index f9dbb3a065cc..6ff4672c4181 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -10,6 +10,9 @@
* TDX SEAMCALL Status Codes (returned in RAX)
*/
#define TDX_NON_RECOVERABLE_VCPU 0x4000000100000000ULL
+#define TDX_NON_RECOVERABLE_TD 0x4000000200000000ULL
+#define TDX_NON_RECOVERABLE_TD_NON_ACCESSIBLE 0x6000000500000000ULL
+#define TDX_NON_RECOVERABLE_TD_WRONG_APIC_MODE 0x6000000700000000ULL
#define TDX_INTERRUPTED_RESUMABLE 0x8000000300000000ULL
#define TDX_OPERAND_INVALID 0xC000010000000000ULL
#define TDX_OPERAND_BUSY 0x8000020000000000ULL
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f856eac8f1e8..92716f6486e9 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -135,6 +135,10 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath);
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -169,6 +173,10 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediat
}
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath) { return 0; }
+static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
+ u64 *info2, u32 *intr_info, u32 *error_code) {}
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.46.0
* [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
2025-02-11 2:54 ` [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs Binbin Wu
2025-02-11 2:54 ` [PATCH v2 2/8] KVM: TDX: Add a place holder to handle TDX VM exit Binbin Wu
@ 2025-02-11 2:54 ` Binbin Wu
2025-02-11 8:41 ` Chao Gao
2025-02-11 2:54 ` [PATCH v2 4/8] KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL Binbin Wu
` (4 subsequent siblings)
7 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Add a placeholder and related helper functions in preparation for
TDG.VP.VMCALL handling.
The TDX module specification defines the TDG.VP.VMCALL API (TDVMCALL for
short) for the guest TD to issue hypercalls to the VMM. When the guest TD
issues a TDVMCALL, it exits to the VMM with a new exit reason. Arguments
from the guest TD and values returned from the VMM are passed in the guest
registers. The guest RCX register indicates which registers are used.
Define helper functions to access those registers.
A new VMX exit reason TDCALL is added to indicate the exit is due to a
TDVMCALL from the guest TD. Define the TDCALL exit reason and add a
placeholder to handle such exits.
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
---
Hypercalls exit to userspace v2:
- Get/set tdvmcall inputs/outputs from/to vp_enter_args.
- Morph the guest requested exit reason (via TDVMCALL) to KVM's tracked
exit reason when it could, i.e. when the TDVMCALL leaf number is less
than 0x10000. (Sean)
- Drop helpers for read/write a0~a3.
Hypercalls exit to userspace v1:
- Update changelog.
- Drop the unused tdx->tdvmcall. (Chao)
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
---
arch/x86/include/uapi/asm/vmx.h | 4 ++-
arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++++++++++++++++-
2 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index a5faf6d88f1b..6a9f268a2d2c 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -92,6 +92,7 @@
#define EXIT_REASON_TPAUSE 68
#define EXIT_REASON_BUS_LOCK 74
#define EXIT_REASON_NOTIFY 75
+#define EXIT_REASON_TDCALL 77
#define VMX_EXIT_REASONS \
{ EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \
@@ -155,7 +156,8 @@
{ EXIT_REASON_UMWAIT, "UMWAIT" }, \
{ EXIT_REASON_TPAUSE, "TPAUSE" }, \
{ EXIT_REASON_BUS_LOCK, "BUS_LOCK" }, \
- { EXIT_REASON_NOTIFY, "NOTIFY" }
+ { EXIT_REASON_NOTIFY, "NOTIFY" }, \
+ { EXIT_REASON_TDCALL, "TDCALL" }
#define VMX_EXIT_REASON_FLAGS \
{ VMX_EXIT_REASONS_FAILED_VMENTRY, "FAILED_VMENTRY" }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cb64675e6ad9..420ee492e919 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -235,6 +235,25 @@ static bool tdx_operand_busy(u64 err)
*/
static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+{
+ return to_tdx(vcpu)->vp_enter_args.r10;
+}
+static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
+{
+ return to_tdx(vcpu)->vp_enter_args.r11;
+}
+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
+ long val)
+{
+ to_tdx(vcpu)->vp_enter_args.r10 = val;
+}
+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
+ unsigned long val)
+{
+ to_tdx(vcpu)->vp_enter_args.r11 = val;
+}
+
static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
{
tdx_guest_keyid_free(kvm_tdx->hkid);
@@ -810,6 +829,7 @@ static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u32 exit_reason;
switch (tdx->vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) {
case TDX_SUCCESS:
@@ -822,7 +842,21 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
return -1u;
}
- return tdx->vp_enter_ret;
+ exit_reason = tdx->vp_enter_ret;
+
+ switch (exit_reason) {
+ case EXIT_REASON_TDCALL:
+ if (tdvmcall_exit_type(vcpu))
+ return EXIT_REASON_VMCALL;
+
+ if (tdvmcall_leaf(vcpu) < 0x10000)
+ return tdvmcall_leaf(vcpu);
+ break;
+ default:
+ break;
+ }
+
+ return exit_reason;
}
static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
@@ -930,6 +964,17 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
return tdx_exit_handlers_fastpath(vcpu);
}
+static int handle_tdvmcall(struct kvm_vcpu *vcpu)
+{
+ switch (tdvmcall_leaf(vcpu)) {
+ default:
+ break;
+ }
+
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+ return 1;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
u64 shared_bit = (pgd_level == 5) ? TDX_SHARED_BIT_PWL_5 :
@@ -1262,6 +1307,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
vcpu->mmio_needed = 0;
return 0;
+ case EXIT_REASON_TDCALL:
+ return handle_tdvmcall(vcpu);
default:
break;
}
--
2.46.0
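For reference, the exit-reason morphing done by tdx_to_vmx_exit_reason() in the patch above can be sketched as a plain function. This is a simplified stand-in, not kernel code: the constants mirror the VMX exit-reason values and the two arguments stand in for vp_enter_args.r10 (exit type) and .r11 (leaf).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's exit reason definitions. */
#define EXIT_REASON_TDCALL 77
#define EXIT_REASON_VMCALL 18

/*
 * Morph a TDCALL exit: a non-zero exit type (r10) means a vendor-specific
 * TDVMCALL (KVM hypercall); a leaf (r11) below 0x10000 reuses the VMX exit
 * reason of the same value; anything else stays EXIT_REASON_TDCALL.
 */
static uint32_t morph_tdcall(uint64_t r10_exit_type, uint64_t r11_leaf)
{
	if (r10_exit_type)
		return EXIT_REASON_VMCALL;
	if (r11_leaf < 0x10000)
		return (uint32_t)r11_leaf;
	return EXIT_REASON_TDCALL;
}
```

With this morphing, later patches can route, e.g., leaf 30 (Instruction.IO) through the same switch arm as a native VMX EXIT_REASON_IO_INSTRUCTION.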
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v2 4/8] KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
` (2 preceding siblings ...)
2025-02-11 2:54 ` [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL) Binbin Wu
@ 2025-02-11 2:54 ` Binbin Wu
2025-02-11 23:48 ` Sean Christopherson
2025-02-11 2:54 ` [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA> Binbin Wu
` (3 subsequent siblings)
7 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Handle KVM hypercalls for TDX according to the TDX Guest-Host Communication
Interface (GHCI) specification.

The TDX GHCI specification defines the ABI for the guest TD to issue
hypercalls. When R10 is non-zero, it indicates the TDG.VP.VMCALL is
vendor-specific. KVM uses R10 as the KVM hypercall number and R11-R14
as the 4 arguments, while the error code is returned in R10.

Morph a TDG.VP.VMCALL that carries a KVM hypercall to EXIT_REASON_VMCALL and
marshal r10~r14 from vp_enter_args to the appropriate x86 registers for
KVM hypercall handling.
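The register marshalling described above can be sketched as follows. The struct and field names here are hypothetical stand-ins for the kernel's vp_enter_args and GPR accessors, shown only to make the r10~r14 to RAX/RBX/RCX/RDX/RSI mapping explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirrors of the TDVMCALL arguments and KVM hypercall GPRs. */
struct tdvmcall_args { uint64_t r10, r11, r12, r13, r14; };
struct kvm_hc_regs  { uint64_t rax, rbx, rcx, rdx, rsi; };

/* r10 carries the hypercall number, r11..r14 the four arguments. */
static struct kvm_hc_regs marshal_kvm_hypercall(const struct tdvmcall_args *a)
{
	struct kvm_hc_regs r = {
		.rax = a->r10,	/* hypercall number */
		.rbx = a->r11,	/* arg0 */
		.rcx = a->r12,	/* arg1 */
		.rdx = a->r13,	/* arg2 */
		.rsi = a->r14,	/* arg3 */
	};
	return r;
}
```

On completion, the return value travels the other way: the hypercall result is written back into r10 of the guest-visible TDVMCALL ABI.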
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
Hypercalls exit to userspace v2:
- Morph the TDG.VP.VMCALL with KVM hypercall to EXIT_REASON_VMCALL.
- Marshall values to the appropriate x86 registers for KVM hypercall
handling.
Hypercalls exit to userspace v1:
- Renamed from "KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL" to
"KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL".
- Update the change log.
- Rebased on Sean's "Prep KVM hypercall handling for TDX" patch set.
https://lore.kernel.org/kvm/20241128004344.4072099-1-seanjc@google.com
- Use the right register (i.e. R10) to set the return code after returning
back from userspace.
---
arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 420ee492e919..daa49f2ee2b3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -964,6 +964,23 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
return tdx_exit_handlers_fastpath(vcpu);
}
+static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
+{
+ tdvmcall_set_return_code(vcpu, vcpu->run->hypercall.ret);
+ return 1;
+}
+
+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
+{
+ kvm_rax_write(vcpu, to_tdx(vcpu)->vp_enter_args.r10);
+ kvm_rbx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r11);
+ kvm_rcx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r12);
+ kvm_rdx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r13);
+ kvm_rsi_write(vcpu, to_tdx(vcpu)->vp_enter_args.r14);
+
+ return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit);
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
switch (tdvmcall_leaf(vcpu)) {
@@ -1309,6 +1326,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return 0;
case EXIT_REASON_TDCALL:
return handle_tdvmcall(vcpu);
+ case EXIT_REASON_VMCALL:
+ return tdx_emulate_vmcall(vcpu);
default:
break;
}
--
2.46.0
* [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
` (3 preceding siblings ...)
2025-02-11 2:54 ` [PATCH v2 4/8] KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL Binbin Wu
@ 2025-02-11 2:54 ` Binbin Wu
2025-02-11 6:54 ` Yan Zhao
2025-02-11 2:54 ` [PATCH v2 6/8] KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError> Binbin Wu
` (2 subsequent siblings)
7 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.
MapGPA is used by a TDX guest to request that a GPA range be mapped as
private or shared memory; the request needs to exit to userspace for
handling. KVM already implements a similar hypercall, KVM_HC_MAP_GPA_RANGE,
which exits to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity
checks, convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE, and forward the
request to userspace.

To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA
range is split into 2MB chunks and pending interrupts are checked between
chunks. This allows for timely injection of interrupts and prevents issues
with guest lockup detection. The TDX guest should retry the operation for
the GPA starting at the address specified in R11 when the TDVMCALL returns
TDVMCALL_STATUS_RETRY as the status code.

Note that userspace needs to enable KVM_CAP_EXIT_HYPERCALL with the
KVM_HC_MAP_GPA_RANGE bit set for a TD VM.
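The chunking policy above can be illustrated with a small cursor helper. This is a sketch of the idea only (the kernel tracks the cursor in map_gpa_next/map_gpa_end and advances it in the completion callback), not the actual implementation:

```c
#include <assert.h>
#include <stdint.h>

#define TDX_MAP_GPA_MAX_LEN (2ULL * 1024 * 1024)

/*
 * Return the size of the chunk to convert now, capped at 2MB, and advance
 * the cursor. Between chunks the caller can check for pending interrupts
 * and, if any, ask the guest to retry from the current cursor.
 */
static uint64_t map_gpa_next_chunk(uint64_t *next, uint64_t end)
{
	uint64_t size = end - *next;

	if (size > TDX_MAP_GPA_MAX_LEN)
		size = TDX_MAP_GPA_MAX_LEN;
	*next += size;
	return size;
}
```

For a 5MB request this yields three chunks (2MB, 2MB, 1MB), giving two interrupt-check windows.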
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
Hypercalls exit to userspace v2:
- Skip setting of return code as TDVMCALL_STATUS_SUCCESS.
- Use vp_enter_args instead of x86 registers.
- Remove unnecessary comments.
- Zero run->hypercall.ret in __tdx_map_gpa() following the pattern of Paolo's
patch, the feedback of adding a helper is still pending. (Rick)
https://lore.kernel.org/kvm/20241213194137.315304-1-pbonzini@redhat.com
Hypercalls exit to userspace v1:
- New added.
Implement one of the hypercalls need to exit to userspace for handling after
dropping "KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL", which tries to resolve
Sean's comment.
https://lore.kernel.org/kvm/Zg18ul8Q4PGQMWam@google.com/
- Check interrupt pending between chunks suggested by Sean.
https://lore.kernel.org/kvm/ZleJvmCawKqmpFIa@google.com/
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
- Use vt_is_tdx_private_gpa()
---
arch/x86/include/asm/shared/tdx.h | 1 +
arch/x86/kvm/vmx/tdx.c | 113 ++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 3 +
3 files changed, 117 insertions(+)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 4aedab1f2a1a..f23657350d28 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -77,6 +77,7 @@
#define TDVMCALL_STATUS_SUCCESS 0x0000000000000000ULL
#define TDVMCALL_STATUS_RETRY 0x0000000000000001ULL
#define TDVMCALL_STATUS_INVALID_OPERAND 0x8000000000000000ULL
+#define TDVMCALL_STATUS_ALIGN_ERROR 0x8000000000000002ULL
/*
* Bitmasks of exposed registers (with VMM).
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index daa49f2ee2b3..8b51b4c937e9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -981,9 +981,122 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit);
}
+/*
+ * Split into chunks and check interrupt pending between chunks. This allows
+ * for timely injection of interrupts to prevent issues with guest lockup
+ * detection.
+ */
+#define TDX_MAP_GPA_MAX_LEN (2 * 1024 * 1024)
+static void __tdx_map_gpa(struct vcpu_tdx *tdx);
+
+static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (vcpu->run->hypercall.ret) {
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+ tdx->vp_enter_args.r11 = tdx->map_gpa_next;
+ return 1;
+ }
+
+ tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
+ if (tdx->map_gpa_next >= tdx->map_gpa_end)
+ return 1;
+
+ /*
+ * Stop processing the remaining part if there is pending interrupt.
+ * Skip checking pending virtual interrupt (reflected by
+ * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
+ * if guest disabled interrupt, it's OK not returning back to guest
+ * due to a non-NMI interrupt. Also it's rare to do TDVMCALL_MAP_GPA
+ * immediately after STI or MOV/POP SS.
+ */
+ if (pi_has_pending_interrupt(vcpu) ||
+ kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
+ tdx->vp_enter_args.r11 = tdx->map_gpa_next;
+ return 1;
+ }
+
+ __tdx_map_gpa(tdx);
+ return 0;
+}
+
+static void __tdx_map_gpa(struct vcpu_tdx *tdx)
+{
+ u64 gpa = tdx->map_gpa_next;
+ u64 size = tdx->map_gpa_end - tdx->map_gpa_next;
+
+ if (size > TDX_MAP_GPA_MAX_LEN)
+ size = TDX_MAP_GPA_MAX_LEN;
+
+ tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL;
+ tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
+ /*
+ * In principle this should have been -KVM_ENOSYS, but userspace (QEMU <=9.2)
+ * assumed that vcpu->run->hypercall.ret is never changed by KVM and thus that
+ * it was always zero on KVM_EXIT_HYPERCALL. Since KVM now overwrites
+ * vcpu->run->hypercall.ret, ensure that it is zero to not break QEMU.
+ */
+ tdx->vcpu.run->hypercall.ret = 0;
+ tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
+ tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
+ tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
+ KVM_MAP_GPA_RANGE_ENCRYPTED :
+ KVM_MAP_GPA_RANGE_DECRYPTED;
+ tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
+
+ tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa;
+}
+
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u64 gpa = tdx->vp_enter_args.r12;
+ u64 size = tdx->vp_enter_args.r13;
+ u64 ret;
+
+ /*
+ * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
+ * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
+ * bit set. If not, the error code is not defined in GHCI for TDX, use
+ * TDVMCALL_STATUS_INVALID_OPERAND for this case.
+ */
+ if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
+ ret = TDVMCALL_STATUS_INVALID_OPERAND;
+ goto error;
+ }
+
+ if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
+ !kvm_vcpu_is_legal_gpa(vcpu, gpa + size - 1) ||
+ (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
+ vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))) {
+ ret = TDVMCALL_STATUS_INVALID_OPERAND;
+ goto error;
+ }
+
+ if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
+ ret = TDVMCALL_STATUS_ALIGN_ERROR;
+ goto error;
+ }
+
+ tdx->map_gpa_end = gpa + size;
+ tdx->map_gpa_next = gpa;
+
+ __tdx_map_gpa(tdx);
+ return 0;
+
+error:
+ tdvmcall_set_return_code(vcpu, ret);
+ tdx->vp_enter_args.r11 = gpa;
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
switch (tdvmcall_leaf(vcpu)) {
+ case TDVMCALL_MAP_GPA:
+ return tdx_map_gpa(vcpu);
default:
break;
}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 0e3522e423cc..45c1d064b6b7 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -57,6 +57,9 @@ struct vcpu_tdx {
u64 vp_enter_ret;
enum vcpu_tdx_state state;
+
+ u64 map_gpa_next;
+ u64 map_gpa_end;
};
void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
--
2.46.0
* [PATCH v2 6/8] KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
` (4 preceding siblings ...)
2025-02-11 2:54 ` [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA> Binbin Wu
@ 2025-02-11 2:54 ` Binbin Wu
2025-02-12 0:18 ` Sean Christopherson
2025-02-11 2:54 ` [PATCH v2 7/8] KVM: TDX: Handle TDX PV port I/O hypercall Binbin Wu
2025-02-11 2:54 ` [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall Binbin Wu
7 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with
a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for
handling.
A TD guest can use TDG.VP.VMCALL<ReportFatalError> to report a fatal error
it has experienced. This hypercall is special: because the guest is
requesting termination along with the error information, KVM needs to
forward the hypercall to userspace regardless, so KVM skips sanity checks
and lets userspace decide what to do.
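The register-mask handling used by the COPY_REG sequence in the patch below can be modeled as follows. This is an illustrative sketch: ndata is the two fixed entries (error code in data[0], info-page GPA in data[1]) plus one entry per register selected by the guest's RCX mask, in the GHCI-defined order:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the entries of system_event.data[] populated for a
 * ReportFatalError exit: 2 fixed entries plus the optional registers
 * selected by reg_mask, walked in GHCI order
 * (r14, r15, rbx, rdi, rsi, r8, r9, rdx).
 */
static uint32_t fatal_error_ndata(uint64_t reg_mask)
{
	static const int bits[] = { 14, 15, 3, 7, 6, 8, 9, 2 };
	uint32_t n = 2;
	unsigned int i;

	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
		if (reg_mask & (1ULL << bits[i]))
			n++;
	return n;
}
```

With no optional registers ndata is 2; with all eight it is 10, which is why ndata is computed rather than hardcoded.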
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
Hypercalls exit to userspace v2:
- Use vp_enter_args instead of x86 registers.
- vcpu->run->system_event.ndata is not hardcoded to 10. (Xiaoyao)
- Undefine COPY_REG after use. (Yilun)
- Updated the document about KVM_SYSTEM_EVENT_TDX_FATAL. (Chao)
Hypercalls exit to userspace v1:
- New added.
Implement one of the hypercalls need to exit to userspace for handling after
reverting "KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL", which tries to resolve
Sean's comment.
https://lore.kernel.org/kvm/Zg18ul8Q4PGQMWam@google.com/
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
---
Documentation/virt/kvm/api.rst | 9 +++++++
arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 1 +
3 files changed, 55 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 2ba70c1fad51..5e415b312ab0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6823,6 +6823,7 @@ should put the acknowledged interrupt vector into the 'epr' field.
#define KVM_SYSTEM_EVENT_WAKEUP 4
#define KVM_SYSTEM_EVENT_SUSPEND 5
#define KVM_SYSTEM_EVENT_SEV_TERM 6
+ #define KVM_SYSTEM_EVENT_TDX_FATAL 7
__u32 type;
__u32 ndata;
__u64 data[16];
@@ -6849,6 +6850,14 @@ Valid values for 'type' are:
reset/shutdown of the VM.
- KVM_SYSTEM_EVENT_SEV_TERM -- an AMD SEV guest requested termination.
The guest physical address of the guest's GHCB is stored in `data[0]`.
+ - KVM_SYSTEM_EVENT_TDX_FATAL -- a TDX guest reported a fatal error state.
+ The error code reported by the TDX guest is stored in `data[0]`; the
+ error code format is defined in the TDX GHCI specification.
+ If bit 63 of `data[0]` is set, the TD provided additional information
+ in a page of shared memory, whose guest physical address is stored in
+ `data[1]`.
+ An optional error message is provided in `data[2]` ~ `data[9]` as a
+ byte sequence, LSB filled first. Typically ASCII codes (0x20-0x7e) are used.
- KVM_SYSTEM_EVENT_WAKEUP -- the exiting vCPU is in a suspended state and
KVM has recognized a wakeup event. Userspace may honor this event by
marking the exiting vCPU as runnable, or deny it and call KVM_RUN again.
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8b51b4c937e9..85956768c515 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1092,11 +1092,56 @@ static int tdx_map_gpa(struct kvm_vcpu *vcpu)
return 1;
}
+static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u64 reg_mask = tdx->vp_enter_args.rcx;
+ u64 *opt_regs;
+
+ /*
+ * Skip sanity checks and let userspace decide what to do if sanity
+ * checks fail.
+ */
+ vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+ vcpu->run->system_event.type = KVM_SYSTEM_EVENT_TDX_FATAL;
+ /* Error codes. */
+ vcpu->run->system_event.data[0] = tdx->vp_enter_args.r12;
+ /* GPA of additional information page. */
+ vcpu->run->system_event.data[1] = tdx->vp_enter_args.r13;
+ /* Information passed via registers (up to 64 bytes). */
+ opt_regs = &vcpu->run->system_event.data[2];
+
+#define COPY_REG(REG, MASK) \
+ do { \
+ if (reg_mask & MASK) { \
+ *opt_regs = tdx->vp_enter_args.REG; \
+ opt_regs++; \
+ } \
+ } while (0)
+
+ /* The order is defined in GHCI. */
+ COPY_REG(r14, BIT_ULL(14));
+ COPY_REG(r15, BIT_ULL(15));
+ COPY_REG(rbx, BIT_ULL(3));
+ COPY_REG(rdi, BIT_ULL(7));
+ COPY_REG(rsi, BIT_ULL(6));
+ COPY_REG(r8, BIT_ULL(8));
+ COPY_REG(r9, BIT_ULL(9));
+ COPY_REG(rdx, BIT_ULL(2));
+#undef COPY_REG
+
+ vcpu->run->system_event.ndata = opt_regs - vcpu->run->system_event.data;
+
+ return 0;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
switch (tdvmcall_leaf(vcpu)) {
case TDVMCALL_MAP_GPA:
return tdx_map_gpa(vcpu);
+ case TDVMCALL_REPORT_FATAL_ERROR:
+ return tdx_report_fatal_error(vcpu);
default:
break;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 45e6d8fca9b9..937400350317 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -375,6 +375,7 @@ struct kvm_run {
#define KVM_SYSTEM_EVENT_WAKEUP 4
#define KVM_SYSTEM_EVENT_SUSPEND 5
#define KVM_SYSTEM_EVENT_SEV_TERM 6
+#define KVM_SYSTEM_EVENT_TDX_FATAL 7
__u32 type;
__u32 ndata;
union {
--
2.46.0
* [PATCH v2 7/8] KVM: TDX: Handle TDX PV port I/O hypercall
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
` (5 preceding siblings ...)
2025-02-11 2:54 ` [PATCH v2 6/8] KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError> Binbin Wu
@ 2025-02-11 2:54 ` Binbin Wu
2025-02-11 2:54 ` [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall Binbin Wu
7 siblings, 0 replies; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Emulate port I/O requested by TDX guest via TDVMCALL with leaf
Instruction.IO (same value as EXIT_REASON_IO_INSTRUCTION) according to
TDX Guest Host Communication Interface (GHCI).
All port I/O instructions inside the TDX guest trigger the #VE exception.
On #VE triggered by I/O instructions, TDX guest can call TDVMCALL with
leaf Instruction.IO to request VMM to emulate I/O instructions.
Similar to normal port I/O emulation, try to handle the port I/O in the
kernel first; if the kernel can't handle it, forward the request to
userspace.

Note that string I/O operations are not supported in TDX. The guest should
unroll them before issuing the TDVMCALL.
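The operand validation performed by tdx_emulate_io() in the patch below can be sketched as a small predicate. This is an illustrative stand-in: `write` comes from r13 and must be 0 (in) or 1 (out), and the size from r12 must be a legal port I/O width:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Direction must be 0 or 1; port I/O width must be 1, 2, or 4 bytes. */
static bool pio_operands_valid(uint64_t write, uint64_t size)
{
	if (write != 0 && write != 1)
		return false;
	return size == 1 || size == 2 || size == 4;
}
```

Anything failing this check completes immediately with TDVMCALL_STATUS_INVALID_OPERAND rather than reaching the emulator.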
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
Hypercalls exit to userspace v2:
- Morph PV port I/O hypercall to EXIT_REASON_IO_INSTRUCTION. (Sean)
- Use vp_enter_args instead of x86 registers.
- Check write is either 0 or 1. (Chao)
- Skip setting return code as TDVMCALL_STATUS_SUCCESS. (Sean)
Hypercalls exit to userspace v1:
- Renamed from "KVM: TDX: Handle TDX PV port io hypercall" to
"KVM: TDX: Handle TDX PV port I/O hypercall".
- Update changelog.
- Add missing curly brackets.
- Move reset of pio.count to tdx_complete_pio_out() and remove the stale
comment. (binbin)
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
- Set status code to TDVMCALL_STATUS_SUCCESS when PIO is handled in kernel.
- Don't write to R11 when it is a write operation for output.
v18:
- Fix out case to set R10 and R11 correctly when user space handled port
out.
---
arch/x86/kvm/vmx/tdx.c | 60 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 85956768c515..f13da28dd4a2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1135,6 +1135,64 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
return 0;
}
+static int tdx_complete_pio_out(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.pio.count = 0;
+ return 1;
+}
+
+static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ int ret;
+
+ ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
+ vcpu->arch.pio.port, &val, 1);
+
+ WARN_ON_ONCE(!ret);
+
+ tdvmcall_set_return_val(vcpu, val);
+
+ return 1;
+}
+
+static int tdx_emulate_io(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ unsigned int port;
+ u64 size, write;
+ int ret;
+
+ ++vcpu->stat.io_exits;
+
+ size = tdx->vp_enter_args.r12;
+ write = tdx->vp_enter_args.r13;
+ port = tdx->vp_enter_args.r14;
+
+ if ((write != 0 && write != 1) || (size != 1 && size != 2 && size != 4)) {
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+ return 1;
+ }
+
+ if (write) {
+ val = tdx->vp_enter_args.r15;
+ ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
+ } else {
+ ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
+ }
+
+ if (!ret)
+ vcpu->arch.complete_userspace_io = write ? tdx_complete_pio_out :
+ tdx_complete_pio_in;
+ else if (!write)
+ tdvmcall_set_return_val(vcpu, val);
+
+ return ret;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
switch (tdvmcall_leaf(vcpu)) {
@@ -1486,6 +1544,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return handle_tdvmcall(vcpu);
case EXIT_REASON_VMCALL:
return tdx_emulate_vmcall(vcpu);
+ case EXIT_REASON_IO_INSTRUCTION:
+ return tdx_emulate_io(vcpu);
default:
break;
}
--
2.46.0
* [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
` (6 preceding siblings ...)
2025-02-11 2:54 ` [PATCH v2 7/8] KVM: TDX: Handle TDX PV port I/O hypercall Binbin Wu
@ 2025-02-11 2:54 ` Binbin Wu
2025-02-12 2:28 ` Chao Gao
7 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 2:54 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Sean Christopherson <sean.j.christopherson@intel.com>
Handle the TDX PV MMIO hypercall when a TDX guest calls TDVMCALL with the
leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according
to the TDX Guest-Host Communication Interface (GHCI) spec.

For TDX guests, the VMM is not allowed to access vCPU registers or private
memory, and code instructions must be fetched from private memory. So the
MMIO emulation implemented for non-TDX VMs is not possible for TDX guests.
In TDX the MMIO regions are instead configured by the VMM to trigger a #VE
exception in the guest. The #VE handler is supposed to emulate the MMIO
instruction inside the guest and convert it into a TDVMCALL with the
leaf #VE.RequestMMIO, which has the same value as EXIT_REASON_EPT_VIOLATION.

The requested MMIO address must be in shared GPA space. The shared bit
is stripped after the check because the existing code for MMIO emulation
is not aware of the shared bit.
The MMIO GPA shouldn't have a valid memslot, and the attribute of the GPA
should be shared. KVM could do these checks before exiting to userspace;
however, even if KVM did, there would still be race conditions between the
checks in KVM and the emulation of the MMIO access in userspace, due to a
memslot hotplug or a memory attribute conversion. If userspace doesn't
check the attribute of the GPA and the attribute happens to be private, it
will not pose a security risk or cause an MCE, but it can lead to another
issue. E.g., in QEMU, treating a GPA with the private attribute as shared
when it falls within RAM's range can result in extra memory consumption
during emulation of the access to the HVA of the GPA. There are two
options: 1) do the checks in both KVM and userspace, or 2) do the checks
only in userspace (e.g. QEMU). This patch chooses option 2, i.e. KVM omits
the memslot and attribute checks and expects userspace to do them.
Similar to normal MMIO emulation, try to handle the MMIO in the kernel
first; if the kernel can't handle it, forward the request to userspace.
Export the symbols needed for MMIO handling.

Fragment handling is not needed for TDX PV MMIO because the GPA is
provided; if an MMIO access crosses a page boundary, it is contiguous in
GPA. Also, the size is limited to 1, 2, 4, or 8 bytes, so no further split
is needed. Allow cross-page access because no extra handling is needed
after checking that both the start and end GPAs are shared.
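The shared-GPA check and shared-bit stripping described above can be modeled as below. This is a simplified sketch under an assumed configuration: it hardcodes GPA bit 47 as the shared bit, whereas the kernel derives it from kvm_gfn_direct_bits() and the paging level (bit 51 for 5-level paging):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed shared bit for a 4-level TD; the real bit is per-VM. */
#define SHARED_BIT (1ULL << 47)

/*
 * An MMIO request is acceptable only if both the first and the last byte
 * of the access lie in shared GPA space; a cross-page access needs no
 * extra handling beyond this check because the GPA range is contiguous.
 */
static bool mmio_gpa_ok(uint64_t gpa, uint64_t size)
{
	bool start_shared = gpa & SHARED_BIT;
	bool end_shared = (gpa + size - 1) & SHARED_BIT;

	return start_shared && end_shared;
}

/* Strip the shared bit before handing the GPA to shared-bit-unaware code. */
static uint64_t strip_shared_bit(uint64_t gpa)
{
	return gpa & ~SHARED_BIT;
}
```

A private GPA (shared bit clear) fails the check and completes with TDVMCALL_STATUS_INVALID_OPERAND instead of being emulated.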
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
Hypercalls exit to userspace v2:
- Morph PV MMIO hypercall to EXIT_REASON_EPT_MISCONFIG. (Sean)
- Skip setting return code as TDVMCALL_STATUS_SUCCESS, as a result, the
complete_userspace_io() callback for write is not needed. (Sean)
- Remove the code for reading/writing APIC mmio,since TDX guest supports x2APIC only.
- Use vp_enter_args directly instead of helpers.
Hypercalls exit to userspace v1:
- Update the changelog.
- Remove the check of memslot for GPA.
- Allow MMIO access crossing page boundary.
- Move the tracepoint for KVM_TRACE_MMIO_WRITE earlier so the tracepoint handles
the cases both for kernel and userspace. (Isaku)
- Set TDVMCALL return code when back from userspace, which is missing in v19.
- Move fast MMIO write into tdx_mmio_write()
- Check GPA is shared GPA. (Binbin)
- Remove extra check for size > 8u. (Binbin)
- Removed KVM_BUG_ON() in tdx_complete_mmio() and tdx_emulate_mmio()
- Removed vcpu->mmio_needed code since it's not used after removing
KVM_BUG_ON().
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
- Use vt_is_tdx_private_gpa()
---
arch/x86/kvm/vmx/tdx.c | 109 ++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 1 +
virt/kvm/kvm_main.c | 1 +
3 files changed, 110 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f13da28dd4a2..8f3147c6e602 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -849,8 +849,12 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
if (tdvmcall_exit_type(vcpu))
return EXIT_REASON_VMCALL;
- if (tdvmcall_leaf(vcpu) < 0x10000)
+ if (tdvmcall_leaf(vcpu) < 0x10000) {
+ if (tdvmcall_leaf(vcpu) == EXIT_REASON_EPT_VIOLATION)
+ return EXIT_REASON_EPT_MISCONFIG;
+
return tdvmcall_leaf(vcpu);
+ }
break;
default:
break;
@@ -1193,6 +1197,107 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
return ret;
}
+static int tdx_complete_mmio_read(struct kvm_vcpu *vcpu)
+{
+ unsigned long val = 0;
+ gpa_t gpa;
+ int size;
+
+ gpa = vcpu->mmio_fragments[0].gpa;
+ size = vcpu->mmio_fragments[0].len;
+
+ memcpy(&val, vcpu->run->mmio.data, size);
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+ unsigned long val)
+{
+ if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+ trace_kvm_fast_mmio(gpa);
+ return 0;
+ }
+
+ trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+ if (kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+ unsigned long val;
+
+ if (kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ int size, write, r;
+ unsigned long val;
+ gpa_t gpa;
+
+ size = tdx->vp_enter_args.r12;
+ write = tdx->vp_enter_args.r13;
+ gpa = tdx->vp_enter_args.r14;
+ val = write ? tdx->vp_enter_args.r15 : 0;
+
+ if (size != 1 && size != 2 && size != 4 && size != 8)
+ goto error;
+ if (write != 0 && write != 1)
+ goto error;
+
+ /*
+ * TDG.VP.VMCALL<MMIO> allows only shared GPA, it makes no sense to
+ * do MMIO emulation for private GPA.
+ */
+ if (vt_is_tdx_private_gpa(vcpu->kvm, gpa) ||
+ vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))
+ goto error;
+
+ gpa = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
+
+ if (write)
+ r = tdx_mmio_write(vcpu, gpa, size, val);
+ else
+ r = tdx_mmio_read(vcpu, gpa, size);
+ if (!r)
+ /* Kernel completed device emulation. */
+ return 1;
+
+ /* Request the device emulation to userspace device model. */
+ vcpu->mmio_is_write = write;
+ if (!write)
+ vcpu->arch.complete_userspace_io = tdx_complete_mmio_read;
+
+ vcpu->run->mmio.phys_addr = gpa;
+ vcpu->run->mmio.len = size;
+ vcpu->run->mmio.is_write = write;
+ vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+ if (write) {
+ memcpy(vcpu->run->mmio.data, &val, size);
+ } else {
+ vcpu->mmio_fragments[0].gpa = gpa;
+ vcpu->mmio_fragments[0].len = size;
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+ }
+ return 0;
+
+error:
+ tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
switch (tdvmcall_leaf(vcpu)) {
@@ -1546,6 +1651,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return tdx_emulate_vmcall(vcpu);
case EXIT_REASON_IO_INSTRUCTION:
return tdx_emulate_io(vcpu);
+ case EXIT_REASON_EPT_MISCONFIG:
+ return tdx_emulate_mmio(vcpu);
default:
break;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29f33f7c9da9..a41d57ba4a86 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14010,6 +14010,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 836e0c69f53b..783683d04939 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5835,6 +5835,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
r = __kvm_io_bus_read(vcpu, bus, &range, val);
return r < 0 ? r : 0;
}
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
int len, struct kvm_io_device *dev)
--
2.46.0
* Re: [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs
2025-02-11 2:54 ` [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs Binbin Wu
@ 2025-02-11 5:05 ` Huang, Kai
2025-02-11 10:23 ` Xiaoyao Li
1 sibling, 0 replies; 39+ messages in thread
From: Huang, Kai @ 2025-02-11 5:05 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
binbin.wu@linux.intel.com
Cc: Gao, Chao, Edgecombe, Rick P, Li, Xiaoyao, Lindgren, Tony,
Hunter, Adrian, Chatre, Reinette, linux-kernel@vger.kernel.org,
Zhao, Yan Y, Yamahata, Isaku
On Tue, 2025-02-11 at 10:54 +0800, Binbin Wu wrote:
> Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
> in via the macro.
>
> When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers of
> TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
> hypercall ABI _before_ ____kvm_emulate_hypercall() gets called. Therefore,
> ____kvm_emulate_hypercall() can just read registers internally based on KVM
> hypercall ABI, and those registers can be removed from the
> __kvm_emulate_hypercall() macro.
>
> Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
> remove it from the __kvm_emulate_hypercall() macro as well.
>
> No functional change intended.
>
>
Reviewed-by: Kai Huang <kai.huang@intel.com>
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-11 2:54 ` [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA> Binbin Wu
@ 2025-02-11 6:54 ` Yan Zhao
2025-02-11 8:11 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Yan Zhao @ 2025-02-11 6:54 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
chao.gao, linux-kernel
On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + if (vcpu->run->hypercall.ret) {
> + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> + tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> + return 1;
> + }
> +
> + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
> + if (tdx->map_gpa_next >= tdx->map_gpa_end)
> + return 1;
> +
> + /*
> + * Stop processing the remaining part if there is pending interrupt.
> + * Skip checking pending virtual interrupt (reflected by
> + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
> + * if guest disabled interrupt, it's OK not returning back to guest
> + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
> + * immediately after STI or MOV/POP SS.
> + */
> + if (pi_has_pending_interrupt(vcpu) ||
> + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
Should here also use "kvm_vcpu_has_events()" to replace
"pi_has_pending_interrupt(vcpu) ||
kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
suggested at [1]?
[1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com
> + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
> + tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> + return 1;
> + }
> +
> + __tdx_map_gpa(tdx);
> + return 0;
> +}
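The chunked retry flow quoted above can be modeled in userspace. This is an illustrative sketch, not the kernel code: the struct fields, the chunk size, and the helper names are hypothetical stand-ins for `tdx->map_gpa_next`/`map_gpa_end`. The point is that KVM converts one chunk per pass and returns a RETRY status, with the resume GPA that would go in r11, when an interrupt is pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the chunked MapGPA flow: convert
 * TDX_MAP_GPA_MAX_LEN bytes per pass, and bail out with MAP_RETRY
 * (resume point left in st->next) when an interrupt is pending.
 */
#define MAP_GPA_MAX_LEN (2ULL * 1024 * 1024)	/* illustrative chunk size */

enum map_status { MAP_DONE, MAP_RETRY };

struct map_state {
	uint64_t next;	/* models tdx->map_gpa_next */
	uint64_t end;	/* models tdx->map_gpa_end */
};

/* Convert chunks until done, or stop with the resume GPA in st->next. */
static enum map_status map_gpa_run(struct map_state *st, bool intr_pending)
{
	while (st->next + MAP_GPA_MAX_LEN < st->end) {
		st->next += MAP_GPA_MAX_LEN;	/* one chunk converted */
		if (intr_pending)
			return MAP_RETRY;	/* guest retries from st->next */
	}
	st->next = st->end;
	return MAP_DONE;
}
```

A guest that sees the RETRY status simply re-issues MapGPA starting from the GPA reported back to it, which is why the completion handler above stores `map_gpa_next` into r11 before returning.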
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-11 6:54 ` Yan Zhao
@ 2025-02-11 8:11 ` Binbin Wu
2025-02-11 8:59 ` Chao Gao
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 8:11 UTC (permalink / raw)
To: Yan Zhao, seanjc
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
chao.gao, linux-kernel
On 2/11/2025 2:54 PM, Yan Zhao wrote:
> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
>> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>> +{
>> + struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>> + if (vcpu->run->hypercall.ret) {
>> + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
>> + tdx->vp_enter_args.r11 = tdx->map_gpa_next;
>> + return 1;
>> + }
>> +
>> + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
>> + if (tdx->map_gpa_next >= tdx->map_gpa_end)
>> + return 1;
>> +
>> + /*
>> + * Stop processing the remaining part if there is pending interrupt.
>> + * Skip checking pending virtual interrupt (reflected by
>> + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
>> + * if guest disabled interrupt, it's OK not returning back to guest
>> + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
>> + * immediately after STI or MOV/POP SS.
>> + */
>> + if (pi_has_pending_interrupt(vcpu) ||
>> + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
> Should here also use "kvm_vcpu_has_events()" to replace
> "pi_has_pending_interrupt(vcpu) ||
> kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
> suggested at [1]?
>
> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com
For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt
via a SEAM call. As noted in the comments, the check for pending virtual
interrupt is intentionally skipped to save the SEAM call. Additionally,
unnecessarily returning to the guest has a performance impact.
But according to the discussion thread above, it seems that Sean prioritized
code readability (i.e. reuse the common helper to make TDX code less special)
over performance considerations?
>
>> + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
>> + tdx->vp_enter_args.r11 = tdx->map_gpa_next;
>> + return 1;
>> + }
>> +
>> + __tdx_map_gpa(tdx);
>> + return 0;
>> +}
>
* Re: [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)
2025-02-11 2:54 ` [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL) Binbin Wu
@ 2025-02-11 8:41 ` Chao Gao
2025-02-11 9:08 ` Binbin Wu
2025-02-11 23:46 ` Sean Christopherson
0 siblings, 2 replies; 39+ messages in thread
From: Chao Gao @ 2025-02-11 8:41 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, linux-kernel
>+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
>+{
>+ return to_tdx(vcpu)->vp_enter_args.r10;
>+}
please add a newline here.
>+static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
>+{
>+ return to_tdx(vcpu)->vp_enter_args.r11;
>+}
..
>+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
>+ long val)
>+{
>+ to_tdx(vcpu)->vp_enter_args.r10 = val;
>+}
ditto.
>+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
>+ unsigned long val)
>+{
>+ to_tdx(vcpu)->vp_enter_args.r11 = val;
>+}
>+
> static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> {
> tdx_guest_keyid_free(kvm_tdx->hkid);
>@@ -810,6 +829,7 @@ static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
> static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
> {
> struct vcpu_tdx *tdx = to_tdx(vcpu);
>+ u32 exit_reason;
>
> switch (tdx->vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) {
> case TDX_SUCCESS:
>@@ -822,7 +842,21 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
> return -1u;
> }
>
>- return tdx->vp_enter_ret;
>+ exit_reason = tdx->vp_enter_ret;
>+
>+ switch (exit_reason) {
>+ case EXIT_REASON_TDCALL:
>+ if (tdvmcall_exit_type(vcpu))
>+ return EXIT_REASON_VMCALL;
>+
>+ if (tdvmcall_leaf(vcpu) < 0x10000)
Can you add a comment for the hard-coded 0x10000?
I am wondering what would happen if the guest tries to make a tdvmcall with
leaf=0 or leaf=1 to mislead KVM into calling the NMI/interrupt handling
routine. Would it trigger the unknown NMI warning or effectively inject an
interrupt into the host?
I think we should do the conversion for leaves that are defined in the current
GHCI spec.
>+ return tdvmcall_leaf(vcpu);
>+ break;
>+ default:
>+ break;
>+ }
>+
>+ return exit_reason;
> }
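The fix this thread converges on can be sketched in userspace: convert only the leaves defined in the GHCI and supported by KVM, rather than trusting anything below 0x10000. The exit-reason constants below follow the VMX basic exit-reason numbering; the helper itself is illustrative, not the kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* VMX basic exit-reason values (Intel SDM numbering). */
#define EXIT_REASON_CPUID		10
#define EXIT_REASON_HLT			12
#define EXIT_REASON_IO_INSTRUCTION	30
#define EXIT_REASON_MSR_READ		31
#define EXIT_REASON_MSR_WRITE		32
#define EXIT_REASON_EPT_VIOLATION	48
#define EXIT_REASON_EPT_MISCONFIG	49
#define EXIT_REASON_TDCALL		77

/*
 * Illustrative sketch: only whitelisted GHCI leaves are converted to
 * VMX exit reasons; everything else stays an unhandled TDCALL.
 */
static uint32_t tdvmcall_leaf_to_exit_reason(uint64_t leaf)
{
	switch (leaf) {
	case EXIT_REASON_EPT_VIOLATION:
		/* Guest-initiated MMIO is surfaced as EPT_MISCONFIG. */
		return EXIT_REASON_EPT_MISCONFIG;
	case EXIT_REASON_CPUID:
	case EXIT_REASON_HLT:
	case EXIT_REASON_IO_INSTRUCTION:
	case EXIT_REASON_MSR_READ:
	case EXIT_REASON_MSR_WRITE:
		return (uint32_t)leaf;
	default:
		/*
		 * Leaves 0/1 (exception-NMI / external interrupt) must NOT
		 * be converted, or a malicious guest could steer KVM into
		 * its NMI/IRQ handling paths.
		 */
		return EXIT_REASON_TDCALL;
	}
}
```

This mirrors the concern above: with a bare `< 0x10000` check, a guest-supplied leaf of 0 or 1 would masquerade as a hardware NMI or interrupt exit.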
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-11 8:11 ` Binbin Wu
@ 2025-02-11 8:59 ` Chao Gao
2025-02-12 0:46 ` Sean Christopherson
0 siblings, 1 reply; 39+ messages in thread
From: Chao Gao @ 2025-02-11 8:59 UTC (permalink / raw)
To: Binbin Wu
Cc: Yan Zhao, seanjc, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote:
>
>
>On 2/11/2025 2:54 PM, Yan Zhao wrote:
>> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
>> > +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>> > +{
>> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
>> > +
>> > + if (vcpu->run->hypercall.ret) {
>> > + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
>> > + tdx->vp_enter_args.r11 = tdx->map_gpa_next;
>> > + return 1;
>> > + }
>> > +
>> > + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
>> > + if (tdx->map_gpa_next >= tdx->map_gpa_end)
>> > + return 1;
>> > +
>> > + /*
>> > + * Stop processing the remaining part if there is pending interrupt.
>> > + * Skip checking pending virtual interrupt (reflected by
>> > + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
>> > + * if guest disabled interrupt, it's OK not returning back to guest
>> > + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
>> > + * immediately after STI or MOV/POP SS.
>> > + */
>> > + if (pi_has_pending_interrupt(vcpu) ||
>> > + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
>> Should here also use "kvm_vcpu_has_events()" to replace
>> "pi_has_pending_interrupt(vcpu) ||
>> kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
>> suggested at [1]?
>>
>> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com
>
>For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt
>via a SEAM call. As noted in the comments, the check for pending virtual
>interrupt is intentionally skipped to save the SEAM call. Additionally,
>unnecessarily returning to the guest has a performance impact.
>
>But according to the discussion thread above, it seems that Sean prioritized
>code readability (i.e. reuse the common helper to make TDX code less special)
>over performance considerations?
To mitigate the performance impact, we can cache the "pending interrupt" status
on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for
normal VMs. This optimization can be done in a separate patch or series.
And, future TDX modules will report the status via registers.
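The caching idea floated here can be sketched as follows; all names are hypothetical. The expensive module read happens at most once per exit, and the cached value would be invalidated on the next VM entry (invalidation is elided from this sketch).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU cache for the pending-interrupt status. */
struct vcpu_intr_cache {
	bool valid;		/* populated since the last VM entry? */
	bool pending;		/* cached status */
	unsigned int seamcalls;	/* counts expensive module reads */
};

/* Stand-in for the SEAMCALL that reports VMXIP via vCPU state details. */
static bool read_intr_from_module(struct vcpu_intr_cache *c)
{
	c->seamcalls++;
	return true;
}

/* Read once per exit, reuse the cached value afterwards. */
static bool intr_pending(struct vcpu_intr_cache *c)
{
	if (!c->valid) {
		c->pending = read_intr_from_module(c);
		c->valid = true;
	}
	return c->pending;
}
```

This is the same shape as the guest RSP/RBP register cache mentioned above: a validity bit gates the expensive read, and repeated queries within one exit are free.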
* Re: [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)
2025-02-11 8:41 ` Chao Gao
@ 2025-02-11 9:08 ` Binbin Wu
2025-02-11 23:46 ` Sean Christopherson
1 sibling, 0 replies; 39+ messages in thread
From: Binbin Wu @ 2025-02-11 9:08 UTC (permalink / raw)
To: Chao Gao
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, linux-kernel
On 2/11/2025 4:41 PM, Chao Gao wrote:
>> +static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
>> +{
>> + return to_tdx(vcpu)->vp_enter_args.r10;
>> +}
> please add a newline here.
>
>> +static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
>> +{
>> + return to_tdx(vcpu)->vp_enter_args.r11;
>> +}
> ..
>
>> +static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
>> + long val)
>> +{
>> + to_tdx(vcpu)->vp_enter_args.r10 = val;
>> +}
> ditto.
>
>> +static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
>> + unsigned long val)
>> +{
>> + to_tdx(vcpu)->vp_enter_args.r11 = val;
>> +}
>> +
>> static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
>> {
>> tdx_guest_keyid_free(kvm_tdx->hkid);
>> @@ -810,6 +829,7 @@ static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
>> static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_tdx *tdx = to_tdx(vcpu);
>> + u32 exit_reason;
>>
>> switch (tdx->vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) {
>> case TDX_SUCCESS:
>> @@ -822,7 +842,21 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
>> return -1u;
>> }
>>
>> - return tdx->vp_enter_ret;
>> + exit_reason = tdx->vp_enter_ret;
>> +
>> + switch (exit_reason) {
>> + case EXIT_REASON_TDCALL:
>> + if (tdvmcall_exit_type(vcpu))
>> + return EXIT_REASON_VMCALL;
>> +
>> + if (tdvmcall_leaf(vcpu) < 0x10000)
> Can you add a comment for the hard-coded 0x10000?
>
> I am wondering what would happen if the guest tries to make a tdvmcall with
> leaf=0 or leaf=1 to mislead KVM into calling the NMI/interrupt handling
> routine. Would it trigger the unknown NMI warning or effectively inject an
> interrupt into the host?
Oh, yes, it's possible.
>
> I think we should do the conversion for leaves that are defined in the current
> GHCI spec.
Yes, it should be limited to the supported leaves defined in the GHCI.
Thanks for pointing it out!
>
>> + return tdvmcall_leaf(vcpu);
>> + break;
>> + default:
>> + break;
>> + }
>> +
>> + return exit_reason;
>> }
* Re: [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs
2025-02-11 2:54 ` [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs Binbin Wu
2025-02-11 5:05 ` Huang, Kai
@ 2025-02-11 10:23 ` Xiaoyao Li
2025-02-12 1:32 ` Binbin Wu
1 sibling, 1 reply; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-11 10:23 UTC (permalink / raw)
To: Binbin Wu, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao, linux-kernel
On 2/11/2025 10:54 AM, Binbin Wu wrote:
> Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
> in via the macro.
>
> When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers of
> TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
> hypercall ABI _before_ ____kvm_emulate_hypercall() gets called. Therefore,
> ____kvm_emulate_hypercall() can just read registers internally based on KVM
> hypercall ABI, and those registers can be removed from the
> __kvm_emulate_hypercall() macro.
>
> Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
> remove it from the __kvm_emulate_hypercall() macro as well.
After this patch, __kvm_emulate_hypercall() becomes superfluous.
we can just put the logic to call the "complete_hypercall" into
____kvm_emulate_hypercall() and rename it to __kvm_emulate_hypercall()
* Re: [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)
2025-02-11 8:41 ` Chao Gao
2025-02-11 9:08 ` Binbin Wu
@ 2025-02-11 23:46 ` Sean Christopherson
2025-02-12 2:21 ` Binbin Wu
1 sibling, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2025-02-11 23:46 UTC (permalink / raw)
To: Chao Gao
Cc: Binbin Wu, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, yan.y.zhao, linux-kernel
On Tue, Feb 11, 2025, Chao Gao wrote:
> >@@ -810,6 +829,7 @@ static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
> > static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
> > {
> > struct vcpu_tdx *tdx = to_tdx(vcpu);
> >+ u32 exit_reason;
> >
> > switch (tdx->vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) {
> > case TDX_SUCCESS:
> >@@ -822,7 +842,21 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
> > return -1u;
> > }
> >
> >- return tdx->vp_enter_ret;
> >+ exit_reason = tdx->vp_enter_ret;
> >+
> >+ switch (exit_reason) {
> >+ case EXIT_REASON_TDCALL:
> >+ if (tdvmcall_exit_type(vcpu))
> >+ return EXIT_REASON_VMCALL;
> >+
> >+ if (tdvmcall_leaf(vcpu) < 0x10000)
>
> Can you add a comment for the hard-coded 0x10000?
Or better yet, a #define of some kind (with a comment ;-) ).
* Re: [PATCH v2 4/8] KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL
2025-02-11 2:54 ` [PATCH v2 4/8] KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL Binbin Wu
@ 2025-02-11 23:48 ` Sean Christopherson
0 siblings, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2025-02-11 23:48 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Tue, Feb 11, 2025, Binbin Wu wrote:
> ---
> arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 420ee492e919..daa49f2ee2b3 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -964,6 +964,23 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> return tdx_exit_handlers_fastpath(vcpu);
> }
>
> +static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> +{
> + tdvmcall_set_return_code(vcpu, vcpu->run->hypercall.ret);
> + return 1;
> +}
> +
> +static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
> +{
> + kvm_rax_write(vcpu, to_tdx(vcpu)->vp_enter_args.r10);
> + kvm_rbx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r11);
> + kvm_rcx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r12);
> + kvm_rdx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r13);
> + kvm_rsi_write(vcpu, to_tdx(vcpu)->vp_enter_args.r14);
> +
> + return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit);
Thanks for persevering through all the ideas and churn, I like how this turned
out!
* Re: [PATCH v2 6/8] KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
2025-02-11 2:54 ` [PATCH v2 6/8] KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError> Binbin Wu
@ 2025-02-12 0:18 ` Sean Christopherson
2025-02-12 5:37 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2025-02-12 0:18 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Tue, Feb 11, 2025, Binbin Wu wrote:
> +static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> + u64 reg_mask = tdx->vp_enter_args.rcx;
> + u64 *opt_regs;
> +
> + /*
> + * Skip sanity checks and let userspace decide what to do if sanity
> + * checks fail.
> + */
> + vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
> + vcpu->run->system_event.type = KVM_SYSTEM_EVENT_TDX_FATAL;
> + /* Error codes. */
> + vcpu->run->system_event.data[0] = tdx->vp_enter_args.r12;
> + /* GPA of additional information page. */
> + vcpu->run->system_event.data[1] = tdx->vp_enter_args.r13;
> + /* Information passed via registers (up to 64 bytes). */
> + opt_regs = &vcpu->run->system_event.data[2];
> +
> +#define COPY_REG(REG, MASK) \
> + do { \
> + if (reg_mask & MASK) { \
Based on past experience with conditionally filling kvm_run fields, I think KVM
should copy all registers and let userspace sort out the reg_mask. Unless the
guest passes an ASCII byte stream exactly as the GHCI suggests, the information
is quite useless because userspace doesn't have reg_mask and so can't know what's
in data[4], data[5], etc... And I won't be the least bit surprised if guests
deviate from the GHCI.
> + *opt_regs = tdx->vp_enter_args.REG; \
> + opt_regs++; \
> + } \
> + } while (0)
> +
> + /* The order is defined in GHCI. */
Assuming I haven't missed something, to hell with the GCHI, just dump *all*
registers, sorted by their index (ascending). Including RAX (TDCALL), RBP, and
RSP.
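Sean's concern can be illustrated with a userspace model (names hypothetical): with conditional packing, the meaning of each data[] slot depends on a reg_mask that userspace never sees, whereas copying every register gives a fixed, self-describing layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_REGS 4	/* small stand-in for the GHCI register set */

/* Conditional packing: which register lands in which slot shifts
 * with the mask, so a consumer without the mask can't decode it. */
static int pack_conditional(uint64_t *out, const uint64_t *regs, uint64_t mask)
{
	int n = 0;

	for (int i = 0; i < NR_REGS; i++)
		if (mask & (1ULL << i))
			out[n++] = regs[i];
	return n;
}

/* Fixed layout: register i always lands in slot i; no mask needed. */
static void pack_all(uint64_t *out, const uint64_t *regs)
{
	memcpy(out, regs, NR_REGS * sizeof(*regs));
}
```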
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-11 8:59 ` Chao Gao
@ 2025-02-12 0:46 ` Sean Christopherson
2025-02-12 5:16 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2025-02-12 0:46 UTC (permalink / raw)
To: Chao Gao
Cc: Binbin Wu, Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On Tue, Feb 11, 2025, Chao Gao wrote:
> On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote:
> >
> >
> >On 2/11/2025 2:54 PM, Yan Zhao wrote:
> >> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
> >> > +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
> >> > +{
> >> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> >> > +
> >> > + if (vcpu->run->hypercall.ret) {
> >> > + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> >> > + tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> >> > + return 1;
> >> > + }
> >> > +
> >> > + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
> >> > + if (tdx->map_gpa_next >= tdx->map_gpa_end)
> >> > + return 1;
> >> > +
> >> > + /*
> >> > + * Stop processing the remaining part if there is pending interrupt.
> >> > + * Skip checking pending virtual interrupt (reflected by
> >> > + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
> >> > + * if guest disabled interrupt, it's OK not returning back to guest
> >> > + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
> >> > + * immediately after STI or MOV/POP SS.
> >> > + */
> >> > + if (pi_has_pending_interrupt(vcpu) ||
> >> > + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
> >> Should here also use "kvm_vcpu_has_events()" to replace
> >> "pi_has_pending_interrupt(vcpu) ||
> >> kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
> >> suggested at [1]?
> >>
> >> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com
> >
> >For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt
> >via a SEAM call. As noted in the comments, the check for pending virtual
> >interrupt is intentionally skipped to save the SEAM call. Additionally,
Drat, I had a whole response typed up and then discovered the implementation of
tdx_protected_apic_has_interrupt() had changed. But I think the basic gist
still holds.
The new version:
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
- return pi_has_pending_interrupt(vcpu);
+ u64 vcpu_state_details;
+
+ if (pi_has_pending_interrupt(vcpu))
+ return true;
+
+ vcpu_state_details =
+ td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
+
+ return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
}
is much better than the old:
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
- return pi_has_pending_interrupt(vcpu);
+ bool ret = pi_has_pending_interrupt(vcpu);
+ union tdx_vcpu_state_details details;
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+ return true;
+
+ if (tdx->interrupt_disabled_hlt)
+ return false;
+
+ details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
+ return !!details.vmxip;
}
because assuming the vCPU has an interrupt if it's not HALTED is all kinds of
wrong.
However, checking VMXIP for the !HLT case is also wrong. And undesirable, as
evidenced by both this path and the EPT violation retry path wanted to avoid
checking VMXIP.
Except for the guest being stupid (non-HLT TDCALL in an interrupt shadow), having
an interrupt in RVI that is fully unmasked will be extremely rare. Actually,
outside of an interrupt shadow, I don't think it's even possible. I can't think
of any CPU flows that modify RVI in the middle of instruction execution. I.e. if
RVI is non-zero, then either the interrupt has been pending since before the
TDVMCALL, or the TDVMCALL is in an STI/SS shadow. And if the interrupt was
pending before TDVMCALL, then it _must_ be blocked, otherwise the interrupt
would have been serviced at the instruction boundary.
I am completely comfortable saying that KVM doesn't care about STI/SS shadows
outside of the HALTED case, and so unless I'm missing something, I think it makes
sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
case, because it's impossible to know if the interrupt is actually unmasked, and
statistically it's far, far more likely that it _is_ masked.
> >unnecessarily returning back to guest will has performance impact.
> >
> >But according to the discussion thread above, it seems that Sean prioritized
> >code readability (i.e. reuse the common helper to make TDX code less special)
> >over performance considerations?
>
> To mitigate the performance impact, we can cache the "pending interrupt" status
> on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for
> normal VMs. This optimization can be done in a separate patch or series.
>
> And, future TDX modules will report the status via registers.
* Re: [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs
2025-02-11 10:23 ` Xiaoyao Li
@ 2025-02-12 1:32 ` Binbin Wu
2025-02-12 3:12 ` Xiaoyao Li
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-12 1:32 UTC (permalink / raw)
To: Xiaoyao Li, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao, linux-kernel
On 2/11/2025 6:23 PM, Xiaoyao Li wrote:
> On 2/11/2025 10:54 AM, Binbin Wu wrote:
>> Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
>> in via the macro.
>>
>> When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers of
>> TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
>> hypercall ABI _before_ ____kvm_emulate_hypercall() gets called. Therefore,
>> ____kvm_emulate_hypercall() can just read registers internally based on KVM
>> hypercall ABI, and those registers can be removed from the
>> __kvm_emulate_hypercall() macro.
>>
>> Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
>> remove it from the __kvm_emulate_hypercall() macro as well.
>
> After this patch, __kvm_emulate_hypercall() becomes superfluous.
> we can just put the logic to call the "complete_hypercall" into ____kvm_emulate_hypercall() and rename it to __kvm_emulate_hypercall()
>
>
According to the commit message of
"KVM: x86: Refactor __kvm_emulate_hypercall() into a macro":
"Rework __kvm_emulate_hypercall() into a macro so that completion of
hypercalls that don't exit to userspace use direct function calls to the
completion helper, i.e. don't trigger a retpoline when RETPOLINE=y."
So I kept the macro.
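The retpoline rationale can be sketched in plain C; this is an illustration, not the kernel macro. When the completion helper is named at the macro call site, the compiler emits a direct (and inlinable) call, while routing the same callback through a function parameter produces an indirect call that costs a retpoline thunk when that mitigation is enabled.

```c
#include <assert.h>

static int complete_fast(int ret)
{
	return ret + 1;
}

/*
 * Macro version: 'cplt' is expanded textually, so the call below is a
 * direct call the compiler can inline. (Statement expressions are a
 * GNU C extension, as used throughout the kernel.)
 */
#define emulate_hypercall(val, cplt)	\
({					\
	int __ret = (val);		\
	cplt(__ret);			\
})

/*
 * Function version: the same logic through a pointer is an indirect
 * call, i.e. a retpoline when RETPOLINE=y.
 */
static int emulate_hypercall_fn(int val, int (*cplt)(int))
{
	return cplt(val);
}
```

Both forms compute the same result; only the generated call sequence differs, which is the point of keeping `__kvm_emulate_hypercall()` as a macro.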
* Re: [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)
2025-02-11 23:46 ` Sean Christopherson
@ 2025-02-12 2:21 ` Binbin Wu
0 siblings, 0 replies; 39+ messages in thread
From: Binbin Wu @ 2025-02-12 2:21 UTC (permalink / raw)
To: Sean Christopherson, Chao Gao
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, linux-kernel
On 2/12/2025 7:46 AM, Sean Christopherson wrote:
> On Tue, Feb 11, 2025, Chao Gao wrote:
>>> @@ -810,6 +829,7 @@ static bool tdx_guest_state_is_invalid(struct kvm_vcpu *vcpu)
>>> static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
>>> {
>>> struct vcpu_tdx *tdx = to_tdx(vcpu);
>>> + u32 exit_reason;
>>>
>>> switch (tdx->vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) {
>>> case TDX_SUCCESS:
>>> @@ -822,7 +842,21 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
>>> return -1u;
>>> }
>>>
>>> - return tdx->vp_enter_ret;
>>> + exit_reason = tdx->vp_enter_ret;
>>> +
>>> + switch (exit_reason) {
>>> + case EXIT_REASON_TDCALL:
>>> + if (tdvmcall_exit_type(vcpu))
>>> + return EXIT_REASON_VMCALL;
>>> +
>>> + if (tdvmcall_leaf(vcpu) < 0x10000)
>> Can you add a comment for the hard-coded 0x10000?
> Or better yet, a #define of some kind (with a comment ;-) ).
As Chao pointed out, we should convert only the leaves that are defined in the
GHCI spec and supported by KVM. Specific leaf numbers will be used instead of
comparing against 0x10000.
I plan to change it to:
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2b24f50ad0ee..af8276402212 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -920,11 +920,17 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
if (tdvmcall_exit_type(vcpu))
return EXIT_REASON_VMCALL;
- if (tdvmcall_leaf(vcpu) < 0x10000) {
- if (tdvmcall_leaf(vcpu) == EXIT_REASON_EPT_VIOLATION)
+ switch(tdvmcall_leaf(vcpu)) {
+ case EXIT_REASON_EPT_VIOLATION:
return EXIT_REASON_EPT_MISCONFIG;
-
- return tdvmcall_leaf(vcpu);
+ case EXIT_REASON_CPUID:
+ case EXIT_REASON_HLT:
+ case EXIT_REASON_IO_INSTRUCTION:
+ case EXIT_REASON_MSR_READ:
+ case EXIT_REASON_MSR_WRITE:
+ return tdvmcall_leaf(vcpu);
+ default:
+ break;
}
break;
case EXIT_REASON_EPT_MISCONFIG:
* Re: [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall
2025-02-11 2:54 ` [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall Binbin Wu
@ 2025-02-12 2:28 ` Chao Gao
2025-02-12 2:39 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Chao Gao @ 2025-02-12 2:28 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, linux-kernel
>diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>index f13da28dd4a2..8f3147c6e602 100644
>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -849,8 +849,12 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
> if (tdvmcall_exit_type(vcpu))
> return EXIT_REASON_VMCALL;
>
>- if (tdvmcall_leaf(vcpu) < 0x10000)
>+ if (tdvmcall_leaf(vcpu) < 0x10000) {
>+ if (tdvmcall_leaf(vcpu) == EXIT_REASON_EPT_VIOLATION)
>+ return EXIT_REASON_EPT_MISCONFIG;
IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to distinguish
between a genuine EPT MISCONFIG and a morphed one, and handle them differently?
* Re: [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall
2025-02-12 2:28 ` Chao Gao
@ 2025-02-12 2:39 ` Binbin Wu
2025-02-13 21:41 ` Edgecombe, Rick P
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-12 2:39 UTC (permalink / raw)
To: Chao Gao
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, linux-kernel
On 2/12/2025 10:28 AM, Chao Gao wrote:
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index f13da28dd4a2..8f3147c6e602 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -849,8 +849,12 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
>> if (tdvmcall_exit_type(vcpu))
>> return EXIT_REASON_VMCALL;
>>
>> - if (tdvmcall_leaf(vcpu) < 0x10000)
>> + if (tdvmcall_leaf(vcpu) < 0x10000) {
>> + if (tdvmcall_leaf(vcpu) == EXIT_REASON_EPT_VIOLATION)
>> + return EXIT_REASON_EPT_MISCONFIG;
> IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to distinguish
> between a genuine EPT MISCONFIG and a morphed one, and handle them differently?
It will be handled separately, which will be in the last section of the KVM
basic support. But the v2 of "the rest" section is on hold because there is
a discussion related to MTRR MSR handling:
https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
I want to send the v2 of "the rest" section after the MTRR discussion is
finalized.
For the genuine EPT misconfig handling, you can refer to the patch on the
full KVM branch:
https://github.com/intel/tdx/commit/e576682ac586f994bf54eb11b357f3e835d3c042
* Re: [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs
2025-02-12 1:32 ` Binbin Wu
@ 2025-02-12 3:12 ` Xiaoyao Li
0 siblings, 0 replies; 39+ messages in thread
From: Xiaoyao Li @ 2025-02-12 3:12 UTC (permalink / raw)
To: Binbin Wu, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao, linux-kernel
On 2/12/2025 9:32 AM, Binbin Wu wrote:
>
>
> On 2/11/2025 6:23 PM, Xiaoyao Li wrote:
>> On 2/11/2025 10:54 AM, Binbin Wu wrote:
>>> Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
>>> in via the macro.
>>>
>>> When emulating KVM hypercalls via TDVMCALL, TDX will marshall
>>> registers of
>>> TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
>>> hypercall ABI _before_ ____kvm_emulate_hypercall() gets called.
>>> Therefore,
>>> ____kvm_emulate_hypercall() can just read registers internally based
>>> on KVM
>>> hypercall ABI, and those registers can be removed from the
>>> __kvm_emulate_hypercall() macro.
>>>
>>> Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
>>> remove it from the __kvm_emulate_hypercall() macro as well.
>>
>> After this patch, __kvm_emulate_hypercall() becomes superfluous.
>> We can just put the logic to call the "complete_hypercall" into
>> ____kvm_emulate_hypercall() and rename it to __kvm_emulate_hypercall().
>>
>>
> According to the commit message of
> "KVM: x86: Refactor __kvm_emulate_hypercall() into a macro":
> "Rework __kvm_emulate_hypercall() into a macro so that completion of
> hypercalls that don't exit to userspace use direct function calls to the
> completion helper, i.e. don't trigger a retpoline when RETPOLINE=y."
I see.
I thought the purpose of introducing the macro was for TDX usage. My
fault for not checking the commit message of that change.
It makes sense for the retpoline reason.
> So I kept the macro.
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-12 0:46 ` Sean Christopherson
@ 2025-02-12 5:16 ` Binbin Wu
2025-02-12 18:56 ` Sean Christopherson
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-12 5:16 UTC (permalink / raw)
To: Sean Christopherson, Chao Gao
Cc: Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On 2/12/2025 8:46 AM, Sean Christopherson wrote:
> On Tue, Feb 11, 2025, Chao Gao wrote:
>> On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote:
>>>
>>> On 2/11/2025 2:54 PM, Yan Zhao wrote:
>>>> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
>>>>> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> + struct vcpu_tdx *tdx = to_tdx(vcpu);
>>>>> +
>>>>> + if (vcpu->run->hypercall.ret) {
>>>>> + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
>>>>> + tdx->vp_enter_args.r11 = tdx->map_gpa_next;
>>>>> + return 1;
>>>>> + }
>>>>> +
>>>>> + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
>>>>> + if (tdx->map_gpa_next >= tdx->map_gpa_end)
>>>>> + return 1;
>>>>> +
>>>>> + /*
>>>>> + * Stop processing the remaining part if there is pending interrupt.
>>>>> + * Skip checking pending virtual interrupt (reflected by
>>>>> + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
>>>>> + * if guest disabled interrupt, it's OK not returning back to guest
>>>>> + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
>>>>> + * immediately after STI or MOV/POP SS.
>>>>> + */
>>>>> + if (pi_has_pending_interrupt(vcpu) ||
>>>>> + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
>>>> Should here also use "kvm_vcpu_has_events()" to replace
>>>> "pi_has_pending_interrupt(vcpu) ||
>>>> kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
>>>> suggested at [1]?
>>>>
>>>> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com
>>> For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt
>>> via a SEAM call. As noted in the comments, the check for pending virtual
>>> interrupt is intentionally skipped to save the SEAM call. Additionally,
> Drat, I had a whole response typed up and then discovered the implementation of
> tdx_protected_apic_has_interrupt() had changed. But I think the basic gist
> still holds.
>
> The new version:
>
> bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> {
> - return pi_has_pending_interrupt(vcpu);
> + u64 vcpu_state_details;
> +
> + if (pi_has_pending_interrupt(vcpu))
> + return true;
> +
> + vcpu_state_details =
> + td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
> +
> + return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
> }
>
> is much better than the old:
>
> bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> {
> - return pi_has_pending_interrupt(vcpu);
> + bool ret = pi_has_pending_interrupt(vcpu);
> + union tdx_vcpu_state_details details;
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> + return true;
> +
> + if (tdx->interrupt_disabled_hlt)
> + return false;
> +
> + details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
> + return !!details.vmxip;
> }
>
> because assuming the vCPU has an interrupt if it's not HALTED is all kinds of
> wrong.
>
> However, checking VMXIP for the !HLT case is also wrong. And undesirable, as
> evidenced by both this path and the EPT violation retry path wanting to avoid
> checking VMXIP.
>
> Except for the guest being stupid (non-HLT TDCALL in an interrupt shadow), having
> an interrupt in RVI that is fully unmasked will be extremely rare. Actually,
> outside of an interrupt shadow, I don't think it's even possible. I can't think
> of any CPU flows that modify RVI in the middle of instruction execution. I.e. if
> RVI is non-zero, then either the interrupt has been pending since before the
> TDVMCALL, or the TDVMCALL is in an STI/SS shadow. And if the interrupt was
> pending before TDVMCALL, then it _must_ be blocked, otherwise the interrupt
> would have been serviced at the instruction boundary.
Agree.
>
> I am completely comfortable saying that KVM doesn't care about STI/SS shadows
> outside of the HALTED case, and so unless I'm missing something, I think it makes
> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
> case, because it's impossible to know if the interrupt is actually unmasked, and
> statistically it's far, far more likely that it _is_ masked.
OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
And use kvm_vcpu_has_events() to replace the open code in this patch.
Thanks!
>
>>> unnecessarily returning to the guest will have a performance impact.
>>>
>>> But according to the discussion thread above, it seems that Sean prioritized
>>> code readability (i.e. reuse the common helper to make TDX code less special)
>>> over performance considerations?
>> To mitigate the performance impact, we can cache the "pending interrupt" status
>> on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for
>> normal VMs. This optimization can be done in a separate patch or series.
>>
>> And, future TDX modules will report the status via registers.
* Re: [PATCH v2 6/8] KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
2025-02-12 0:18 ` Sean Christopherson
@ 2025-02-12 5:37 ` Binbin Wu
2025-02-12 13:53 ` Sean Christopherson
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-12 5:37 UTC (permalink / raw)
To: Sean Christopherson, xiaoyao.li
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, tony.lindgren, isaku.yamahata, yan.y.zhao,
chao.gao, linux-kernel
On 2/12/2025 8:18 AM, Sean Christopherson wrote:
> On Tue, Feb 11, 2025, Binbin Wu wrote:
>> +static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
>> +{
>> + struct vcpu_tdx *tdx = to_tdx(vcpu);
>> + u64 reg_mask = tdx->vp_enter_args.rcx;
>> + u64 *opt_regs;
>> +
>> + /*
>> + * Skip sanity checks and let userspace decide what to do if sanity
>> + * checks fail.
>> + */
>> + vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
>> + vcpu->run->system_event.type = KVM_SYSTEM_EVENT_TDX_FATAL;
>> + /* Error codes. */
>> + vcpu->run->system_event.data[0] = tdx->vp_enter_args.r12;
>> + /* GPA of additional information page. */
>> + vcpu->run->system_event.data[1] = tdx->vp_enter_args.r13;
>> + /* Information passed via registers (up to 64 bytes). */
>> + opt_regs = &vcpu->run->system_event.data[2];
>> +
>> +#define COPY_REG(REG, MASK) \
>> + do { \
>> + if (reg_mask & MASK) { \
> Based on past experience with conditionally filling kvm_run fields, I think KVM
> should copy all registers and let userspace sort out the reg_mask. Unless the
> guest passes an ASCII byte stream exactly as the GHCI suggests,
Yea, GHCI doesn't enforce it to be ASCII byte stream.
> the information
> is quite useless because userspace doesn't have reg_mask and so can't know what's
> in data[4], data[5], etc... And I won't be the least bit surprised if guests
> deviate from the GHCI.
But it may also confuse userspace if a guest uses a special protocol to pass
information other than an ASCII byte stream.
Anyway, dumping all registers to userspace, so that it has all the
information passed from the guest for parsing, is definitely workable.
>
>> + *opt_regs = tdx->vp_enter_args.REG; \
>> + opt_regs++; \
>> + } \
>> + } while (0)
>> +
>> + /* The order is defined in GHCI. */
> Assuming I haven't missed something, to hell with the GCHI, just dump *all*
> registers, sorted by their index (ascending). Including RAX (TDCALL), RBP, and
> RSP.
>
* Re: [PATCH v2 6/8] KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
2025-02-12 5:37 ` Binbin Wu
@ 2025-02-12 13:53 ` Sean Christopherson
0 siblings, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2025-02-12 13:53 UTC (permalink / raw)
To: Binbin Wu
Cc: xiaoyao.li, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Wed, Feb 12, 2025, Binbin Wu wrote:
> On 2/12/2025 8:18 AM, Sean Christopherson wrote:
> > On Tue, Feb 11, 2025, Binbin Wu wrote:
> > the information is quite useless because userspace doesn't have reg_mask
> > and so can't know what's in data[4], data[5], etc... And I won't be the
> > least bit surprised if guests deviate from the GHCI.
>
> But it may also confuse userspace if a guest uses a special protocol to pass
> information other than an ASCII byte stream.
Yes, but only if userspace and the guest aren't in cahoots. There are use cases
where the VMM and the VM are managed/owned by the same entity.
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-12 5:16 ` Binbin Wu
@ 2025-02-12 18:56 ` Sean Christopherson
2025-02-13 3:23 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2025-02-12 18:56 UTC (permalink / raw)
To: Binbin Wu
Cc: Chao Gao, Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On Wed, Feb 12, 2025, Binbin Wu wrote:
> On 2/12/2025 8:46 AM, Sean Christopherson wrote:
> > I am completely comfortable saying that KVM doesn't care about STI/SS shadows
> > outside of the HALTED case, and so unless I'm missing something, I think it makes
> > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
> > case, because it's impossible to know if the interrupt is actually unmasked, and
> > statistically it's far, far more likely that it _is_ masked.
> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
> And use kvm_vcpu_has_events() to replace the open code in this patch.
Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
is set, and pv_unhalted is only cleared on transitions to KVM_MP_STATE_RUNNABLE.
If the guest initiates a spurious wakeup, pv_unhalted could be left set in
perpetuity.
I _think_ this would work and is generally desirable?
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e77e61d4fbd..435ca2782c3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
kvm_apic_init_sipi_allowed(vcpu))
return true;
- if (vcpu->arch.pv.pv_unhalted)
- return true;
-
if (kvm_is_exception_pending(vcpu))
return true;
@@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
{
- return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
+ return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted ||
+ kvm_vcpu_has_events(vcpu);
}
/* Called within kvm->srcu read side. */
@@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
*/
++vcpu->stat.halt_exits;
if (lapic_in_kernel(vcpu)) {
- if (kvm_vcpu_has_events(vcpu))
+ if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted)
vcpu->arch.pv.pv_unhalted = false;
else
vcpu->arch.mp_state = state;
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-12 18:56 ` Sean Christopherson
@ 2025-02-13 3:23 ` Binbin Wu
2025-02-13 5:11 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-13 3:23 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Gao, Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On 2/13/2025 2:56 AM, Sean Christopherson wrote:
> On Wed, Feb 12, 2025, Binbin Wu wrote:
>> On 2/12/2025 8:46 AM, Sean Christopherson wrote:
>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows
>>> outside of the HALTED case, and so unless I'm missing something, I think it makes
>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
>>> case, because it's impossible to know if the interrupt is actually unmasked, and
>>> statistically it's far, far more likely that it _is_ masked.
>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
>> And use kvm_vcpu_has_events() to replace the open code in this patch.
> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
> is set, and pv_unhalted is only cleared on transitions to KVM_MP_STATE_RUNNABLE.
> If the guest initiates a spurious wakeup, pv_unhalted could be left set in
> perpetuity.
Oh, yes.
KVM_HC_KICK_CPU is allowed in TDX guests.
The change below looks good to me.
One minor issue: when a guest initiates a spurious wakeup, pv_unhalted is
left set. Later, when the guest wants to halt the vCPU, in
__kvm_emulate_halt(), since pv_unhalted is still set, the state will not
transition to KVM_MP_STATE_HALTED.
But I guess it's the guest's responsibility not to initiate spurious
wakeups, and guests need to bear the fact that HLT could fail due to a
previous spurious wakeup?
>
> I _think_ this would work and is generally desirable?
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8e77e61d4fbd..435ca2782c3c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
> kvm_apic_init_sipi_allowed(vcpu))
> return true;
>
> - if (vcpu->arch.pv.pv_unhalted)
> - return true;
> -
> if (kvm_is_exception_pending(vcpu))
> return true;
>
> @@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>
> int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> {
> - return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
> + return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted ||
> + kvm_vcpu_has_events(vcpu);
> }
>
> /* Called within kvm->srcu read side. */
> @@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
> */
> ++vcpu->stat.halt_exits;
> if (lapic_in_kernel(vcpu)) {
> - if (kvm_vcpu_has_events(vcpu))
> + if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted)
> vcpu->arch.pv.pv_unhalted = false;
> else
> vcpu->arch.mp_state = state;
>
>
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-13 3:23 ` Binbin Wu
@ 2025-02-13 5:11 ` Binbin Wu
2025-02-13 15:17 ` Sean Christopherson
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-13 5:11 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Gao, Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On 2/13/2025 11:23 AM, Binbin Wu wrote:
>
>
> On 2/13/2025 2:56 AM, Sean Christopherson wrote:
>> On Wed, Feb 12, 2025, Binbin Wu wrote:
>>> On 2/12/2025 8:46 AM, Sean Christopherson wrote:
>>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows
>>>> outside of the HALTED case, and so unless I'm missing something, I think it makes
>>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
>>>> case, because it's impossible to know if the interrupt is actually unmasked, and
>>>> statistically it's far, far more likely that it _is_ masked.
>>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
>>> And use kvm_vcpu_has_events() to replace the open code in this patch.
>> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
>> is set, and pv_unhalted is only cleared on transitions to KVM_MP_STATE_RUNNABLE.
>> If the guest initiates a spurious wakeup, pv_unhalted could be left set in
>> perpetuity.
>
> Oh, yes.
> KVM_HC_KICK_CPU is allowed in TDX guests.
>
> The change below looks good to me.
>
> One minor issue: when a guest initiates a spurious wakeup, pv_unhalted is
> left set. Later, when the guest wants to halt the vCPU, in
> __kvm_emulate_halt(), since pv_unhalted is still set, the state will not
> transition to KVM_MP_STATE_HALTED.
> But I guess it's the guest's responsibility not to initiate spurious
> wakeups, and guests need to bear the fact that HLT could fail due to a
> previous spurious wakeup?
Just found a patch set for fixing the issue.
https://lore.kernel.org/kvm/20250113200150.487409-1-jmattson@google.com/
>
>>
>> I _think_ this would work and is generally desirable?
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 8e77e61d4fbd..435ca2782c3c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>> kvm_apic_init_sipi_allowed(vcpu))
>> return true;
>> - if (vcpu->arch.pv.pv_unhalted)
>> - return true;
>> -
>> if (kvm_is_exception_pending(vcpu))
>> return true;
>> @@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>> int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>> {
>> - return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
>> + return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted ||
>> + kvm_vcpu_has_events(vcpu);
>> }
>> /* Called within kvm->srcu read side. */
>> @@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
>> */
>> ++vcpu->stat.halt_exits;
>> if (lapic_in_kernel(vcpu)) {
>> - if (kvm_vcpu_has_events(vcpu))
>> + if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted)
>> vcpu->arch.pv.pv_unhalted = false;
>> else
>> vcpu->arch.mp_state = state;
>>
>>
>
>
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-13 5:11 ` Binbin Wu
@ 2025-02-13 15:17 ` Sean Christopherson
2025-02-17 3:41 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2025-02-13 15:17 UTC (permalink / raw)
To: Binbin Wu
Cc: Chao Gao, Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On Thu, Feb 13, 2025, Binbin Wu wrote:
> On 2/13/2025 11:23 AM, Binbin Wu wrote:
> > On 2/13/2025 2:56 AM, Sean Christopherson wrote:
> > > On Wed, Feb 12, 2025, Binbin Wu wrote:
> > > > On 2/12/2025 8:46 AM, Sean Christopherson wrote:
> > > > > I am completely comfortable saying that KVM doesn't care about STI/SS shadows
> > > > > outside of the HALTED case, and so unless I'm missing something, I think it makes
> > > > > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
> > > > > case, because it's impossible to know if the interrupt is actually unmasked, and
> > > > > statistically it's far, far more likely that it _is_ masked.
> > > > OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
> > > > And use kvm_vcpu_has_events() to replace the open code in this patch.
> > > Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
> > > is set, and pv_unhalted is only cleared on transitions to KVM_MP_STATE_RUNNABLE.
> > > If the guest initiates a spurious wakeup, pv_unhalted could be left set in
> > > perpetuity.
> >
> > Oh, yes.
> > KVM_HC_KICK_CPU is allowed in TDX guests.
And a clever guest can send a REMRD IPI.
> > The change below looks good to me.
> >
> > One minor issue: when a guest initiates a spurious wakeup, pv_unhalted is
> > left set. Later, when the guest wants to halt the vCPU, in
> > __kvm_emulate_halt(), since pv_unhalted is still set, the state will not
> > transition to KVM_MP_STATE_HALTED.
> > But I guess it's the guest's responsibility not to initiate spurious
> > wakeups, and guests need to bear the fact that HLT could fail due to a
> > previous spurious wakeup?
>
> Just found a patch set for fixing the issue.
FWIW, Jim's series doesn't address spurious wakeups per se; it just ensures
pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already
RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it
until the next transition to RUNNING (which implies at least an attempted
transition away from RUNNING).
* Re: [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall
2025-02-12 2:39 ` Binbin Wu
@ 2025-02-13 21:41 ` Edgecombe, Rick P
2025-02-14 0:47 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Edgecombe, Rick P @ 2025-02-13 21:41 UTC (permalink / raw)
To: binbin.wu@linux.intel.com, Gao, Chao
Cc: seanjc@google.com, Huang, Kai, Chatre, Reinette, Li, Xiaoyao,
Hunter, Adrian, Lindgren, Tony, kvm@vger.kernel.org,
pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Wed, 2025-02-12 at 10:39 +0800, Binbin Wu wrote:
> > IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to
> > distinguish
> > between a genuine EPT MISCONFIG and a morphed one, and handle them
> > differently?
> It will be handled separately, which will be in the last section of the KVM
> basic support. But the v2 of "the rest" section is on hold because there is
> a discussion related to MTRR MSR handling:
> https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
> Want to send the v2 of "the rest" section after the MTRR discussion is
> finalized.
I think we can just put back the original MTRR code (the post-KVM-MTRR-removal
version) for the next posting of the rest. The reason being that Sean was
pointing out that it is more architecturally correct given that the CPUID bit
is exposed, so we will need that regardless of the guest solution.
But it would probably be good to update this before re-posting:
https://lore.kernel.org/kvm/20241210004946.3718496-19-binbin.wu@linux.intel.com/#t
Given that the last one got hardly any comments and the most recent patches are
already in kvm-coco-queue, I say we try to review that version a bit more. This
is different than previously discussed. Any objections?
* Re: [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall
2025-02-13 21:41 ` Edgecombe, Rick P
@ 2025-02-14 0:47 ` Binbin Wu
2025-02-14 1:01 ` Edgecombe, Rick P
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-14 0:47 UTC (permalink / raw)
To: Edgecombe, Rick P, Gao, Chao
Cc: seanjc@google.com, Huang, Kai, Chatre, Reinette, Li, Xiaoyao,
Hunter, Adrian, Lindgren, Tony, kvm@vger.kernel.org,
pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On 2/14/2025 5:41 AM, Edgecombe, Rick P wrote:
> On Wed, 2025-02-12 at 10:39 +0800, Binbin Wu wrote:
>>> IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to
>>> distinguish
>>> between a genuine EPT MISCONFIG and a morphed one, and handle them
>>> differently?
>> It will be handled separately, which will be in the last section of the KVM
>> basic support. But the v2 of "the rest" section is on hold because there is
>> a discussion related to MTRR MSR handling:
>> https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
>> Want to send the v2 of "the rest" section after the MTRR discussion is
>> finalized.
> I think we can just put back the original MTRR code (the post-KVM-MTRR-removal
> version) for the next posting of the rest. The reason being that Sean was
> pointing out that it is more architecturally correct given that the CPUID bit
> is exposed, so we will need that regardless of the guest solution.
The original MTRR code before the removal is:
https://lore.kernel.org/kvm/81119d66392bc9446340a16f8a532c7e1b2665a2.1708933498.git.isaku.yamahata@intel.com/
It enforces WB as the default memtype and disables fixed/variable range MTRRs.
That means this solution doesn't allow the guest to use MTRRs as a communication
channel if the guest firmware wants to program some ranges to UC for legacy
devices.
How about allowing TDX guests to access MTRR MSRs, as KVM does for
normal VMs?
Guest kernels may use MTRRs as a crutch to get the desired memtype for devices.
E.g., in most KVM-based setups, legacy devices such as the HPET and TPM are
enumerated via ACPI. And in the Linux kernel, for unknown reasons, ACPI
auto-maps such devices as WB, whereas the dedicated device drivers map memory
as WC or UC. The ACPI mappings rely on firmware to configure the PCI hole (and
other device memory) to be UC in the MTRRs so that they end up UC-, which is
compatible with the drivers' requested WC/UC-.
So KVM needs to allow guests to program the desired value in MTRRs in case
guests want to use MTRRs as a communication channel between guest firmware
and the kernel.
We could allow TDX guests to access MTRR MSRs as KVM does for normal VMs,
i.e., KVM emulates accesses to MTRR MSRs but doesn't virtualize guest MTRR
memory types. One open question is whether to enforce WB as the default MTRR
memtype.
However, TDX disallows toggling CR0.CD. If a TDX guest wants to use MTRRs
as the communication channel, it should skip toggling CR0.CD when it
programs MTRRs both in guest firmware and guest kernel. For a guest, there
is no reason to disable caches because it's in a virtual environment. It
makes sense for guest firmware/kernel to skip toggling CR0.CD when it
detects it's running as a TDX guest.
>
> But it would probably be good to update this before re-posting:
> https://lore.kernel.org/kvm/20241210004946.3718496-19-binbin.wu@linux.intel.com/#t
>
> Given that the last one got hardly any comments and the most recent patches are
> already in kvm-coco-queue, I say we try to review that version a bit more. This
> is different than previously discussed. Any objections?
* Re: [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall
2025-02-14 0:47 ` Binbin Wu
@ 2025-02-14 1:01 ` Edgecombe, Rick P
2025-02-14 1:20 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Edgecombe, Rick P @ 2025-02-14 1:01 UTC (permalink / raw)
To: binbin.wu@linux.intel.com, Gao, Chao
Cc: seanjc@google.com, Huang, Kai, Li, Xiaoyao, Lindgren, Tony,
Hunter, Adrian, Chatre, Reinette, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Fri, 2025-02-14 at 08:47 +0800, Binbin Wu wrote:
>
> On 2/14/2025 5:41 AM, Edgecombe, Rick P wrote:
> > On Wed, 2025-02-12 at 10:39 +0800, Binbin Wu wrote:
> > > > IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to
> > > > distinguish
> > > > between a genuine EPT MISCONFIG and a morphed one, and handle them
> > > > differently?
> > > It will be handled separately, which will be in the last section of the KVM
> > > basic support. But the v2 of "the rest" section is on hold because there is
> > > a discussion related to MTRR MSR handling:
> > > https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
> > > Want to send the v2 of "the rest" section after the MTRR discussion is
> > > finalized.
> > I think we can just put back the original MTRR code (the post-KVM-MTRR-removal
> > version) for the next posting of the rest. The reason being that Sean was
> > pointing out that it is more architecturally correct given that the CPUID bit
> > is exposed, so we will need that regardless of the guest solution.
> The original MTRR code before removing is:
> https://lore.kernel.org/kvm/81119d66392bc9446340a16f8a532c7e1b2665a2.1708933498.git.isaku.yamahata@intel.com/
>
> It enforces WB as default memtype and disables fixed/variable range MTRRs.
> That means this solution doesn't allow guest to use MTRRs as a communication
> channel if the guest firmware wants to program some ranges to UC for legacy
> devices.
I'm talking about the internal version that existed after KVM removed MTRRs for
normal VMs. We are not talking about adding back KVM MTRRs.
>
>
>> How about allowing TDX guests to access MTRR MSRs, as KVM does for
>> normal VMs?
>
> Guest kernels may use MTRRs as a crutch to get the desired memtype for devices.
> E.g., in most KVM-based setups, legacy devices such as the HPET and TPM are
> enumerated via ACPI. And in Linux kernel, for unknown reasons, ACPI auto-maps
> such devices as WB, whereas the dedicated device drivers map memory as WC or
> UC. The ACPI mappings rely on firmware to configure PCI hole (and other device
> memory) to be UC in the MTRRs to end up UC-, which is compatible with the
> drivers' requested WC/UC-.
>
> So KVM needs to allow guests to program the desired value in MTRRs in case
> guests want to use MTRRs as a communication channel between guest firmware
> and the kernel.
>
>> We could allow TDX guests to access MTRR MSRs as KVM does for normal VMs,
>> i.e., KVM emulates accesses to MTRR MSRs but doesn't virtualize guest MTRR
>> memory types. One open question is whether to enforce WB as the default MTRR memtype.
This is basically what we had previously (internally), right?
>
> However, TDX disallows toggling CR0.CD. If a TDX guest wants to use MTRRs
> as the communication channel, it should skip toggling CR0.CD when it
> programs MTRRs both in guest firmware and guest kernel. For a guest, there
> is no reason to disable caches because it's in a virtual environment. It
> makes sense for guest firmware/kernel to skip toggling CR0.CD when it
> detects it's running as a TDX guest.
I don't see why we have to tie exposing MTRRs to a particular solution for the
guest and BIOS. Let's focus on the work we know we need regardless for KVM.
* Re: [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall
2025-02-14 1:01 ` Edgecombe, Rick P
@ 2025-02-14 1:20 ` Binbin Wu
0 siblings, 0 replies; 39+ messages in thread
From: Binbin Wu @ 2025-02-14 1:20 UTC (permalink / raw)
To: Edgecombe, Rick P, Gao, Chao
Cc: seanjc@google.com, Huang, Kai, Li, Xiaoyao, Lindgren, Tony,
Hunter, Adrian, Chatre, Reinette, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On 2/14/2025 9:01 AM, Edgecombe, Rick P wrote:
> On Fri, 2025-02-14 at 08:47 +0800, Binbin Wu wrote:
>> On 2/14/2025 5:41 AM, Edgecombe, Rick P wrote:
>>> On Wed, 2025-02-12 at 10:39 +0800, Binbin Wu wrote:
>>>>> IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to
>>>>> distinguish
>>>>> between a genuine EPT MISCONFIG and a morphed one, and handle them
>>>>> differently?
>>>> It will be handled separately, which will be in the last section of the KVM
>>>> basic support. But the v2 of "the rest" section is on hold because there is
>>>> a discussion related to MTRR MSR handling:
>>>> https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
>>>> Want to send the v2 of "the rest" section after the MTRR discussion is
>>>> finalized.
>>> I think we can just put back the original MTRR code (post KVM MTRR removal
>>> version) for the next posting of the rest. The reason being Sean was pointing
>>> that it is more architecturally correct given that the CPUID bit is exposed. So
>>> we will need that regardless of the guest solution.
>> The original MTRR code before removing is:
>> https://lore.kernel.org/kvm/81119d66392bc9446340a16f8a532c7e1b2665a2.1708933498.git.isaku.yamahata@intel.com/
>>
>> It enforces WB as default memtype and disables fixed/variable range MTRRs.
>> That means this solution doesn't allow guest to use MTRRs as a communication
>> channel if the guest firmware wants to program some ranges to UC for legacy
>> devices.
> I'm talking about the internal version that existed after KVM removed MTRRs for
> normal VMs. We are not talking about adding back KVM MTRRs.
Sorry, I misunderstood it.
>
>>
>> How about to allow TDX guests to access MTRR MSRs as what KVM does for
>> normal VMs?
>>
>> Guest kernels may use MTRRs as a crutch to get the desired memtype for devices.
>> E.g., in most KVM-based setups, legacy devices such as the HPET and TPM are
>> enumerated via ACPI. And in Linux kernel, for unknown reasons, ACPI auto-maps
>> such devices as WB, whereas the dedicated device drivers map memory as WC or
>> UC. The ACPI mappings rely on firmware to configure PCI hole (and other device
>> memory) to be UC in the MTRRs to end up UC-, which is compatible with the
>> drivers' requested WC/UC-.
>>
>> So KVM needs to allow guests to program the desired value in MTRRs in case
>> guests want to use MTRRs as a communication channel between guest firmware
>> and the kernel.
>>
>> Allow TDX guests to access MTRR MSRs as what KVM does for normal VMs, i.e.,
>> KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR memory
>> types. One open question is whether to enforce WB as the default MTRR memtype.
> This is basically what we had previously (internally), right?
Yes. Then we are aligned. :)
>
>> However, TDX disallows toggling CR0.CD. If a TDX guest wants to use MTRRs
>> as the communication channel, it should skip toggling CR0.CD when it
>> programs MTRRs both in guest firmware and guest kernel. For a guest, there
>> is no reason to disable caches because it's in a virtual environment. It
>> makes sense for guest firmware/kernel to skip toggling CR0.CD when it
>> detects it's running as a TDX guest.
> I don't see why we have to tie exposing MTRR to a particular solution for the
> guest and bios. Let's focus on the work we know we need regardless for KVM.
Guests could choose MTRRs or some other SW protocol to communicate the memtype
for devices. I just wanted to point out that if a guest chooses to use MTRRs
as the communication channel, it will face the #VE issue caused by toggling
CR0.CD.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-13 15:17 ` Sean Christopherson
@ 2025-02-17 3:41 ` Binbin Wu
2025-02-19 0:29 ` Sean Christopherson
0 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2025-02-17 3:41 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Gao, Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On 2/13/2025 11:17 PM, Sean Christopherson wrote:
> On Thu, Feb 13, 2025, Binbin Wu wrote:
>> On 2/13/2025 11:23 AM, Binbin Wu wrote:
>>> On 2/13/2025 2:56 AM, Sean Christopherson wrote:
>>>> On Wed, Feb 12, 2025, Binbin Wu wrote:
>>>>> On 2/12/2025 8:46 AM, Sean Christopherson wrote:
>>>>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows
>>>>>> outside of the HALTED case, and so unless I'm missing something, I think it makes
>>>>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
>>>>>> case, because it's impossible to know if the interrupt is actually unmasked, and
>>>>>> statistically it's far, far more likely that it _is_ masked.
>>>>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
>>>>> And use kvm_vcpu_has_events() to replace the open code in this patch.
>>>> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
>>>> is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE.
>>>> If the guest initiates a spurious wakeup, pv_unhalted could be left set in
>>>> perpetuity.
>>> Oh, yes.
>>> KVM_HC_KICK_CPU is allowed in TDX guests.
> And a clever guest can send a REMRD IPI.
>
>>> The change below looks good to me.
>>>
>>> One minor issue is when guest initiates a spurious wakeup, pv_unhalted is
>>> left set, then later when the guest want to halt the vcpu, in
>>> __kvm_emulate_halt(), since pv_unhalted is still set and the state will not
>>> transit to KVM_MP_STATE_HALTED.
>>> But I guess it's guests' responsibility to not initiate spurious wakeup,
>>> guests need to bear the fact that HLT could fail due to a previous
>>> spurious wakeup?
>> Just found a patch set for fixing the issue.
> FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures
> pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already
> RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it
> until the next transition to RUNNING (which implies at least an attempted
> transition away from RUNNING).
>
Indeed.
I am wondering why KVM doesn't clear pv_unhalted before the vCPU enters the guest.
Is the additional memory access a concern, or is there some other reason?
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-17 3:41 ` Binbin Wu
@ 2025-02-19 0:29 ` Sean Christopherson
2025-02-19 0:49 ` Binbin Wu
0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2025-02-19 0:29 UTC (permalink / raw)
To: Binbin Wu
Cc: Chao Gao, Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On Mon, Feb 17, 2025, Binbin Wu wrote:
> On 2/13/2025 11:17 PM, Sean Christopherson wrote:
> > On Thu, Feb 13, 2025, Binbin Wu wrote:
> > > On 2/13/2025 11:23 AM, Binbin Wu wrote:
> > > > On 2/13/2025 2:56 AM, Sean Christopherson wrote:
> > > > > On Wed, Feb 12, 2025, Binbin Wu wrote:
> > > > > > On 2/12/2025 8:46 AM, Sean Christopherson wrote:
> > > > > > > I am completely comfortable saying that KVM doesn't care about STI/SS shadows
> > > > > > > outside of the HALTED case, and so unless I'm missing something, I think it makes
> > > > > > > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
> > > > > > > case, because it's impossible to know if the interrupt is actually unmasked, and
> > > > > > > statistically it's far, far more likely that it _is_ masked.
> > > > > > OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
> > > > > > And use kvm_vcpu_has_events() to replace the open code in this patch.
> > > > > Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
> > > > > is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE.
> > > > > If the guest initiates a spurious wakeup, pv_unhalted could be left set in
> > > > > perpetuity.
> > > > Oh, yes.
> > > > KVM_HC_KICK_CPU is allowed in TDX guests.
> > And a clever guest can send a REMRD IPI.
> >
> > > > The change below looks good to me.
> > > >
> > > > One minor issue is when guest initiates a spurious wakeup, pv_unhalted is
> > > > left set, then later when the guest want to halt the vcpu, in
> > > > __kvm_emulate_halt(), since pv_unhalted is still set and the state will not
> > > > transit to KVM_MP_STATE_HALTED.
> > > > But I guess it's guests' responsibility to not initiate spurious wakeup,
> > > > guests need to bear the fact that HLT could fail due to a previous
> > > > spurious wakeup?
> > > Just found a patch set for fixing the issue.
> > FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures
> > pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already
> > RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it
> > until the next transition to RUNNING (which implies at least an attempted
> > transition away from RUNNING).
> >
> Indeed.
>
> I am wondering why KVM doesn't clear pv_unhalted before the vcpu entering guest?
> Is the additional memory access a concern or is there some other reason?
Not clearing pv_unhalted when entering the guest is necessary to avoid races.
Stating the obvious, the guest must set up all of its lock tracking before executing
HLT, which means that the soon-to-be-blocking vCPU is eligible for wakeup *before*
it executes HLT. If an asynchronous exit happens on the vCPU at just the right
time, KVM could clear pv_unhalted before the vCPU executes HLT.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
2025-02-19 0:29 ` Sean Christopherson
@ 2025-02-19 0:49 ` Binbin Wu
0 siblings, 0 replies; 39+ messages in thread
From: Binbin Wu @ 2025-02-19 0:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Gao, Yan Zhao, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, linux-kernel
On 2/19/2025 8:29 AM, Sean Christopherson wrote:
> On Mon, Feb 17, 2025, Binbin Wu wrote:
>> On 2/13/2025 11:17 PM, Sean Christopherson wrote:
>>> On Thu, Feb 13, 2025, Binbin Wu wrote:
>>>> On 2/13/2025 11:23 AM, Binbin Wu wrote:
>>>>> On 2/13/2025 2:56 AM, Sean Christopherson wrote:
>>>>>> On Wed, Feb 12, 2025, Binbin Wu wrote:
>>>>>>> On 2/12/2025 8:46 AM, Sean Christopherson wrote:
>>>>>>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows
>>>>>>>> outside of the HALTED case, and so unless I'm missing something, I think it makes
>>>>>>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
>>>>>>>> case, because it's impossible to know if the interrupt is actually unmasked, and
>>>>>>>> statistically it's far, far more likely that it _is_ masked.
>>>>>>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
>>>>>>> And use kvm_vcpu_has_events() to replace the open code in this patch.
>>>>>> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
>>>>>> is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE.
>>>>>> If the guest initiates a spurious wakeup, pv_unhalted could be left set in
>>>>>> perpetuity.
>>>>> Oh, yes.
>>>>> KVM_HC_KICK_CPU is allowed in TDX guests.
>>> And a clever guest can send a REMRD IPI.
>>>
>>>>> The change below looks good to me.
>>>>>
>>>>> One minor issue is when guest initiates a spurious wakeup, pv_unhalted is
>>>>> left set, then later when the guest want to halt the vcpu, in
>>>>> __kvm_emulate_halt(), since pv_unhalted is still set and the state will not
>>>>> transit to KVM_MP_STATE_HALTED.
>>>>> But I guess it's guests' responsibility to not initiate spurious wakeup,
>>>>> guests need to bear the fact that HLT could fail due to a previous
>>>>> spurious wakeup?
>>>> Just found a patch set for fixing the issue.
>>> FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures
>>> pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already
>>> RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it
>>> until the next transition to RUNNING (which implies at least an attempted
>>> transition away from RUNNING).
>>>
>> Indeed.
>>
>> I am wondering why KVM doesn't clear pv_unhalted before the vcpu entering guest?
>> Is the additional memory access a concern or is there some other reason?
> Not clearing pv_unhalted when entering the guest is necessary to avoid races.
> Stating the obvious, the guest must set up all of its lock tracking before executing
> HLT, which means that the soon-to-be-blocking vCPU is eligible for wakeup *before*
> it executes HLT. If an asynchronous exit happens on the vCPU at just the right
> time, KVM could clear pv_unhalted before the vCPU executes HLT.
>
Got it.
Thanks for the explanation.
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2025-02-19 0:49 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-11 2:54 [PATCH v2 0/8] KVM: TDX: TDX hypercalls may exit to userspace Binbin Wu
2025-02-11 2:54 ` [PATCH v2 1/8] KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs Binbin Wu
2025-02-11 5:05 ` Huang, Kai
2025-02-11 10:23 ` Xiaoyao Li
2025-02-12 1:32 ` Binbin Wu
2025-02-12 3:12 ` Xiaoyao Li
2025-02-11 2:54 ` [PATCH v2 2/8] KVM: TDX: Add a place holder to handle TDX VM exit Binbin Wu
2025-02-11 2:54 ` [PATCH v2 3/8] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL) Binbin Wu
2025-02-11 8:41 ` Chao Gao
2025-02-11 9:08 ` Binbin Wu
2025-02-11 23:46 ` Sean Christopherson
2025-02-12 2:21 ` Binbin Wu
2025-02-11 2:54 ` [PATCH v2 4/8] KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL Binbin Wu
2025-02-11 23:48 ` Sean Christopherson
2025-02-11 2:54 ` [PATCH v2 5/8] KVM: TDX: Handle TDG.VP.VMCALL<MapGPA> Binbin Wu
2025-02-11 6:54 ` Yan Zhao
2025-02-11 8:11 ` Binbin Wu
2025-02-11 8:59 ` Chao Gao
2025-02-12 0:46 ` Sean Christopherson
2025-02-12 5:16 ` Binbin Wu
2025-02-12 18:56 ` Sean Christopherson
2025-02-13 3:23 ` Binbin Wu
2025-02-13 5:11 ` Binbin Wu
2025-02-13 15:17 ` Sean Christopherson
2025-02-17 3:41 ` Binbin Wu
2025-02-19 0:29 ` Sean Christopherson
2025-02-19 0:49 ` Binbin Wu
2025-02-11 2:54 ` [PATCH v2 6/8] KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError> Binbin Wu
2025-02-12 0:18 ` Sean Christopherson
2025-02-12 5:37 ` Binbin Wu
2025-02-12 13:53 ` Sean Christopherson
2025-02-11 2:54 ` [PATCH v2 7/8] KVM: TDX: Handle TDX PV port I/O hypercall Binbin Wu
2025-02-11 2:54 ` [PATCH v2 8/8] KVM: TDX: Handle TDX PV MMIO hypercall Binbin Wu
2025-02-12 2:28 ` Chao Gao
2025-02-12 2:39 ` Binbin Wu
2025-02-13 21:41 ` Edgecombe, Rick P
2025-02-14 0:47 ` Binbin Wu
2025-02-14 1:01 ` Edgecombe, Rick P
2025-02-14 1:20 ` Binbin Wu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox