kvm.vger.kernel.org archive mirror
* [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry
@ 2025-01-13  2:09 Yan Zhao
  2025-01-13  2:10 ` [PATCH 1/7] KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters TDX_OPERAND_BUSY Yan Zhao
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Yan Zhao @ 2025-01-13  2:09 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata, Yan Zhao

This series aims to provide a clean solution that avoids the blind retries
of the previous hack [1] in "TDX MMU Part 2", following the initial
discussions at [2], further discussions in the RFC, and the PUCK call [3].

A full analysis of the lock status for each SEAMCALL relevant to KVM is
available at [4].

This series categorizes the SEPT-related SEAMCALLs (used for page
installation and uninstallation) into three groups:

Group 1: tdh_mem_page_add().
       - Invoked only during TD build time.
       - Proposal: Return -EBUSY on TDX_OPERAND_BUSY.
       - Patch 1.

Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
       - Invoked for TD runtime page installation.
       - Proposal: Retry locally in the TDX EPT violation handler for
         RET_PF_RETRY.
       - Patches 2-3.

Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
       - Invoked for page uninstallation, with KVM mmu_lock held for write.
       - Proposal: Kick off vCPUs and block vCPU entry on TDX_OPERAND_BUSY.
       - Patch 4.

Patches 5/6/7 are fixup patches:
Patch 5: Return -EBUSY instead of -EAGAIN when tdh_mem_sept_add() is busy.
Patch 6: Remove the retry loop for tdh_phymem_page_wbinvd_hkid().
Patch 7: Warn on force_immediate_exit in tdx_vcpu_run().

Code base: kvm-coco-queue 2f30b837bf7b.
It applies at the tail of the branch, as it depends on
commit 8e801e55ba8f ("KVM: TDX: Handle EPT violation/misconfig exit").

Thanks
Yan

RFC --> v1:
- Split patch 1 in RFC into patches 1,2,3,5, and add new fixup patches 6/7.
- Add contention analysis of tdh_mem_page_add() in patch 1 log.
- Provide justification in patch 2 log and add checks for RET_PF_CONTINUE.
- Use a per-VM flag wait_for_sept_zap plus KVM_REQ_OUTSIDE_GUEST_MODE
  instead of an arch-specific request to prevent vCPUs from TD entry in
  patch 4 (Sean).

RFC: https://lore.kernel.org/all/20241121115139.26338-1-yan.y.zhao@intel.com
[1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
[2] https://lore.kernel.org/kvm/20240904030751.117579-10-rick.p.edgecombe@intel.com/
[3] https://drive.google.com/drive/folders/1k0qOarKuZXpzRsKDtVeC5Lpl9-amJ6AJ?resourcekey=0-l9uVpVEBC34Uar1ReaqisQ
[4] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com

Yan Zhao (7):
  KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters
    TDX_OPERAND_BUSY
  KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
  KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
  fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU
    mirror page table
  fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU
    mirror page table
  fixup! KVM: TDX: Implement TDX vcpu enter/exit path

 arch/x86/kvm/mmu/mmu.c          |  10 ++-
 arch/x86/kvm/mmu/mmu_internal.h |  12 ++-
 arch/x86/kvm/vmx/tdx.c          | 135 ++++++++++++++++++++++++++------
 arch/x86/kvm/vmx/tdx.h          |   7 ++
 4 files changed, 134 insertions(+), 30 deletions(-)

-- 
2.43.2


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/7] KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters TDX_OPERAND_BUSY
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
@ 2025-01-13  2:10 ` Yan Zhao
  2025-01-14 22:24   ` Edgecombe, Rick P
  2025-01-13  2:11 ` [PATCH 2/7] KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault() Yan Zhao
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2025-01-13  2:10 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata

tdh_mem_page_add() is called during TD build time. Within the TDX module,
it acquires the exclusive lock on the TDR resource (eliminating the need to
hold locks for TDCS/SEPT tree) and the exclusive lock on the PAMT entry for
the page to be added. The TDX module returns TDX_OPERAND_BUSY if
tdh_mem_page_add() contends with other SEAMCALLs.

SEAMCALL                Lock Type        Resource
-----------------------------------------------------------------------
tdh_mem_page_add        EXCLUSIVE        TDR
                        NO_LOCK          TDCS
                        NO_LOCK          SEPT tree
                        EXCLUSIVE        PAMT entry for the page to add

Given (1)-(4) and the expected behavior from userspace (5), KVM doesn't
expect tdh_mem_page_add() to encounter TDX_OPERAND_BUSY:
(1) tdx_vcpu_create() only allows vCPU creation when the TD state is
    TD_STATE_INITIALIZED, so tdh_mem_page_add(), as invoked in vCPU ioctl,
    does not contend with
    tdh_mng_create()/tdh_mng_addcx()/tdh_mng_key_config()/tdh_mng_init().

(2) tdx_vcpu_ioctl() bails out on TD_STATE_RUNNABLE, so tdh_mem_page_add()
    does not contend with tdh_vp_enter()/tdh_mem_page_aug()/tdh_mem_track()
    and TDCALLs.

(3) By holding slots_lock and the filemap invalidate lock,
    tdh_mem_page_add() does not contend with tdh_mr_finalize(),
    tdh_mem_page_remove()/tdh_mem_range_block()/
    tdh_phymem_page_wbinvd_hkid() or another tdh_mem_page_add(),
    tdh_mem_sept_add()/tdh_mr_extend().

(4) By holding reference to kvm, tdh_mem_page_add() does not contend with
    tdh_mng_vpflushdone()/tdh_phymem_cache_wb()/tdh_mng_key_freeid()/
    tdh_phymem_page_wbinvd_tdr()/tdh_phymem_page_reclaim().

(5) A well-behaved userspace invokes ioctl KVM_TDX_INIT_MEM_REGION on one
    vCPU after initializing all vCPUs and does not invoke ioctls on the
    other vCPUs before the KVM_TDX_INIT_MEM_REGION completes. Thus,
    tdh_mem_page_add() does not contend with
    tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
    tdh_mng_rd()/tdh_vp_flush() on the other vCPUs.

However, if userspace violates (5), tdh_mem_page_add() could encounter
TDX_OPERAND_BUSY when trying to acquire the exclusive lock on the TDR
resource in the TDX module. In this case, simply return -EBUSY.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
tdx_vcpu_pre_run() will check TD_STATE_RUNNABLE for (2).
https://lore.kernel.org/kvm/3576c721-3ef2-40bd-8764-b50912df93a2@intel.com/
---
 arch/x86/kvm/vmx/tdx.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d0dc3200fa37..1cf3ef0faff7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3024,13 +3024,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	}
 
 	ret = 0;
-	do {
-		err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
-				       pfn_to_hpa(page_to_pfn(page)),
-				       &entry, &level_state);
-	} while (err == TDX_ERROR_SEPT_BUSY);
+	err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
+			       pfn_to_hpa(page_to_pfn(page)),
+			       &entry, &level_state);
 	if (err) {
-		ret = -EIO;
+		ret = unlikely(err & TDX_OPERAND_BUSY) ? -EBUSY : -EIO;
 		goto out;
 	}
 
-- 
2.43.2



* [PATCH 2/7] KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
  2025-01-13  2:10 ` [PATCH 1/7] KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters TDX_OPERAND_BUSY Yan Zhao
@ 2025-01-13  2:11 ` Yan Zhao
  2025-01-14 22:24   ` Edgecombe, Rick P
  2025-01-13  2:12 ` [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY Yan Zhao
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2025-01-13  2:11 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata

Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID)
instead of 1 in kvm_mmu_page_fault().

The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e.,
npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(),
kvm_handle_page_fault()). They either check if the return value is > 0 (as
in npf_interception()) or pass it further to vcpu_run() to decide whether
to break out of the kernel loop and return to the user when r <= 0.
Therefore, returning any positive value is equivalent to returning 1.

Warn if r == RET_PF_CONTINUE (which is not a valid return value here) to
ensure a positive return value.

This is a preparation to allow TDX's EPT violation handler to check the
RET_PF* value and retry internally for RET_PF_RETRY.

No functional changes are intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c          | 10 +++++++++-
 arch/x86/kvm/mmu/mmu_internal.h | 12 +++++++++---
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eedc6ff37b89..53dcf600e934 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6120,8 +6120,16 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	else if (r == RET_PF_SPURIOUS)
 		vcpu->stat.pf_spurious++;
 
+	/*
+	 * None of handle_mmio_page_fault(), kvm_mmu_do_page_fault(), or
+	 * kvm_mmu_write_protect_fault() return RET_PF_CONTINUE.
+	 * kvm_mmu_do_page_fault() only uses RET_PF_CONTINUE internally to
+	 * indicate continuing the page fault handling until the final
+	 * page table mapping phase.
+	 */
+	WARN_ON_ONCE(r == RET_PF_CONTINUE);
 	if (r != RET_PF_EMULATE)
-		return 1;
+		return r;
 
 emulate:
 	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 957636660149..4fde91cade1b 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -315,9 +315,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
  *
  * Note, all values must be greater than or equal to zero so as not to encroach
- * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
- * will allow for efficient machine code when checking for CONTINUE, e.g.
- * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
+ * on -errno return values.
  */
 enum {
 	RET_PF_CONTINUE = 0,
@@ -329,6 +327,14 @@ enum {
 	RET_PF_SPURIOUS,
 };
 
+/*
+ * Define RET_PF_CONTINUE as 0 to allow for
+ * - efficient machine code when checking for CONTINUE, e.g.
+ *   "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero,
+ * - kvm_mmu_do_page_fault() to return other RET_PF_* as a positive value.
+ */
+static_assert(RET_PF_CONTINUE == 0);
+
 static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 						     struct kvm_page_fault *fault)
 {
-- 
2.43.2



* [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
  2025-01-13  2:10 ` [PATCH 1/7] KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters TDX_OPERAND_BUSY Yan Zhao
  2025-01-13  2:11 ` [PATCH 2/7] KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault() Yan Zhao
@ 2025-01-13  2:12 ` Yan Zhao
  2025-01-17 21:14   ` Sean Christopherson
  2025-01-13  2:12 ` [PATCH 4/7] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2025-01-13  2:12 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata

Retry locally in the TDX EPT violation handler for private memory to reduce
the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with
tdh_vp_enter().

TDX EPT violation installs private pages via tdh_mem_sept_add() and
tdh_mem_page_aug(). The two may contend with tdh_vp_enter() or
TDCALLs.

Resources    SHARED  users      EXCLUSIVE users
------------------------------------------------------------
SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
           tdh_mem_page_aug
------------------------------------------------------------
SEPT entry                      tdh_mem_sept_add (Host lock)
                                tdh_mem_page_aug (Host lock)
                                tdg_mem_page_accept (Guest lock)
                                tdg_mem_page_attr_rd (Guest lock)
                                tdg_mem_page_attr_wr (Guest lock)

Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and
TDCALLs may be removed in a future TDX module, their contention with
tdh_vp_enter() due to 0-step mitigation still persists.

The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER,
which works as follows:
0. Each TDH.VP.ENTER records the guest RIP on TD entry.
1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it
   checks if the guest RIP is the same as last guest RIP on TD entry.
   -if yes, it means the EPT violation is caused by the same instruction
            that caused the last VM exit.
            Then, the TDX module increases the guest RIP no-progress count.
            When the count increases from 0 to the threshold (currently 6),
            the TDX module records the faulting GPA into a
            last_epf_gpa_list.
   -if no,  it means the guest RIP has made progress.
            So, the TDX module resets the RIP no-progress count and the
            last_epf_gpa_list.
2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on
   TD entry) checks if the last_epf_gpa_list is empty.
   -if yes, TD entry continues without acquiring the lock on the SEPT tree.
   -if no,  it triggers the 0-step mitigation by acquiring the exclusive
            lock on SEPT tree, walking the EPT tree to check if all page
            faults caused by the GPAs in the last_epf_gpa_list have been
            resolved before continuing TD entry.

Since the KVM TDP MMU usually re-enters the guest whenever it exits to
userspace (e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is
possible for tdh_vp_enter() to be called more than the threshold number of
times before a page fault is resolved, triggering contention when
tdh_vp_enter() attempts to acquire the exclusive lock on the SEPT tree.

Retry locally in the TDX EPT violation handler to reduce the number of
tdh_vp_enter() invocations, hence reducing the possibility of contention
with tdh_mem_sept_add()/tdh_mem_page_aug(). However, the 0-step mitigation
and the contention are still not fully eliminated due to
KVM_EXIT_MEMORY_FAULT, signals/interrupts, and cases where one instruction
faults on more GFNs than the threshold count.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1cf3ef0faff7..bb9d914765fc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1854,6 +1854,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa = tdexit_gpa(vcpu);
 	unsigned long exit_qual;
+	bool local_retry = false;
+	int ret;
 
 	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
 		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
@@ -1872,6 +1874,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 		 * due to aliasing a single HPA to multiple GPAs.
 		 */
 		exit_qual = EPT_VIOLATION_ACC_WRITE;
+
+		/*
+		 * Mapping of private memory may return RET_PF_RETRY due to
+		 * SEAMCALL contention, e.g.
+		 * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU may
+		 *   contend with TDH.VP.ENTER (due to 0-step mitigation)
+		 *   on a remote vCPU.
+		 * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU may
+		 *   contend with TDG.MEM.PAGE.ACCEPT on a remote vCPU.
+		 *
+		 * Retry internally in TDX to prevent exacerbating the
+		 * activation of 0-step mitigation on local vCPU.
+		 * However, despite these retries, the 0-step mitigation on the
+		 * local vCPU may still be triggered due to:
+		 * - Exiting on signals, interrupts.
+		 * - KVM_EXIT_MEMORY_FAULT.
+		 */
+		local_retry = true;
 	} else {
 		exit_qual = tdexit_exit_qual(vcpu);
 		/*
@@ -1884,7 +1904,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	}
 
 	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
-	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+
+	while (1) {
+		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+
+		if (ret != RET_PF_RETRY || !local_retry)
+			break;
+
+		/*
+		 * Break and keep the orig return value.
+		 * Signal & irq handling will be done later in vcpu_run()
+		 */
+		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
+		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
+			break;
+
+		cond_resched();
+	}
+	return ret;
 }
 
 int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
-- 
2.43.2



* [PATCH 4/7] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
                   ` (2 preceding siblings ...)
  2025-01-13  2:12 ` [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY Yan Zhao
@ 2025-01-13  2:12 ` Yan Zhao
  2025-01-16  6:23   ` Binbin Wu
  2025-01-13  2:13 ` [PATCH 5/7] fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Yan Zhao
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2025-01-13  2:12 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata

Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever
tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters
contention, since the page removal path does not expect errors and is less
sensitive to the performance penalty caused by kicking off vCPUs.

Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock,
KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX
module.
- tdh_mem_track() may contend with tdh_vp_enter().
- tdh_mem_range_block()/tdh_mem_page_remove() may contend with
  tdh_vp_enter() and TDCALLs.

Resources     SHARED users      EXCLUSIVE users
------------------------------------------------------------
TDCS epoch    tdh_vp_enter      tdh_mem_track
------------------------------------------------------------
SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
                                tdh_mem_range_block
------------------------------------------------------------
SEPT entry                      tdh_mem_range_block (Host lock)
                                tdh_mem_page_remove (Host lock)
                                tdg_mem_page_accept (Guest lock)
                                tdg_mem_page_attr_rd (Guest lock)
                                tdg_mem_page_attr_wr (Guest lock)

Use a TDX-specific per-VM flag wait_for_sept_zap along with
KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering
the TD, thereby avoiding the potential contention. Apply the kick-off and
entry blocking only after a SEAMCALL returns busy, to minimize the window
with no TD entry, as contention due to 0-step mitigation or TDCALLs is
expected to be rare.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 62 ++++++++++++++++++++++++++++++++++++------
 arch/x86/kvm/vmx/tdx.h |  7 +++++
 2 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index bb9d914765fc..09677a4cd605 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -312,6 +312,26 @@ static void tdx_clear_page(unsigned long page_pa)
 	__mb();
 }
 
+static void tdx_no_vcpus_enter_start(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	WRITE_ONCE(kvm_tdx->wait_for_sept_zap, true);
+
+	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+}
+
+static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	WRITE_ONCE(kvm_tdx->wait_for_sept_zap, false);
+}
+
 /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
 static int __tdx_reclaim_page(hpa_t pa)
 {
@@ -979,6 +999,14 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 		return EXIT_FASTPATH_NONE;
 	}
 
+	/*
+	 * Wait until retry of SEPT-zap-related SEAMCALL completes before
+	 * allowing vCPU entry to avoid contention with tdh_vp_enter() and
+	 * TDCALLs.
+	 */
+	if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
+		return EXIT_FASTPATH_EXIT_HANDLED;
+
 	trace_kvm_entry(vcpu, force_immediate_exit);
 
 	if (pi_test_on(&tdx->pi_desc)) {
@@ -1647,15 +1675,23 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
 		return -EINVAL;
 
-	do {
+	/*
+	 * When zapping a private page, the write lock is held, so there is no
+	 * race with SEPT operations on other vCPUs. Races are only possible
+	 * with TDH.VP.ENTER (due to 0-step mitigation) and guest TDCALLs.
+	 */
+	err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
+				  &level_state);
+	if ((err & TDX_OPERAND_BUSY)) {
 		/*
-		 * When zapping private page, write lock is held. So no race
-		 * condition with other vcpu sept operation.  Race only with
-		 * TDH.VP.ENTER.
+		 * The second retry is expected to succeed after kicking off all
+		 * other vCPUs and preventing them from invoking TDH.VP.ENTER.
 		 */
+		tdx_no_vcpus_enter_start(kvm);
 		err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
 					  &level_state);
-	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+		tdx_no_vcpus_enter_stop(kvm);
+	}
 
 	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
 		     err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
@@ -1726,8 +1762,12 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	WARN_ON_ONCE(level != PG_LEVEL_4K);
 
 	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
-	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
-		return -EAGAIN;
+	if (unlikely(err & TDX_OPERAND_BUSY)) {
+		/* With vCPU entry blocked, the second retry is expected to succeed */
+		tdx_no_vcpus_enter_start(kvm);
+		err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
+		tdx_no_vcpus_enter_stop(kvm);
+	}
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
 		return -EIO;
@@ -1770,9 +1810,13 @@ static void tdx_track(struct kvm *kvm)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	do {
+	err = tdh_mem_track(kvm_tdx->tdr_pa);
+	if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY) {
+		/* With vCPU entry blocked, the second retry is expected to succeed */
+		tdx_no_vcpus_enter_start(kvm);
 		err = tdh_mem_track(kvm_tdx->tdr_pa);
-	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+		tdx_no_vcpus_enter_stop(kvm);
+	}
 
 	if (KVM_BUG_ON(err, kvm))
 		pr_tdx_error(TDH_MEM_TRACK, err);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 0833d1084331..e369a6f8721b 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -48,6 +48,13 @@ struct kvm_tdx {
 
 	/* For KVM_TDX_INIT_MEM_REGION. */
 	atomic64_t nr_premapped;
+
+	/*
+	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
+	 * not contend with tdh_vp_enter() and TDCALLs.
+	 * Set/unset is protected by kvm->mmu_lock.
+	 */
+	bool wait_for_sept_zap;
 };
 
 /* TDX module vCPU states */
-- 
2.43.2



* [PATCH 5/7] fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
                   ` (3 preceding siblings ...)
  2025-01-13  2:12 ` [PATCH 4/7] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
@ 2025-01-13  2:13 ` Yan Zhao
  2025-01-16  6:30   ` Binbin Wu
  2025-01-13  2:13 ` [PATCH 6/7] " Yan Zhao
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2025-01-13  2:13 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata

Return -EBUSY instead of -EAGAIN when tdh_mem_sept_add() encounters any
TDX_OPERAND_BUSY error.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 09677a4cd605..4fb9faca5db2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1740,8 +1740,9 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 
 	err = tdh_mem_sept_add(to_kvm_tdx(kvm)->tdr_pa, gpa, tdx_level, hpa, &entry,
 			       &level_state);
-	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
-		return -EAGAIN;
+	if (unlikely(err & TDX_OPERAND_BUSY))
+		return -EBUSY;
+
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state);
 		return -EIO;
-- 
2.43.2



* [PATCH 6/7] fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
                   ` (4 preceding siblings ...)
  2025-01-13  2:13 ` [PATCH 5/7] fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Yan Zhao
@ 2025-01-13  2:13 ` Yan Zhao
  2025-01-13  2:13 ` [PATCH 7/7] fixup! KVM: TDX: Implement TDX vcpu enter/exit path Yan Zhao
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Yan Zhao @ 2025-01-13  2:13 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata

Remove the retry loop for tdh_phymem_page_wbinvd_hkid().

tdh_phymem_page_wbinvd_hkid() only acquires the lock on the PAMT entry of
the page to be invalidated, so it's not expected to encounter
TDX_OPERAND_BUSY.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4fb9faca5db2..baabae95504b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1712,14 +1712,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 	}
 
-	do {
-		/*
-		 * TDX_OPERAND_BUSY can happen on locking PAMT entry.  Because
-		 * this page was removed above, other thread shouldn't be
-		 * repeatedly operating on this page.  Just retry loop.
-		 */
-		err = tdh_phymem_page_wbinvd_hkid(hpa, (u16)kvm_tdx->hkid);
-	} while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+	err = tdh_phymem_page_wbinvd_hkid(hpa, (u16)kvm_tdx->hkid);
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
-- 
2.43.2



* [PATCH 7/7] fixup! KVM: TDX: Implement TDX vcpu enter/exit path
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
                   ` (5 preceding siblings ...)
  2025-01-13  2:13 ` [PATCH 6/7] " Yan Zhao
@ 2025-01-13  2:13 ` Yan Zhao
  2025-01-14 22:27 ` [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Edgecombe, Rick P
  2025-01-15 16:43 ` Paolo Bonzini
  8 siblings, 0 replies; 27+ messages in thread
From: Yan Zhao @ 2025-01-13  2:13 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata

Warn on force_immediate_exit in tdx_vcpu_run().

force_immediate_exit requires entering the vCPU for event injection,
followed by an immediate exit. But the TDX module doesn't guarantee entry;
it's already possible for KVM to _think_ it completed entry to the guest
without actually having done so.

Since KVM never needs to force an immediate exit for TDX, and can't do
direct injection, there's no need to implement force_immediate_exit, i.e.
to kick the vCPU and reinject events. Simply warn on force_immediate_exit.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index baabae95504b..0e684f4683f2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -999,6 +999,16 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 		return EXIT_FASTPATH_NONE;
 	}
 
+	/*
+	 * force_immediate_exit requires entering the vCPU for event injection,
+	 * followed by an immediate exit. But the TDX module doesn't guarantee
+	 * entry; it's already possible for KVM to _think_ it completed entry
+	 * to the guest without actually having done so.
+	 * Since KVM never needs to force an immediate exit for TDX, and can't
+	 * do direct injection, just warn on force_immediate_exit.
+	 */
+	WARN_ON_ONCE(force_immediate_exit);
+
 	/*
 	 * Wait until retry of SEPT-zap-related SEAMCALL completes before
 	 * allowing vCPU entry to avoid contention with tdh_vp_enter() and
-- 
2.43.2



* Re: [PATCH 1/7] KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters TDX_OPERAND_BUSY
  2025-01-13  2:10 ` [PATCH 1/7] KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters TDX_OPERAND_BUSY Yan Zhao
@ 2025-01-14 22:24   ` Edgecombe, Rick P
  2025-01-15  4:59     ` Yan Zhao
  0 siblings, 1 reply; 27+ messages in thread
From: Edgecombe, Rick P @ 2025-01-14 22:24 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	Lindgren, Tony, linux-kernel@vger.kernel.org, Chatre, Reinette,
	dmatlack@google.com, Hunter, Adrian, Yamahata, Isaku,
	isaku.yamahata@gmail.com

On Mon, 2025-01-13 at 10:10 +0800, Yan Zhao wrote:
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index d0dc3200fa37..1cf3ef0faff7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3024,13 +3024,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	}
>  
>  	ret = 0;
> -	do {
> -		err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
> -				       pfn_to_hpa(page_to_pfn(page)),
> -				       &entry, &level_state);
> -	} while (err == TDX_ERROR_SEPT_BUSY);
> +	err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
> +			       pfn_to_hpa(page_to_pfn(page)),
> +			       &entry, &level_state);
>  	if (err) {
> -		ret = -EIO;
> +		ret = unlikely(err & TDX_OPERAND_BUSY) ? -EBUSY : -EIO;
>  		goto out;
>  	}

Should we just squash this into "KVM: TDX: Add an ioctl to create initial guest
memory"? I guess we get a little more specific log history on this corner as a
separate patch, but seems strange to add and remove a loop before it even can
get exercised.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/7] KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  2025-01-13  2:11 ` [PATCH 2/7] KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault() Yan Zhao
@ 2025-01-14 22:24   ` Edgecombe, Rick P
  2025-01-15  4:58     ` Yan Zhao
  0 siblings, 1 reply; 27+ messages in thread
From: Edgecombe, Rick P @ 2025-01-14 22:24 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	Lindgren, Tony, linux-kernel@vger.kernel.org, Chatre, Reinette,
	dmatlack@google.com, Hunter, Adrian, Yamahata, Isaku,
	isaku.yamahata@gmail.com

On Mon, 2025-01-13 at 10:11 +0800, Yan Zhao wrote:
> Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID)
> instead of 1 in kvm_mmu_page_fault().
> 
> The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e.,
> npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(),
> kvm_handle_page_fault()). They either check if the return value is > 0 (as
> in npf_interception()) or pass it further to vcpu_run() to decide whether
> to break out of the kernel loop and return to the user when r <= 0.
> Therefore, returning any positive value is equivalent to returning 1.
> 
> Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure
> a positive return value.
> 
> This is a preparation to allow TDX's EPT violation handler to check the
> RET_PF* value and retry internally for RET_PF_RETRY.
> 
> No functional changes are intended.

Any reason why this can't go ahead of the TDX patches? Seems pretty generic
cleanup.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
                   ` (6 preceding siblings ...)
  2025-01-13  2:13 ` [PATCH 7/7] fixup! KVM: TDX: Implement TDX vcpu enter/exit path Yan Zhao
@ 2025-01-14 22:27 ` Edgecombe, Rick P
  2025-01-15 16:43 ` Paolo Bonzini
  8 siblings, 0 replies; 27+ messages in thread
From: Edgecombe, Rick P @ 2025-01-14 22:27 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	Lindgren, Tony, linux-kernel@vger.kernel.org, Chatre, Reinette,
	dmatlack@google.com, Hunter, Adrian, Yamahata, Isaku,
	isaku.yamahata@gmail.com

We should probably...

On Mon, 2025-01-13 at 10:09 +0800, Yan Zhao wrote:
>   KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters
>     TDX_OPERAND_BUSY

Squash this in "KVM: TDX: Add an ioctl to create initial guest
memory" in kvm-coco-queue.

>   KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()

Recommend this to go through kvm/x86 tree separately?

>   KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
>   KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal

Include these in kvm-coco-queue and drop "x86/virt/tdx: Retry seamcall when
TDX_OPERAND_BUSY with operand SEPT"

>   fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU
>     mirror page table
>   fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU
>     mirror page table

Squash these into kvm-coco-queue.

>   fixup! KVM: TDX: Implement TDX vcpu enter/exit path

Add this in the next version of "TDX vcpu enter/exit path".

How does that sound? 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/7] KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  2025-01-14 22:24   ` Edgecombe, Rick P
@ 2025-01-15  4:58     ` Yan Zhao
  0 siblings, 0 replies; 27+ messages in thread
From: Yan Zhao @ 2025-01-15  4:58 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	Lindgren, Tony, linux-kernel@vger.kernel.org, Chatre, Reinette,
	dmatlack@google.com, Hunter, Adrian, Yamahata, Isaku,
	isaku.yamahata@gmail.com

On Wed, Jan 15, 2025 at 06:24:43AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-01-13 at 10:11 +0800, Yan Zhao wrote:
> > Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID)
> > instead of 1 in kvm_mmu_page_fault().
> > 
> > The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e.,
> > npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(),
> > kvm_handle_page_fault()). They either check if the return value is > 0 (as
> > in npf_interception()) or pass it further to vcpu_run() to decide whether
> > to break out of the kernel loop and return to the user when r <= 0.
> > Therefore, returning any positive value is equivalent to returning 1.
> > 
> > Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure
> > a positive return value.
> > 
> > This is a preparation to allow TDX's EPT violation handler to check the
> > RET_PF* value and retry internally for RET_PF_RETRY.
> > 
> > No functional changes are intended.
> 
> Any reason why this can't go ahead of the TDX patches? Seems pretty generic
> cleanup.
Hmm, I wouldn't consider this a cleanup, as returning 1 to indicate continuation
of the kernel loop is a well-established convention in arch/x86/kvm.

Returning a positive RET_PF_* in this patch is primarily a preparatory step for
the subsequent patch,
"KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY".

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters TDX_OPERAND_BUSY
  2025-01-14 22:24   ` Edgecombe, Rick P
@ 2025-01-15  4:59     ` Yan Zhao
  0 siblings, 0 replies; 27+ messages in thread
From: Yan Zhao @ 2025-01-15  4:59 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	Lindgren, Tony, linux-kernel@vger.kernel.org, Chatre, Reinette,
	dmatlack@google.com, Hunter, Adrian, Yamahata, Isaku,
	isaku.yamahata@gmail.com

On Wed, Jan 15, 2025 at 06:24:34AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-01-13 at 10:10 +0800, Yan Zhao wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index d0dc3200fa37..1cf3ef0faff7 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -3024,13 +3024,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >  	}
> >  
> >  	ret = 0;
> > -	do {
> > -		err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
> > -				       pfn_to_hpa(page_to_pfn(page)),
> > -				       &entry, &level_state);
> > -	} while (err == TDX_ERROR_SEPT_BUSY);
> > +	err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
> > +			       pfn_to_hpa(page_to_pfn(page)),
> > +			       &entry, &level_state);
> >  	if (err) {
> > -		ret = -EIO;
> > +		ret = unlikely(err & TDX_OPERAND_BUSY) ? -EBUSY : -EIO;
> >  		goto out;
> >  	}
> 
> Should we just squash this into "KVM: TDX: Add an ioctl to create initial guest
> memory"? I guess we get a little more specific log history on this corner as a
> separate patch, but seems strange to add and remove a loop before it even can
> get exercised.
No problem to me.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry
  2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
                   ` (7 preceding siblings ...)
  2025-01-14 22:27 ` [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Edgecombe, Rick P
@ 2025-01-15 16:43 ` Paolo Bonzini
  2025-01-16  0:52   ` Yan Zhao
  8 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2025-01-15 16:43 UTC (permalink / raw)
  To: Yan Zhao, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, isaku.yamahata

On 1/13/25 03:09, Yan Zhao wrote:
> This series aims to provide a clean solution to avoid the blind retries in
> the previous hack [1] in "TDX MMU Part 2," following the initial
> discussions to [2], further discussions in the RFC, and the PUCK [3].
> 
> A full analysis of the lock status for each SEAMCALL relevant to KVM is
> available at [4].
> 
> This series categorizes the SEPT-related SEAMCALLs (used for page
> installation and uninstallation) into three groups:
> 
> Group 1: tdh_mem_page_add().
>         - Invoked only during TD build time.
>         - Proposal: Return -EBUSY on TDX_OPERAND_BUSY.
>         - Patch 1.
> 
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
>         - Invoked for TD runtime page installation.
>         - Proposal: Retry locally in the TDX EPT violation handler for
>           RET_PF_RETRY.
>         - Patches 2-3.
> 
> Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
>         - Invoked for page uninstallation, with KVM mmu_lock held for write.
>         - Proposal: Kick off vCPUs and no vCPU entry on TDX_OPERAND_BUSY.
>         - Patch 4.
> 
> Patches 5/6/7 are fixup patches:
> Patch 5: Return -EBUSY instead of -EAGAIN when tdh_mem_sept_add() is busy.
> Patch 6: Remove the retry loop for tdh_phymem_page_wbinvd_hkid().
> Patch 7: Warn on force_immediate_exit in tdx_vcpu_run().
> 
> Code base: kvm-coco-queue 2f30b837bf7b.
> Applies to the tail since the dependence on
> commit 8e801e55ba8f ("KVM: TDX: Handle EPT violation/misconfig exit"),
> 
> Thanks
> Yan
> 
> RFC --> v1:
> - Split patch 1 in RFC into patches 1,2,3,5, and add new fixup patches 6/7.
> - Add contention analysis of tdh_mem_page_add() in patch 1 log.
> - Provide justification in patch 2 log and add checks for RET_PF_CONTINUE.
> - Use "a per-VM flag wait_for_sept_zap + KVM_REQ_OUTSIDE_GUEST_MODE"
>    instead of an arch-specific request to prevent vCPUs from TD entry in patch 4
>    (Sean).
> 
> RFC: https://lore.kernel.org/all/20241121115139.26338-1-yan.y.zhao@intel.com
> [1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
> [2] https://lore.kernel.org/kvm/20240904030751.117579-10-rick.p.edgecombe@intel.com/
> [3] https://drive.google.com/drive/folders/1k0qOarKuZXpzRsKDtVeC5Lpl9-amJ6AJ?resourcekey=0-l9uVpVEBC34Uar1ReaqisQ
> [4] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com

Thanks, I applied this to kvm-coco-queue and patch 2 to kvm/queue.  It 
is spread all over the branch to make the dependencies clearer, so 
here's some ideas on how to include these.

Patches 6 and 7 should be squashed into the respective bases, as they 
have essentially no functional change.

For the rest, patch 1 can be treated as a fixup too, and I have two 
proposals.

First possibility, separate series:
* patches 1+5 are merged into a single patch.
* patches 3+4 become two more patches in this separate series

Second possibility, squash everything:
* patches 1+5 are squashed into the respective bases
* patches 3+4 are included in the EPT violation series

On the PUCK call I said that I prefer the first, mostly to keep track of 
who needs to handle TDX_OPERAND_BUSY, but if it makes it easier for Yan 
then feel free to go for the second.

Paolo

> Yan Zhao (7):
>    KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters
>      TDX_OPERAND_BUSY
>    KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
>    KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
>    KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
>    fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU
>      mirror page table
>    fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU
>      mirror page table
>    fixup! KVM: TDX: Implement TDX vcpu enter/exit path
> 
>   arch/x86/kvm/mmu/mmu.c          |  10 ++-
>   arch/x86/kvm/mmu/mmu_internal.h |  12 ++-
>   arch/x86/kvm/vmx/tdx.c          | 135 ++++++++++++++++++++++++++------
>   arch/x86/kvm/vmx/tdx.h          |   7 ++
>   4 files changed, 134 insertions(+), 30 deletions(-)
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry
  2025-01-15 16:43 ` Paolo Bonzini
@ 2025-01-16  0:52   ` Yan Zhao
  2025-01-16 11:07     ` Paolo Bonzini
  0 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2025-01-16  0:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: seanjc, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Wed, Jan 15, 2025 at 05:43:51PM +0100, Paolo Bonzini wrote:
> On 1/13/25 03:09, Yan Zhao wrote:
> > This series aims to provide a clean solution to avoid the blind retries in
> > the previous hack [1] in "TDX MMU Part 2," following the initial
> > discussions to [2], further discussions in the RFC, and the PUCK [3].
> > 
> > A full analysis of the lock status for each SEAMCALL relevant to KVM is
> > available at [4].
> > 
> > This series categorizes the SEPT-related SEAMCALLs (used for page
> > installation and uninstallation) into three groups:
> > 
> > Group 1: tdh_mem_page_add().
> >         - Invoked only during TD build time.
> >         - Proposal: Return -EBUSY on TDX_OPERAND_BUSY.
> >         - Patch 1.
> > 
> > Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
> >         - Invoked for TD runtime page installation.
> >         - Proposal: Retry locally in the TDX EPT violation handler for
> >           RET_PF_RETRY.
> >         - Patches 2-3.
> > 
> > Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
> >         - Invoked for page uninstallation, with KVM mmu_lock held for write.
> >         - Proposal: Kick off vCPUs and no vCPU entry on TDX_OPERAND_BUSY.
> >         - Patch 4.
> > 
> > Patches 5/6/7 are fixup patches:
> > Patch 5: Return -EBUSY instead of -EAGAIN when tdh_mem_sept_add() is busy.
> > Patch 6: Remove the retry loop for tdh_phymem_page_wbinvd_hkid().
> > Patch 7: Warn on force_immediate_exit in tdx_vcpu_run().
> > 
> > Code base: kvm-coco-queue 2f30b837bf7b.
> > Applies to the tail since the dependence on
> > commit 8e801e55ba8f ("KVM: TDX: Handle EPT violation/misconfig exit"),
> > 
> > Thanks
> > Yan
> > 
> > RFC --> v1:
> > - Split patch 1 in RFC into patches 1,2,3,5, and add new fixup patches 6/7.
> > - Add contention analysis of tdh_mem_page_add() in patch 1 log.
> > - Provide justification in patch 2 log and add checks for RET_PF_CONTINUE.
> > - Use "a per-VM flag wait_for_sept_zap + KVM_REQ_OUTSIDE_GUEST_MODE"
> >    instead of an arch-specific request to prevent vCPUs from TD entry in patch 4
> >    (Sean).
> > 
> > RFC: https://lore.kernel.org/all/20241121115139.26338-1-yan.y.zhao@intel.com
> > [1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
> > [2] https://lore.kernel.org/kvm/20240904030751.117579-10-rick.p.edgecombe@intel.com/
> > [3] https://drive.google.com/drive/folders/1k0qOarKuZXpzRsKDtVeC5Lpl9-amJ6AJ?resourcekey=0-l9uVpVEBC34Uar1ReaqisQ
> > [4] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com
> 
> Thanks, I applied this to kvm-coco-queue and patch 2 to kvm/queue.  It is
> spread all over the branch to make the dependencies clearer, so here's some
> ideas on how to include these.
> 
> Patches 6 and 7 should be squashed into the respective bases, as they have
> essentially no functional change.
> 
> For the rest, patch 1 can be treated as a fixup too, and I have two
> proposals.
> 
> First possibility, separate series:
> * patches 1+5 are merged into a single patch.
> * patches 3+4 become two more patches in this separate series
> 
> Second possibility, squash everything:
> * patches 1+5 are squashed into the respective bases
> * patches 3+4 are included in the EPT violation series
> 
> On the PUCK call I said that I prefer the first, mostly to keep track of who
> needs to handle TDX_OPERAND_BUSY, but if it makes it easier for Yan then
> feel free to go for the second.
Thank you, Paolo.

I'm ok with either way.

For the first, hmm, a bad thing is that though
tdh_mem_page_add()/tdh_mem_sept_add()/tdh_mem_page_aug() all need to handle
TDX_OPERAND_BUSY, the one for tdh_mem_page_aug() has already been squashed
into the MMU part 2.

If you like, maybe I can extract the one for tdh_mem_page_aug() and merge it
with 1+5.

Then, since patch 3 depends on the EPT violation handling, for the first option
patches 3+4 can also be included in the EPT violation series, right?

> 
> Paolo
> 
> > Yan Zhao (7):
> >    KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters
> >      TDX_OPERAND_BUSY
> >    KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
> >    KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
> >    KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
> >    fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU
> >      mirror page table
> >    fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU
> >      mirror page table
> >    fixup! KVM: TDX: Implement TDX vcpu enter/exit path
> > 
> >   arch/x86/kvm/mmu/mmu.c          |  10 ++-
> >   arch/x86/kvm/mmu/mmu_internal.h |  12 ++-
> >   arch/x86/kvm/vmx/tdx.c          | 135 ++++++++++++++++++++++++++------
> >   arch/x86/kvm/vmx/tdx.h          |   7 ++
> >   4 files changed, 134 insertions(+), 30 deletions(-)
> > 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
  2025-01-13  2:12 ` [PATCH 4/7] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
@ 2025-01-16  6:23   ` Binbin Wu
  2025-01-16  6:28     ` Binbin Wu
  2025-01-16  8:18     ` Yan Zhao
  0 siblings, 2 replies; 27+ messages in thread
From: Binbin Wu @ 2025-01-16  6:23 UTC (permalink / raw)
  To: Yan Zhao, pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, dmatlack,
	isaku.yamahata, isaku.yamahata




On 1/13/2025 10:12 AM, Yan Zhao wrote:
[...]
> +
>   /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
>   static int __tdx_reclaim_page(hpa_t pa)
>   {
> @@ -979,6 +999,14 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>   		return EXIT_FASTPATH_NONE;
>   	}
>   
> +	/*
> +	 * Wait until retry of SEPT-zap-related SEAMCALL completes before
> +	 * allowing vCPU entry to avoid contention with tdh_vp_enter() and
> +	 * TDCALLs.
> +	 */
> +	if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
> +		return EXIT_FASTPATH_EXIT_HANDLED;
> +
>   	trace_kvm_entry(vcpu, force_immediate_exit);
>   
>   	if (pi_test_on(&tdx->pi_desc)) {
> @@ -1647,15 +1675,23 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
>   		return -EINVAL;
>   
> -	do {
> +	/*
> +	 * When zapping a private page, the write lock is held, so there is no
> +	 * race with other vCPUs' SEPT operations. Races are only with
> +	 * TDH.VP.ENTER (due to 0-step mitigation) and guest TDCALLs.
> +	 */
> +	err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
> +				  &level_state);
> +	if ((err & TDX_OPERAND_BUSY)) {

It is not safe to use "err & TDX_OPERAND_BUSY".
E.g., if the error is TDX_EPT_WALK_FAILED, "err & TDX_OPERAND_BUSY" will be true.

Maybe you can add a helper to check it.

static inline bool tdx_operand_busy(u64 err)
{
	return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY;
}


>   		/*
> -		 * When zapping private page, write lock is held. So no race
> -		 * condition with other vcpu sept operation.  Race only with
> -		 * TDH.VP.ENTER.
> +		 * The second retry is expected to succeed after kicking off all
> +		 * other vCPUs and preventing them from invoking TDH.VP.ENTER.
>   		 */
> +		tdx_no_vcpus_enter_start(kvm);
>   		err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
>   					  &level_state);
> -	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
> +		tdx_no_vcpus_enter_stop(kvm);
> +	}
>   
>   	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
>   		     err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
> @@ -1726,8 +1762,12 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>   	WARN_ON_ONCE(level != PG_LEVEL_4K);
>   
>   	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
> -	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> -		return -EAGAIN;
> +	if (unlikely(err & TDX_OPERAND_BUSY)) {
Ditto.

> +		/* After no vCPUs enter, the second retry is expected to succeed */
> +		tdx_no_vcpus_enter_start(kvm);
> +		err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
> +		tdx_no_vcpus_enter_stop(kvm);
> +	}
>   	if (KVM_BUG_ON(err, kvm)) {
>   		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
>   		return -EIO;
> @@ -1770,9 +1810,13 @@ static void tdx_track(struct kvm *kvm)
>   
>   	lockdep_assert_held_write(&kvm->mmu_lock);
>   
> -	do {
> +	err = tdh_mem_track(kvm_tdx->tdr_pa);
> +	if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY) {
> +		/* After no vCPUs enter, the second retry is expected to succeed */
> +		tdx_no_vcpus_enter_start(kvm);
>   		err = tdh_mem_track(kvm_tdx->tdr_pa);
> -	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
> +		tdx_no_vcpus_enter_stop(kvm);
> +	}
>   
>
[...]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
  2025-01-16  6:23   ` Binbin Wu
@ 2025-01-16  6:28     ` Binbin Wu
  2025-01-16  8:18     ` Yan Zhao
  1 sibling, 0 replies; 27+ messages in thread
From: Binbin Wu @ 2025-01-16  6:28 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	dmatlack, isaku.yamahata, isaku.yamahata


On 1/16/2025 2:23 PM, Binbin Wu wrote:
>
>
>
> On 1/13/2025 10:12 AM, Yan Zhao wrote:
> [...]
>> +
>>   /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
>>   static int __tdx_reclaim_page(hpa_t pa)
>>   {
>> @@ -979,6 +999,14 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>           return EXIT_FASTPATH_NONE;
>>       }
>>   +    /*
>> +     * Wait until retry of SEPT-zap-related SEAMCALL completes before
>> +     * allowing vCPU entry to avoid contention with tdh_vp_enter() and
>> +     * TDCALLs.
>> +     */
>> +    if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
>> +        return EXIT_FASTPATH_EXIT_HANDLED;
>> +
>>       trace_kvm_entry(vcpu, force_immediate_exit);
>>         if (pi_test_on(&tdx->pi_desc)) {
>> @@ -1647,15 +1675,23 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>>       if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
>>           return -EINVAL;
>>   -    do {
>> +    /*
>> +     * When zapping a private page, the write lock is held, so there is no
>> +     * race with other vCPUs' SEPT operations. Races are only with
>> +     * TDH.VP.ENTER (due to 0-step mitigation) and guest TDCALLs.
>> +     */
>> +    err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
>> +                  &level_state);
>> +    if ((err & TDX_OPERAND_BUSY)) {
>
> It is not safe to use "err & TDX_OPERAND_BUSY".
> E.g., if the error is TDX_EPT_WALK_FAILED, "err & TDX_OPERAND_BUSY" will be true.
>
> Maybe you can add a helper to check it.
>
> static inline bool tdx_operand_busy(u64 err)
> {
> 	return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY;
> }
>
Don't know why some spaces were dropped by thunderbird. :-(

>
>>           /*
>> -         * When zapping private page, write lock is held. So no race
>> -         * condition with other vcpu sept operation.  Race only with
>> -         * TDH.VP.ENTER.
>> +         * The second retry is expected to succeed after kicking off all
>> +         * other vCPUs and preventing them from invoking TDH.VP.ENTER.
>>            */
>> +        tdx_no_vcpus_enter_start(kvm);
>>           err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
>>                         &level_state);
>> -    } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
>> +        tdx_no_vcpus_enter_stop(kvm);
>> +    }
>>         if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
>>                err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
>> @@ -1726,8 +1762,12 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>>       WARN_ON_ONCE(level != PG_LEVEL_4K);
>>         err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
>> -    if (unlikely(err == TDX_ERROR_SEPT_BUSY))
>> -        return -EAGAIN;
>> +    if (unlikely(err & TDX_OPERAND_BUSY)) {
> Ditto.
>
>> +        /* After no vCPUs enter, the second retry is expected to succeed */
>> +        tdx_no_vcpus_enter_start(kvm);
>> +        err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
>> +        tdx_no_vcpus_enter_stop(kvm);
>> +    }
>>       if (KVM_BUG_ON(err, kvm)) {
>>           pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
>>           return -EIO;
>> @@ -1770,9 +1810,13 @@ static void tdx_track(struct kvm *kvm)
>>         lockdep_assert_held_write(&kvm->mmu_lock);
>>   -    do {
>> +    err = tdh_mem_track(kvm_tdx->tdr_pa);
>> +    if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY) {
>> +        /* After no vCPUs enter, the second retry is expected to succeed */
>> +        tdx_no_vcpus_enter_start(kvm);
>>           err = tdh_mem_track(kvm_tdx->tdr_pa);
>> -    } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
>> +        tdx_no_vcpus_enter_stop(kvm);
>> +    }
>>
> [...]
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/7] fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2025-01-13  2:13 ` [PATCH 5/7] fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Yan Zhao
@ 2025-01-16  6:30   ` Binbin Wu
  0 siblings, 0 replies; 27+ messages in thread
From: Binbin Wu @ 2025-01-16  6:30 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	dmatlack, isaku.yamahata, isaku.yamahata




On 1/13/2025 10:13 AM, Yan Zhao wrote:
> Return -EBUSY instead of -EAGAIN when tdh_mem_sept_add() encounters any err
> of TDX_OPERAND_BUSY.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 09677a4cd605..4fb9faca5db2 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1740,8 +1740,9 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>   
>   	err = tdh_mem_sept_add(to_kvm_tdx(kvm)->tdr_pa, gpa, tdx_level, hpa, &entry,
>   			       &level_state);
> -	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> -		return -EAGAIN;
> +	if (unlikely(err & TDX_OPERAND_BUSY))

Some issue here as mentioned in the previous patch.

> +		return -EBUSY;
> +
>   	if (KVM_BUG_ON(err, kvm)) {
>   		pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state);
>   		return -EIO;


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/7] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
  2025-01-16  6:23   ` Binbin Wu
  2025-01-16  6:28     ` Binbin Wu
@ 2025-01-16  8:18     ` Yan Zhao
  1 sibling, 0 replies; 27+ messages in thread
From: Yan Zhao @ 2025-01-16  8:18 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, seanjc, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	dmatlack, isaku.yamahata, isaku.yamahata

On Thu, Jan 16, 2025 at 02:23:24PM +0800, Binbin Wu wrote:
> 
> 
> 
> On 1/13/2025 10:12 AM, Yan Zhao wrote:
> [...]
> > +
> >   /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
> >   static int __tdx_reclaim_page(hpa_t pa)
> >   {
> > @@ -979,6 +999,14 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> >   		return EXIT_FASTPATH_NONE;
> >   	}
> > +	/*
> > +	 * Wait until retry of SEPT-zap-related SEAMCALL completes before
> > +	 * allowing vCPU entry to avoid contention with tdh_vp_enter() and
> > +	 * TDCALLs.
> > +	 */
> > +	if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
> > +		return EXIT_FASTPATH_EXIT_HANDLED;
> > +
> >   	trace_kvm_entry(vcpu, force_immediate_exit);
> >   	if (pi_test_on(&tdx->pi_desc)) {
> > @@ -1647,15 +1675,23 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >   	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
> >   		return -EINVAL;
> > -	do {
> > +	/*
> > +	 * When zapping a private page, the write lock is held, so there is no
> > +	 * race with other vCPUs' SEPT operations. Races are only with
> > +	 * TDH.VP.ENTER (due to 0-step mitigation) and guest TDCALLs.
> > +	 */
> > +	err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
> > +				  &level_state);
> > +	if ((err & TDX_OPERAND_BUSY)) {
> 
> It is not safe to use "err & TDX_OPERAND_BUSY".
> E.g., if the error is TDX_EPT_WALK_FAILED, "err & TDX_OPERAND_BUSY" will be true.
> 
> Maybe you can add a helper to check it.
> 
> static inline bool tdx_operand_busy(u64 err)
> {
> 	return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY;
> }
> 
Good catch!
Thanks!

> 
> >   		/*
> > -		 * When zapping private page, write lock is held. So no race
> > -		 * condition with other vcpu sept operation.  Race only with
> > -		 * TDH.VP.ENTER.
> > +		 * The second retry is expected to succeed after kicking off all
> > +		 * other vCPUs and preventing them from invoking TDH.VP.ENTER.
> >   		 */
> > +		tdx_no_vcpus_enter_start(kvm);
> >   		err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
> >   					  &level_state);
> > -	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
> > +		tdx_no_vcpus_enter_stop(kvm);
> > +	}
> >   	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
> >   		     err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
> > @@ -1726,8 +1762,12 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> >   	WARN_ON_ONCE(level != PG_LEVEL_4K);
> >   	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
> > -	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> > -		return -EAGAIN;
> > +	if (unlikely(err & TDX_OPERAND_BUSY)) {
> Ditto.
> 
> > +		/* After no vCPUs enter, the second retry is expected to succeed */
> > +		tdx_no_vcpus_enter_start(kvm);
> > +		err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
> > +		tdx_no_vcpus_enter_stop(kvm);
> > +	}
> >   	if (KVM_BUG_ON(err, kvm)) {
> >   		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
> >   		return -EIO;
> > @@ -1770,9 +1810,13 @@ static void tdx_track(struct kvm *kvm)
> >   	lockdep_assert_held_write(&kvm->mmu_lock);
> > -	do {
> > +	err = tdh_mem_track(kvm_tdx->tdr_pa);
> > +	if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY) {
> > +		/* After no vCPUs enter, the second retry is expected to succeed */
> > +		tdx_no_vcpus_enter_start(kvm);
> >   		err = tdh_mem_track(kvm_tdx->tdr_pa);
> > -	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
> > +		tdx_no_vcpus_enter_stop(kvm);
> > +	}
> > 
> [...]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry
  2025-01-16  0:52   ` Yan Zhao
@ 2025-01-16 11:07     ` Paolo Bonzini
  2025-01-17  9:52       ` Yan Zhao
  0 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2025-01-16 11:07 UTC (permalink / raw)
  To: Yan Zhao
  Cc: seanjc, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Thu, Jan 16, 2025 at 1:53 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> For the first, hmm, a bad thing is that though
> tdh_mem_page_add()/tdh_mem_sept_add()/tdh_mem_page_aug() all need to handle
> TDX_OPERAND_BUSY, the one for tdh_mem_page_aug() has already been squashed
> into the MMU part 2.
>
> If you like, maybe I can extract the one for tdh_mem_page_aug() and merge it
> with 1+5.

That works for me, but if it's easier for you to merge the fixups in
the respective base patches that's okay too.

Paolo



* Re: [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry
  2025-01-16 11:07     ` Paolo Bonzini
@ 2025-01-17  9:52       ` Yan Zhao
  0 siblings, 0 replies; 27+ messages in thread
From: Yan Zhao @ 2025-01-17  9:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: seanjc, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Thu, Jan 16, 2025 at 12:07:00PM +0100, Paolo Bonzini wrote:
> On Thu, Jan 16, 2025 at 1:53 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > For the first, hmm, a bad thing is that though
> > tdh_mem_page_add()/tdh_mem_sept_add()/tdh_mem_page_aug() all need to handle
> > TDX_OPERAND_BUSY, the one for tdh_mem_page_aug() has already been squashed
> > into the MMU part 2.
> >
> > If you like, maybe I can extract the one for tdh_mem_page_aug() and merge it
> > with 1+5.
> 
> That works for me, but if it's easier for you to merge the fixups in
> the respective base patches that's okay too.
Then I'll merge them into the respective base patches to save effort and make
the latter cleaner :)

Thank you, Paolo!


* Re: [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
  2025-01-13  2:12 ` [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY Yan Zhao
@ 2025-01-17 21:14   ` Sean Christopherson
  2025-01-20  8:05     ` Yan Zhao
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2025-01-17 21:14 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Mon, Jan 13, 2025, Yan Zhao wrote:
> @@ -1884,7 +1904,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>  	}
>  
>  	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
> -	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> +
> +	while (1) {
> +		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> +
> +		if (ret != RET_PF_RETRY || !local_retry)
> +			break;
> +
> +		/*
> +		 * Break and keep the orig return value.

Wrap at 80.

> +		 * Signal & irq handling will be done later in vcpu_run()

Please don't use "&" as shorthand.  It saves all of two characters.  That said,
I don't see any point in adding this comment, if the reader can't follow the
logic of this code, these comments aren't going to help them.  And the comment
about vcpu_run() in particular is misleading, as posted interrupts aren't truly
handled by vcpu_run(), rather they're handled by hardware (although KVM does send
a self-IPI).

> +		 */
> +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)

This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
Ah, it's a local function.  At a glance, I don't see any harm in exposing that
to TDX.

> +			break;
> +
> +		cond_resched();
> +	}

Nit, IMO this reads better as:

	do {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> 	} while (ret == RET_PF_RETRY && local_retry &&
		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));

> +	return ret;
>  }
>  
>  int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> -- 
> 2.43.2
> 


* Re: [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
  2025-01-17 21:14   ` Sean Christopherson
@ 2025-01-20  8:05     ` Yan Zhao
  2025-01-25  1:23       ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2025-01-20  8:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Yan Zhao wrote:
> > @@ -1884,7 +1904,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >  	}
> >  
> >  	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
> > -	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> > +
> > +	while (1) {
> > +		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > +
> > +		if (ret != RET_PF_RETRY || !local_retry)
> > +			break;
> > +
> > +		/*
> > +		 * Break and keep the orig return value.
> 
> Wrap at 80.
> 
> > +		 * Signal & irq handling will be done later in vcpu_run()
> 
> Please don't use "&" as shorthand.  It saves all of two characters.  That said,
Got it!

> I don't see any point in adding this comment, if the reader can't follow the
> logic of this code, these comments aren't going to help them.  And the comment
> about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> a self-IPI).
What about below version?

"
Bail out of the local retry
- for pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
  --> kvm_handle_signal_exit() can exit to userspace for signal handling.
- for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
  be re-executed for interrupt injection through posted interrupt.
- for pending nmi or KVM_REQ_NMI, so that vcpu_enter_guest() will be
  re-executed to process and pend NMI to the TDX module. KVM always regards NMI
  as allowed and the TDX module will inject it when NMI is allowed in the TD.
"

> > +		 */
> > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> 
> This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
Yes. However, vt_nmi_allowed() is always true for TDs.
For interrupt, tdx_interrupt_allowed() is always true unless the exit reason is
EXIT_REASON_HLT. For the EPT violation handler, the exit reason should not be
EXIT_REASON_HLT.

> Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> to TDX.
Besides the fact that kvm_vcpu_has_events() is a local function, the
considerations for checking "pi_has_pending_interrupt() ||
kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" instead are that
(1) the two are effectively equivalent for TDs (as nested is not supported yet)
(2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
    pending. However, vt_inject_exception() is NULL for TDs.

> > +			break;
> > +
> > +		cond_resched();
> > +	}
> 
> Nit, IMO this reads better as:
> 
> 	do {
> 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> 	} while (ret == RET_PF_RETRY && local_retry &&
> 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
>
Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
when ret != RET_PF_RETRY or when the gpa is shared.

> > +	return ret;
> >  }
> >  
> >  int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> > -- 
> > 2.43.2
> > 


* Re: [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
  2025-01-20  8:05     ` Yan Zhao
@ 2025-01-25  1:23       ` Sean Christopherson
  2025-01-27  9:24         ` Yan Zhao
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2025-01-25  1:23 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Mon, Jan 20, 2025, Yan Zhao wrote:
> On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > I don't see any point in adding this comment, if the reader can't follow the
> > logic of this code, these comments aren't going to help them.  And the comment
> > about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> > handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> > a self-IPI).
> What about below version?
> 
> "
> Bail out the local retry
> - for pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
>   --> kvm_handle_signal_exit() can exit to userspace for signal handling.

Eh, pending signals should be self-explanatory.

> - for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
>   be re-executed for interrupt injection through posted interrupt.
> - for pending nmi or KVM_REQ_NMI, so that vcpu_enter_guest() will be
>   re-executed to process and pend NMI to the TDX module. KVM always regards NMI
>   as allowed and the TDX module will inject it when NMI is allowed in the TD.
> "
> 
> > > +		 */
> > > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> > 
> > This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> > matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
> Yes. However, vt_nmi_allowed() is always true for TDs.
> For interrupt, tdx_interrupt_allowed() is always true unless the exit reason is
> EXIT_REASON_HLT. For the EPT violation handler, the exit reason should not be
> EXIT_REASON_HLT.
> 
> > Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> > to TDX.
> Besides that kvm_vcpu_has_events() is a local function, the consideration to
> check "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
> vcpu->arch.nmi_pending" instead that

*sigh*

>   PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in
  the TDX module. Also, TDX doesn't allow KVM to request NMI-window exit
  directly. When there is already one NMI pending in the TDX module, i.e. it
  has not been delivered to TDX guest yet, if there is NMI pending in KVM,
  collapse the pending NMI in KVM into the one pending in the TDX module.
  Such collapse is OK considering on X86 bare metal, multiple NMIs could
  collapse into one NMI, e.g. when NMI is blocked by SMI.  It's OS's
  responsibility to poll all NMI sources in the NMI handler to avoid missing
  handling of some NMI events. More details can be found in the changelog of
  the patch "KVM: TDX: Implement methods to inject NMI".

That's probably fine?  But it's still unfortunate that TDX manages to be different
at almost every opportunity :-(

> (1) the two are effectively equivalent for TDs (as nested is not supported yet)

> If they're all equivalent, then *not* open coding is desirable, IMO.  Ah, but
they aren't equivalent.  tdx_protected_apic_has_interrupt() also checks whatever
TD_VCPU_STATE_DETAILS_NON_ARCH is.

	vcpu_state_details =
		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);

	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);

That code needs a comment, because depending on the behavior of that field, it
might not even be correct.

> (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
>     pending. However, vt_inject_exception() is NULL for TDs.

Wouldn't a pending exception be a KVM bug?

The bigger oddity, which I think is worth calling out, is that because KVM can't
determine if IRQs (or NMIs) are blocked at the time of the EPT violation, false
positives are inevitable.  I.e. KVM may re-enter the guest even if the IRQ/NMI
can't be delivered.  Call *that* out, and explain why it's fine.

> > > +			break;
> > > +
> > > +		cond_resched();
> > > +	}
> > 
> > Nit, IMO this reads better as:
> > 
> > 	do {
> > 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > 	} while (ret == RET_PF_RETRY && local_retry &&
> > 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> >
> Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
> when ret != RET_PF_RETRY or when gpa is shared .

Hrm, right.  Maybe this?  Dunno if that's any better.

	ret = 0;
	do {
		if (ret)
			cond_resched();

		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
	} while (...)


* Re: [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
  2025-01-25  1:23       ` Sean Christopherson
@ 2025-01-27  9:24         ` Yan Zhao
  2025-01-27 17:04           ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2025-01-27  9:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> On Mon, Jan 20, 2025, Yan Zhao wrote:
> > On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> > > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > > I don't see any point in adding this comment, if the reader can't follow the
> > > logic of this code, these comments aren't going to help them.  And the comment
> > > about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> > > handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> > > a self-IPI).
> > What about below version?
> > 
> > "
> > Bail out the local retry
> > - for pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
> >   --> kvm_handle_signal_exit() can exit to userspace for signal handling.
> 
> Eh, pending signals should be self-explanatory.
Ok.

> 
> > - for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
> >   be re-executed for interrupt injection through posted interrupt.
> > - for pending nmi or KVM_REQ_NMI, so that vcpu_enter_guest() will be
> >   re-executed to process and pend NMI to the TDX module. KVM always regards NMI
> >   as allowed and the TDX module will inject it when NMI is allowed in the TD.
> > "
> > 
> > > > +		 */
> > > > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > > > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> > > 
> > > This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> > > matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
> > Yes. However, vt_nmi_allowed() is always true for TDs.
> > For interrupt, tdx_interrupt_allowed() is always true unless the exit reason is
> > EXIT_REASON_HLT. For the EPT violation handler, the exit reason should not be
> > EXIT_REASON_HLT.
> > 
> > > Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> > > to TDX.
> > Besides that kvm_vcpu_has_events() is a local function, the consideration to
> > check "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
> > vcpu->arch.nmi_pending" instead that
> 
> *sigh*
> 
> >   PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in
>   the TDX module. Also, TDX doesn't allow KVM to request NMI-window exit
>   directly. When there is already one NMI pending in the TDX module, i.e. it
>   has not been delivered to TDX guest yet, if there is NMI pending in KVM,
>   collapse the pending NMI in KVM into the one pending in the TDX module.
>   Such collapse is OK considering on X86 bare metal, multiple NMIs could
>   collapse into one NMI, e.g. when NMI is blocked by SMI.  It's OS's
>   responsibility to poll all NMI sources in the NMI handler to avoid missing
>   handling of some NMI events. More details can be found in the changelog of
>   the patch "KVM: TDX: Implement methods to inject NMI".
> 
> That's probably fine?  But it's still unfortunate that TDX manages to be different
> at almost every opportunity :-(
:(

> > (1) the two are effectively equivalent for TDs (as nested is not supported yet)
> 
> > If they're all equivalent, then *not* open coding is desirable, IMO.  Ah, but
> they aren't equivalent.  tdx_protected_apic_has_interrupt() also checks whatever
> TD_VCPU_STATE_DETAILS_NON_ARCH is.
> 
> 	vcpu_state_details =
> 		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
> 
> 	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
Right. That's why I asked that in the PUCK.

The "tdx_vcpu_state_details_intr_pending(vcpu_state_details)" checks if there's
a pending interrupt that can be recognized.
As in the code snippet of TDX module:

            case MD_TDVPS_VCPU_STATE_DETAILS_FIELD_CODE:
            {
                // Calculate virtual interrupt pending status
                vcpu_state_t vcpu_state_details;
                guest_interrupt_status_t interrupt_status;
                uint64_t interrupt_status_raw;

                set_vm_vmcs_as_active(md_ctx.tdvps_ptr, 0);

                ia32_vmread(VMX_GUEST_INTERRUPT_STATUS_ENCODE, &interrupt_status_raw);
                interrupt_status.raw = (uint16_t)interrupt_status_raw;
                vcpu_state_details.raw = 0ULL;
                if ((interrupt_status.rvi & 0xF0UL) > (md_ctx.tdvps_ptr->vapic.apic[PPR_INDEX] & 0xF0UL))
                {
                    vcpu_state_details.vmxip = 1ULL;
                }
                read_value = vcpu_state_details.raw;

                break;
            }

My previous consideration is that
when there's a pending interrupt that can be recognized, given the current VM
Exit reason is EPT violation, the next VM Entry will not deliver the interrupt
since the condition to recognize and deliver interrupt is unchanged after the
EPT violation VM Exit.
So checking the pending interrupt brings only false positives. Checking the PID
is different: the vector in the PID could arrive after the EPT violation VM
Exit, and the PID would be cleared after VM Entry even if the interrupts are not
deliverable, so checking the PID may lead to true positives and fewer false
positives.

But I understand your point now. As checking PID can also be false positive, it
would be no harm to introduce another source of false positive.

So using kvm_vcpu_has_events() looks like a kind of trade-off?
kvm_vcpu_has_events() can make TDX's code less special but might leave the
local vCPU more vulnerable to the 0-step mitigation, especially when interrupts
are disabled in the guest.


> 
> That code needs a comment, because depending on the behavior of that field, it
> might not even be correct.
> 
> > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> >     pending. However, vt_inject_exception() is NULL for TDs.
> 
> Wouldn't a pending exception be a KVM bug?
Hmm, yes, it should be.
Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
an exception for TDs, this should not occur while VCPU_RUN is in progress.

> The bigger oddity, which I think is worth calling out, is that because KVM can't
> determine if IRQs (or NMIs) are blocked at the time of the EPT violation, false
> positives are inevitable.  I.e. KVM may re-enter the guest even if the IRQ/NMI
> can't be delivered.  Call *that* out, and explain why it's fine.
Will do.

It's fine because:
In the worst-case scenario where every break is a false positive, the local vCPU
becomes more vulnerable to 0-step mitigation due to the lack of local retries.
If the local vCPU triggers a 0-step mitigation in tdh_vp_enter(), the remote
vCPUs would either retry while contending for the SEPT tree lock if they are in
the page installation path, or retry after waiting for the local vCPU to be
kicked off and ensuring no tdh_vp_enter() is occurring in the local vCPU.

Moreover, even if there's no break-out for interrupt injection in the local
vCPU, it could still suffer 0-step mitigation, e.g. when one RIP faults on more
than 6 GFNs.

> > > > +			break;
> > > > +
> > > > +		cond_resched();
> > > > +	}
> > > 
> > > Nit, IMO this reads better as:
> > > 
> > > 	do {
> > > 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > > 	} while (ret == RET_PF_RETRY && local_retry &&
> > > 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> > >
> > Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
> > when ret != RET_PF_RETRY or when gpa is shared .
> 
> Hrm, right.  Maybe this?  Dunno if that's any better.
> 
> 	ret = 0;
> 	do {
> 		if (ret)
> 			cond_resched();
Not sure either :)
This is clearer; however, it brings one more check to the normal path.

> 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> 	} while (...)


* Re: [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
  2025-01-27  9:24         ` Yan Zhao
@ 2025-01-27 17:04           ` Sean Christopherson
  2025-02-05  7:34             ` Yan Zhao
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2025-01-27 17:04 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Mon, Jan 27, 2025, Yan Zhao wrote:
> On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> My previous consideration is that
> when there's a pending interrupt that can be recognized, given the current VM
> Exit reason is EPT violation, the next VM Entry will not deliver the interrupt
> since the condition to recognize and deliver interrupt is unchanged after the
> EPT violation VM Exit.
> So checking pending interrupt brings only false positive, which is unlike
> checking PID that the vector in the PID could arrive after the EPT violation VM
> Exit and PID would be cleared after VM Entry even if the interrupts are not
> deliverable. So checking PID may lead to true positive and less false positive.
> 
> But I understand your point now. As checking PID can also be false positive, it
> would be no harm to introduce another source of false positive.

Yep.  FWIW, I agree that checking VMXIP is theoretically "worse", in the sense
that it's much more likely to be a false positive.  Off the top of my head, the
only time VMXIP will be set with RFLAGS.IF=1 on an EPT Violation is if the EPT
violation happens in an STI or MOV_SS/POP_SS shadow.

> So using kvm_vcpu_has_events() looks like a kind of trade-off?
> kvm_vcpu_has_events() can make TDX's code less special but might lead to the
> local vCPU more vulnerable to the 0-step mitigation, especially when interrupts
> are disabled in the guest.

Ya.  I think it's well worth using kvm_vcpu_has_events() though.  In practice,
observing VMXIP=1 with RFLAGS.IF=0 on an EPT Violation means the guest is accessing
memory for the first time in atomic kernel context.  That alone seems highly
unlikely.  Add in that triggering retry requires an uncommon race, and the overall
probability becomes minuscule.

> > That code needs a comment, because depending on the behavior of that field, it
> > might not even be correct.
> > 
> > > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> > >     pending. However, vt_inject_exception() is NULL for TDs.
> > 
> > Wouldn't a pending exception be a KVM bug?
> Hmm, yes, it should be.
> Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
> an exception for TDs, 

I thought TDX didn't support synthetic #MCs?

> this should not occur while VCPU_RUN is in progress.

Not "should not", _cannot_, because they're mutually exclusive.


* Re: [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
  2025-01-27 17:04           ` Sean Christopherson
@ 2025-02-05  7:34             ` Yan Zhao
  0 siblings, 0 replies; 27+ messages in thread
From: Yan Zhao @ 2025-02-05  7:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, isaku.yamahata

On Mon, Jan 27, 2025 at 09:04:59AM -0800, Sean Christopherson wrote:
> On Mon, Jan 27, 2025, Yan Zhao wrote:
> > On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> > My previous consideration is that
> > when there's a pending interrupt that can be recognized, given the current VM
> > Exit reason is EPT violation, the next VM Entry will not deliver the interrupt
> > since the condition to recognize and deliver interrupt is unchanged after the
> > EPT violation VM Exit.
> > So checking pending interrupt brings only false positive, which is unlike
> > checking PID that the vector in the PID could arrive after the EPT violation VM
> > Exit and PID would be cleared after VM Entry even if the interrupts are not
> > deliverable. So checking PID may lead to true positive and less false positive.
> > 
> > But I understand your point now. As checking PID can also be false positive, it
> > would be no harm to introduce another source of false positive.
> 
> Yep.  FWIW, I agree that checking VMXIP is theoretically "worse", in the sense
> that it's much more likely to be a false positive.  Off the top of my head, the
> only time VMXIP will be set with RFLAGS.IF=1 on an EPT Violation is if the EPT
> violation happens in an STI or MOV_SS/POP_SS shadow.
> 
> > So using kvm_vcpu_has_events() looks like a kind of trade-off?
> > kvm_vcpu_has_events() can make TDX's code less special but might lead to the
> > local vCPU more vulnerable to the 0-step mitigation, especially when interrupts
> > are disabled in the guest.
> 
> Ya.  I think it's well worth using kvm_vcpu_has_events() though.  In practice,
> observing VMXIP=1 with RFLAGS.IF=0 on an EPT Violation means the guest is accessing
> memory for the first time in atomic kernel context.  That alone seems highly
> unlikely.  Add in that triggering retry requires an uncommon race, and the overall
> probability becomes miniscule.
Ok. Will use kvm_vcpu_has_events() instead.

> 
> > > That code needs a comment, because depending on the behavior of that field, it
> > > might not even be correct.
> > > 
> > > > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> > > >     pending. However, vt_inject_exception() is NULL for TDs.
> > > 
> > > Wouldn't a pending exception be a KVM bug?
> > Hmm, yes, it should be.
> > Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
> > an exception for TDs, 
> 
> I thought TDX didn't support synthetic #MCs?
Hmm, it's just an example to show kvm_queue_exception() can be invoked for a TD,
as it's not disallowed :)
Another example is kvm_arch_vcpu_ioctl_set_guest_debug() when
vcpu->arch.guest_state_protected is false for a DEBUG TD.

But as you said, they are not possible to be called during vcpu_run().

> > this should not occur while VCPU_RUN is in progress.
> 
> Not "should not", _cannot_, because they're mutually exclusive.
Right, "cannot" is the accurate term.


end of thread, other threads:[~2025-02-05  7:35 UTC | newest]

Thread overview: 27+ messages
2025-01-13  2:09 [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Yan Zhao
2025-01-13  2:10 ` [PATCH 1/7] KVM: TDX: Return -EBUSY when tdh_mem_page_add() encounters TDX_OPERAND_BUSY Yan Zhao
2025-01-14 22:24   ` Edgecombe, Rick P
2025-01-15  4:59     ` Yan Zhao
2025-01-13  2:11 ` [PATCH 2/7] KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault() Yan Zhao
2025-01-14 22:24   ` Edgecombe, Rick P
2025-01-15  4:58     ` Yan Zhao
2025-01-13  2:12 ` [PATCH 3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY Yan Zhao
2025-01-17 21:14   ` Sean Christopherson
2025-01-20  8:05     ` Yan Zhao
2025-01-25  1:23       ` Sean Christopherson
2025-01-27  9:24         ` Yan Zhao
2025-01-27 17:04           ` Sean Christopherson
2025-02-05  7:34             ` Yan Zhao
2025-01-13  2:12 ` [PATCH 4/7] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal Yan Zhao
2025-01-16  6:23   ` Binbin Wu
2025-01-16  6:28     ` Binbin Wu
2025-01-16  8:18     ` Yan Zhao
2025-01-13  2:13 ` [PATCH 5/7] fixup! KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Yan Zhao
2025-01-16  6:30   ` Binbin Wu
2025-01-13  2:13 ` [PATCH 6/7] " Yan Zhao
2025-01-13  2:13 ` [PATCH 7/7] fixup! KVM: TDX: Implement TDX vcpu enter/exit path Yan Zhao
2025-01-14 22:27 ` [PATCH 0/7] KVM: TDX SEPT SEAMCALL retry Edgecombe, Rick P
2025-01-15 16:43 ` Paolo Bonzini
2025-01-16  0:52   ` Yan Zhao
2025-01-16 11:07     ` Paolo Bonzini
2025-01-17  9:52       ` Yan Zhao
