public inbox for kvm@vger.kernel.org
* [PATCH 0/2] two KVM MMU fixes for TDX
@ 2025-02-17  8:55 Yan Zhao
  2025-02-17  8:56 ` [PATCH 1/2] KVM: TDX: Handle SEPT zap error due to page add error in premap Yan Zhao
  2025-02-17  8:57 ` [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead Yan Zhao
  0 siblings, 2 replies; 7+ messages in thread
From: Yan Zhao @ 2025-02-17  8:55 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Hi, 

This series contains two KVM MMU fixes for TDX, addressing two
hypothetically triggered errors:
(1) errors in tdh_mem_page_add(),
(2) fatal errors in tdh_mem_sept_add()/tdh_mem_page_aug().

Patch 1 handles the error in SEPT zap resulting from error (1).
Patch 2 fixes a possible endless retry loop in the kernel caused by error (2).
(A similar fix in SEPT SEAMCALL local retry is also required if the fix in
patch 2 looks good to you).

Neither error has been observed in any real workload yet.
The series was tested by faking the errors in the SEAMCALL wrappers while
bypassing the real SEAMCALLs.
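
For reference, the fault injection used for testing can be sketched as below.
This is a minimal userspace model, not the kernel code: the wrapper name, the
debug knob, and the error value are all hypothetical stand-ins for the real
SEAMCALL wrappers and TDX status codes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a TDX error status; real codes differ. */
#define FAKE_TDX_ERR	0x8000000000000000ULL

/* Debug knob, set only while testing the error paths. */
static int inject_page_add_error;

/*
 * Sketch of a SEAMCALL wrapper under test: when the knob is set, the real
 * SEAMCALL is bypassed and a faked error is returned to the caller, which
 * exercises the error handling added by this series.
 */
static uint64_t fake_tdh_mem_page_add(void)
{
	if (inject_page_add_error)
		return FAKE_TDX_ERR;	/* bypass the real SEAMCALL */
	return 0;			/* success */
}
```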

Thanks
Yan

Yan Zhao (2):
  KVM: TDX: Handle SEPT zap error due to page add error in premap
  KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead

 arch/x86/kvm/mmu/mmu.c |  4 +++
 arch/x86/kvm/vmx/tdx.c | 64 +++++++++++++++++++++++++++++-------------
 2 files changed, 49 insertions(+), 19 deletions(-)

-- 
2.43.2


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] KVM: TDX: Handle SEPT zap error due to page add error in premap
  2025-02-17  8:55 [PATCH 0/2] two KVM MMU fixes for TDX Yan Zhao
@ 2025-02-17  8:56 ` Yan Zhao
  2025-02-17  8:57 ` [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead Yan Zhao
  1 sibling, 0 replies; 7+ messages in thread
From: Yan Zhao @ 2025-02-17  8:56 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Move the handling of SEPT zap errors caused by unsuccessful execution of
tdh_mem_page_add() in KVM_TDX_INIT_MEM_REGION from
tdx_sept_drop_private_spte() to tdx_sept_zap_private_spte(). Introduce a
new helper function tdx_is_sept_zap_err_due_to_premap() to detect this
specific error.

During the IOCTL KVM_TDX_INIT_MEM_REGION, KVM premaps leaf SPTEs in the
mirror page table before the corresponding entry in the private page table
is successfully installed by tdh_mem_page_add(). If an error occurs during
the invocation of tdh_mem_page_add(), a mismatch between the mirror and
private page tables results in SEAMCALLs for SEPT zap returning the error
code TDX_EPT_ENTRY_STATE_INCORRECT.

The error TDX_EPT_WALK_FAILED is not possible because, during
KVM_TDX_INIT_MEM_REGION, KVM only premaps leaf SPTEs after successfully
mapping non-leaf SPTEs. Unlike leaf SPTEs, there is no mismatch in non-leaf
PTEs between the mirror and private page tables. Therefore, during zap,
SEAMCALLs should find an empty leaf entry in the private EPT, leading to
the error TDX_EPT_ENTRY_STATE_INCORRECT instead of TDX_EPT_WALK_FAILED.

Since tdh_mem_range_block() is always invoked before tdh_mem_page_remove(),
move the handling of SEPT zap errors from tdx_sept_drop_private_spte() to
tdx_sept_zap_private_spte(). In tdx_sept_zap_private_spte(), return 0 for
errors due to premap, skipping the remaining zap SEAMCALLs, which are
unnecessary in that case. Return 1 when there is no error, allowing the
remaining zap SEAMCALLs to proceed.

The failure of tdh_mem_page_add() is uncommon and has not been observed in
real workloads. Currently, this failure is only hypothetically triggered by
skipping the real SEAMCALL and faking the add error in the SEAMCALL
wrapper. Note that even without this fix, the failure does not cause host
crashes or other severe issues.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 64 +++++++++++++++++++++++++++++-------------
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8cad38e8e0bc..86c0653d797e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1616,20 +1616,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		tdx_no_vcpus_enter_stop(kvm);
 	}
 
-	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
-		     err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
-		/*
-		 * Page is mapped by KVM_TDX_INIT_MEM_REGION, but hasn't called
-		 * tdh_mem_page_add().
-		 */
-		if ((!is_last_spte(entry, level) || !(entry & VMX_EPT_RWX_MASK)) &&
-		    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
-			atomic64_dec(&kvm_tdx->nr_premapped);
-			tdx_unpin(kvm, page);
-			return 0;
-		}
-	}
-
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
 		return -EIO;
@@ -1667,8 +1653,41 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
+/*
+ * Check if the error returned from a SEPT zap SEAMCALL is due to a page
+ * being mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() having
+ * been called successfully.
+ *
+ * Since tdh_mem_sept_add() must have been invoked successfully before a
+ * non-leaf entry is present in the mirrored page table, the SEPT zap related
+ * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
+ * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
+ * SEPT.
+ *
+ * Further check if the returned entry from SEPT walking is with RWX permissions
+ * to filter out anything unexpected.
+ *
+ * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
+ * level_state returned from a SEAMCALL error is the same as that passed into
+ * the SEAMCALL.
+ */
+static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
+					     u64 entry, int level)
+{
+	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
+		return false;
+
+	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
+		return false;
+
+	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
+		return false;
+
+	return true;
+}
+
 static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
-				     enum pg_level level)
+				     enum pg_level level, struct page *page)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1686,12 +1705,18 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 		tdx_no_vcpus_enter_stop(kvm);
 	}
+	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
+	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
+		atomic64_dec(&kvm_tdx->nr_premapped);
+		tdx_unpin(kvm, page);
+		return 0;
+	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
 		return -EIO;
 	}
-	return 0;
+	return 1;
 }
 
 /*
@@ -1769,6 +1794,7 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 				 enum pg_level level, kvm_pfn_t pfn)
 {
+	struct page *page = pfn_to_page(pfn);
 	int ret;
 
 	/*
@@ -1779,8 +1805,8 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
 		return -EINVAL;
 
-	ret = tdx_sept_zap_private_spte(kvm, gfn, level);
-	if (ret)
+	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
+	if (ret <= 0)
 		return ret;
 
 	/*
@@ -1789,7 +1815,7 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 */
 	tdx_track(kvm);
 
-	return tdx_sept_drop_private_spte(kvm, gfn, level, pfn_to_page(pfn));
+	return tdx_sept_drop_private_spte(kvm, gfn, level, page);
 }
 
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
-- 
2.43.2



* [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead
  2025-02-17  8:55 [PATCH 0/2] two KVM MMU fixes for TDX Yan Zhao
  2025-02-17  8:56 ` [PATCH 1/2] KVM: TDX: Handle SEPT zap error due to page add error in premap Yan Zhao
@ 2025-02-17  8:57 ` Yan Zhao
  2025-02-18 16:03   ` Sean Christopherson
  1 sibling, 1 reply; 7+ messages in thread
From: Yan Zhao @ 2025-02-17  8:57 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
fatal errors have occurred.

kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
regardless of the specific error code from tdp_mmu_set_spte_atomic(),
tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
in general cases where the only possible error code from these functions is
-EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
errors.

Since this -EIO error is also a fatal error, check for VM dead in the
kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.

The error -EIO is uncommon and has not been observed in real workloads.
Currently, it is only hypothetically triggered by bypassing the real
SEAMCALL and faking an error in the SEAMCALL wrapper.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 08ed5092c15a..3a8d735939b5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4700,6 +4700,10 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 	do {
 		if (signal_pending(current))
 			return -EINTR;
+
+		if (vcpu->kvm->vm_dead)
+			return -EIO;
+
 		cond_resched();
 		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
 	} while (r == RET_PF_RETRY);
-- 
2.43.2



* Re: [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead
  2025-02-17  8:57 ` [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead Yan Zhao
@ 2025-02-18 16:03   ` Sean Christopherson
  2025-02-19  2:17     ` Yan Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Sean Christopherson @ 2025-02-18 16:03 UTC (permalink / raw)
  To: Yan Zhao; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Mon, Feb 17, 2025, Yan Zhao wrote:
> Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
> kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
> vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
> fatal errors have occurred.
> 
> kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
> ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
> is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
> regardless of the specific error code from tdp_mmu_set_spte_atomic(),
> tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
> in general cases where the only possible error code from these functions is
> -EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
> errors.
> 
> Since this -EIO error is also a fatal error, check for VM dead in the
> kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.
> 
> The error -EIO is uncommon and has not been observed in real workloads.
> Currently, it is only hypothetically triggered by bypassing the real
> SEAMCALL and faking an error in the SEAMCALL wrapper.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 08ed5092c15a..3a8d735939b5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4700,6 +4700,10 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
>  	do {
>  		if (signal_pending(current))
>  			return -EINTR;
> +
> +		if (vcpu->kvm->vm_dead)

This needs to be READ_ONCE().  Along those lines, I think I'd prefer

		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
			return -EIO;

or

		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) 
			return -EIO;

so that if more terminal requests come along, we can bundle everything into a
single check via a selective version of kvm_request_pending().


* Re: [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead
  2025-02-18 16:03   ` Sean Christopherson
@ 2025-02-19  2:17     ` Yan Zhao
  2025-02-19 14:18       ` Sean Christopherson
  0 siblings, 1 reply; 7+ messages in thread
From: Yan Zhao @ 2025-02-19  2:17 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Tue, Feb 18, 2025 at 08:03:57AM -0800, Sean Christopherson wrote:
> On Mon, Feb 17, 2025, Yan Zhao wrote:
> > Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
> > kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
> > vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
> > fatal errors have occurred.
> > 
> > kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
> > ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
> > is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
> > regardless of the specific error code from tdp_mmu_set_spte_atomic(),
> > tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
> > in general cases where the only possible error code from these functions is
> > -EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
> > errors.
> > 
> > Since this -EIO error is also a fatal error, check for VM dead in the
> > kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.
> > 
> > The error -EIO is uncommon and has not been observed in real workloads.
> > Currently, it is only hypothetically triggered by bypassing the real
> > SEAMCALL and faking an error in the SEAMCALL wrapper.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 08ed5092c15a..3a8d735939b5 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4700,6 +4700,10 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> >  	do {
> >  		if (signal_pending(current))
> >  			return -EINTR;
> > +
> > +		if (vcpu->kvm->vm_dead)
> 
> This needs to be READ_ONCE().  Along those lines, I think I'd prefer
Indeed.

> 
> 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> 			return -EIO;
> 
> or
> 
> 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) 
> 			return -EIO;
Hmm, what's the difference between the two cases?
Paste error?

> so that if more terminal requests come along, we can bundle everything into a
> single check via a selective version of kvm_request_pending().
Makes sense!
I'll update it to
 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) 
 			return -EIO;
in v2.


* Re: [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead
  2025-02-19  2:17     ` Yan Zhao
@ 2025-02-19 14:18       ` Sean Christopherson
  2025-02-20  1:50         ` Yan Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Sean Christopherson @ 2025-02-19 14:18 UTC (permalink / raw)
  To: Yan Zhao; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Wed, Feb 19, 2025, Yan Zhao wrote:
> On Tue, Feb 18, 2025 at 08:03:57AM -0800, Sean Christopherson wrote:
> > On Mon, Feb 17, 2025, Yan Zhao wrote:
> > > Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
> > > kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
> > > vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
> > > fatal errors have occurred.
> > > 
> > > kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
> > > ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
> > > is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
> > > regardless of the specific error code from tdp_mmu_set_spte_atomic(),
> > > tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
> > > in general cases where the only possible error code from these functions is
> > > -EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
> > > errors.
> > > 
> > > Since this -EIO error is also a fatal error, check for VM dead in the
> > > kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.
> > > 
> > > The error -EIO is uncommon and has not been observed in real workloads.
> > > Currently, it is only hypothetically triggered by bypassing the real
> > > SEAMCALL and faking an error in the SEAMCALL wrapper.
> > > 
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 08ed5092c15a..3a8d735939b5 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4700,6 +4700,10 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > >  	do {
> > >  		if (signal_pending(current))
> > >  			return -EINTR;
> > > +
> > > +		if (vcpu->kvm->vm_dead)
> > 
> > This needs to be READ_ONCE().  Along those lines, I think I'd prefer
> Indeed.
> 
> > 
> > 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> > 			return -EIO;
> > 
> > or
> > 
> > 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) 
> > 			return -EIO;
> Hmm, what's the difference between the two cases?
> Paste error?

Hrm, yes.  I already forgot what I was thinking, but I believe the second one was
supposed to be:

		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
			return -EIO;

The "check" version should be fine though, i.e. clearing the request is ok,
because kvm_vcpu_ioctl() will see vcpu->kvm->vm_dead before handling KVM_RUN or
any other ioctl.


* Re: [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead
  2025-02-19 14:18       ` Sean Christopherson
@ 2025-02-20  1:50         ` Yan Zhao
  0 siblings, 0 replies; 7+ messages in thread
From: Yan Zhao @ 2025-02-20  1:50 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Wed, Feb 19, 2025 at 06:18:41AM -0800, Sean Christopherson wrote:
> On Wed, Feb 19, 2025, Yan Zhao wrote:
> > On Tue, Feb 18, 2025 at 08:03:57AM -0800, Sean Christopherson wrote:
> > > On Mon, Feb 17, 2025, Yan Zhao wrote:
> > > > Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
> > > > kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
> > > > vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
> > > > fatal errors have occurred.
> > > > 
> > > > kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
> > > > ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
> > > > is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
> > > > regardless of the specific error code from tdp_mmu_set_spte_atomic(),
> > > > tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
> > > > in general cases where the only possible error code from these functions is
> > > > -EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
> > > > errors.
> > > > 
> > > > Since this -EIO error is also a fatal error, check for VM dead in the
> > > > kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.
> > > > 
> > > > The error -EIO is uncommon and has not been observed in real workloads.
> > > > Currently, it is only hypothetically triggered by bypassing the real
> > > > SEAMCALL and faking an error in the SEAMCALL wrapper.
> > > > 
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 08ed5092c15a..3a8d735939b5 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -4700,6 +4700,10 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > >  	do {
> > > >  		if (signal_pending(current))
> > > >  			return -EINTR;
> > > > +
> > > > +		if (vcpu->kvm->vm_dead)
> > > 
> > > This needs to be READ_ONCE().  Along those lines, I think I'd prefer
> > Indeed.
> > 
> > > 
> > > 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> > > 			return -EIO;
> > > 
> > > or
> > > 
> > > 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) 
> > > 			return -EIO;
> > Hmm, what's the difference between the two cases?
> > Paste error?
> 
> Hrm, yes.  I already forgot what I was thinking, but I believe the second one was
> supposed to be:
> 
> 		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> 			return -EIO;
> 
> The "check" version should be fine though, i.e. clearing the request is ok,
> because kvm_vcpu_ioctl() will see vcpu->kvm->vm_dead before handling KVM_RUN or
> any other ioctl.
Got it!
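
[To make the distinction discussed above concrete, here is a small userspace
model of the two helpers' semantics. This is a simplification: the kernel's
kvm_test_request()/kvm_check_request() operate on vcpu->requests with atomic
bitops and memory barriers; the sketch only illustrates the read vs.
read-and-clear difference.]

```c
#include <assert.h>

/* Simplified stand-in for struct kvm_vcpu's request bitmap. */
struct vcpu_model {
	unsigned long requests;
};

#define REQ_VM_DEAD	(1UL << 0)

/* Like kvm_test_request(): only reads the pending bit. */
static int model_test_request(struct vcpu_model *vcpu, unsigned long req)
{
	return !!(vcpu->requests & req);
}

/* Like kvm_check_request(): reads and clears, i.e. consumes the request. */
static int model_check_request(struct vcpu_model *vcpu, unsigned long req)
{
	if (vcpu->requests & req) {
		vcpu->requests &= ~req;	/* request is consumed here */
		return 1;
	}
	return 0;
}
```

As the thread concludes, consuming the request is acceptable in
kvm_tdp_map_page() because vcpu->kvm->vm_dead is still checked before any
subsequent ioctl is handled.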


end of thread, other threads:[~2025-02-20  1:52 UTC | newest]

Thread overview: 7+ messages
2025-02-17  8:55 [PATCH 0/2] two KVM MMU fixes for TDX Yan Zhao
2025-02-17  8:56 ` [PATCH 1/2] KVM: TDX: Handle SEPT zap error due to page add error in premap Yan Zhao
2025-02-17  8:57 ` [PATCH 2/2] KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead Yan Zhao
2025-02-18 16:03   ` Sean Christopherson
2025-02-19  2:17     ` Yan Zhao
2025-02-19 14:18       ` Sean Christopherson
2025-02-20  1:50         ` Yan Zhao
