linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] KVM: Drop vm_dead, pivot on vm_bugged for -EIO
@ 2025-07-29 19:33 Sean Christopherson
  2025-07-29 19:33 ` [PATCH 1/5] KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests Sean Christopherson
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-07-29 19:33 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini
  Cc: linux-arm-kernel, kvmarm, kvm, linux-kernel, Adrian Hunter,
	Vishal Annapurve, Xiaoyao Li, Rick Edgecombe, Nikolay Borisov

Drop vm_dead and instead reject ioctls based only on vm_bugged.  Checking
vm_dead (or vm_bugged) is inherently racy due as it's not protected by any
locks.  For vm_bugged, imperfection is a-ok as the goal is purely to limit
the damage done by a kernel/hardware bug.  But rejecting ioclts based on
vm_dead is dangerous as it gives us a false sense of security, e.g. see the
race found by syzbot in commit ecf371f8b02d ("KVM: SVM: Reject SEV{-ES}
intra host migration if vCPU creation is in-flight").

This series was motivated by the last patch, a.k.a. KVM_TDX_TERMINATE_VM.
I applied a slightly different version of that patch for 6.17[*], but I'm
reposting it with the vm_dead changes due to Paolo's question about whether
or not we should have a generic KVM_TERMINATE_VM; dropping vm_dead doesn't
make much sense if we want to add KVM_TERMINATE_VM.

[*] https://lore.kernel.org/all/20250725220713.264711-13-seanjc@google.com

Sean Christopherson (5):
  KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests
  KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  KVM: Reject ioctls only if the VM is bugged, not simply marked dead
  KVM: selftests: Use for-loop to handle all successful SEV migrations
  KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM

 Documentation/virt/kvm/x86/intel-tdx.rst      | 22 ++++++++-
 arch/arm64/kvm/arm.c                          |  2 +-
 arch/arm64/kvm/vgic/vgic-init.c               |  2 +-
 arch/x86/include/uapi/asm/kvm.h               |  7 ++-
 arch/x86/kvm/mmu/mmu.c                        |  2 +-
 arch/x86/kvm/vmx/tdx.c                        | 45 +++++++++++++------
 arch/x86/kvm/vmx/tdx.h                        |  1 +
 arch/x86/kvm/x86.c                            |  2 +-
 include/linux/kvm_host.h                      | 11 +++--
 .../selftests/kvm/x86/sev_migrate_tests.c     | 34 ++++++--------
 virt/kvm/kvm_main.c                           | 10 ++---
 11 files changed, 90 insertions(+), 48 deletions(-)


base-commit: beafd7ecf2255e8b62a42dc04f54843033db3d24
-- 
2.50.1.552.g942d659e1b-goog



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests
  2025-07-29 19:33 [PATCH 0/5] KVM: Drop vm_dead, pivot on vm_bugged for -EIO Sean Christopherson
@ 2025-07-29 19:33 ` Sean Christopherson
  2025-07-29 19:33 ` [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation Sean Christopherson
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-07-29 19:33 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini
  Cc: linux-arm-kernel, kvmarm, kvm, linux-kernel, Adrian Hunter,
	Vishal Annapurve, Xiaoyao Li, Rick Edgecombe, Nikolay Borisov

Use kvm_test_request() instead of kvm_check_request() when querying
KVM_REQ_VM_DEAD, i.e. don't clear KVM_REQ_VM_DEAD, as the entire purpose
of KVM_REQ_VM_DEAD is to prevent the vCPU from enterring the guest ever
again, even if userspace insists on redoing KVM_RUN.

Ensuring KVM_REQ_VM_DEAD is never cleared will allow relaxing KVM's rule
that ioctls can't be invoked on dead VMs, to only disallow ioctls if the
VM is bugged, i.e. if KVM hit a KVM_BUG_ON().

Opportunistically add compile-time assertions to guard against clearing
KVM_REQ_VM_DEAD through the standard APIs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/arm.c     | 2 +-
 arch/x86/kvm/mmu/mmu.c   | 2 +-
 arch/x86/kvm/vmx/tdx.c   | 2 +-
 arch/x86/kvm/x86.c       | 2 +-
 include/linux/kvm_host.h | 9 +++++++--
 5 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index f946926716b0..2fdc48c0fc4d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1013,7 +1013,7 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
 static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 {
 	if (kvm_request_pending(vcpu)) {
-		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
 			return -EIO;
 
 		if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e838cb6c9e1..d09bd236a92d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4915,7 +4915,7 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		if (signal_pending(current))
 			return -EINTR;
 
-		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
 			return -EIO;
 
 		cond_resched();
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 66744f5768c8..3e0d4edee849 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2010,7 +2010,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 		if (kvm_vcpu_has_events(vcpu) || signal_pending(current))
 			break;
 
-		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
 			ret = -EIO;
 			break;
 		}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a1c49bc681c4..1700df68f12a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10649,7 +10649,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	bool req_immediate_exit = false;
 
 	if (kvm_request_pending(vcpu)) {
-		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
 			r = -EIO;
 			goto out;
 		}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 15656b7fba6c..627054d27222 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2261,13 +2261,18 @@ static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
 	return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
 }
 
-static inline void kvm_clear_request(int req, struct kvm_vcpu *vcpu)
+static __always_inline void kvm_clear_request(int req, struct kvm_vcpu *vcpu)
 {
+	BUILD_BUG_ON(req == KVM_REQ_VM_DEAD);
+
 	clear_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
 }
 
-static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 {
+	/* Once a VM is dead, it needs to stay dead. */
+	BUILD_BUG_ON(req == KVM_REQ_VM_DEAD);
+
 	if (kvm_test_request(req, vcpu)) {
 		kvm_clear_request(req, vcpu);
 
-- 
2.50.1.552.g942d659e1b-goog



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 19:33 [PATCH 0/5] KVM: Drop vm_dead, pivot on vm_bugged for -EIO Sean Christopherson
  2025-07-29 19:33 ` [PATCH 1/5] KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests Sean Christopherson
@ 2025-07-29 19:33 ` Sean Christopherson
  2025-07-29 22:27   ` Edgecombe, Rick P
  2025-07-30  2:07   ` Xiaoyao Li
  2025-07-29 19:33 ` [PATCH 3/5] KVM: Reject ioctls only if the VM is bugged, not simply marked dead Sean Christopherson
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-07-29 19:33 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini
  Cc: linux-arm-kernel, kvmarm, kvm, linux-kernel, Adrian Hunter,
	Vishal Annapurve, Xiaoyao Li, Rick Edgecombe, Nikolay Borisov

Exit to userspace with -EFAULT and a valid MEMORY_FAULT exit if a vCPU
hits an unexpected pending S-EPT Violation instead of marking the VM dead.
While it's unlikely the VM can continue on, whether or not to terminate
the VM is not KVM's decision to make.

Set memory_fault.size to zero to communicate to userspace that reported
fault is "bad", and to effectively terminate the VM if userspace blindly
treats the exit as a conversion attempt (KVM_SET_MEMORY_ATTRIBUTES will
fail with -EINVAL if the size is zero).

Opportunistically delete the pr_warn(), which could be abused to spam the
kernel log, and is largely useless outside of interact debug as it doesn't
specify which VM encountered a failure.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3e0d4edee849..c2ef03f39c32 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 
 	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
 		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
-			pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
-				gpa, vcpu->vcpu_id);
-			kvm_vm_dead(vcpu->kvm);
-			return -EIO;
+			kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true);
+			return -EFAULT;
 		}
 		/*
 		 * Always treat SEPT violations as write faults.  Ignore the
-- 
2.50.1.552.g942d659e1b-goog



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/5] KVM: Reject ioctls only if the VM is bugged, not simply marked dead
  2025-07-29 19:33 [PATCH 0/5] KVM: Drop vm_dead, pivot on vm_bugged for -EIO Sean Christopherson
  2025-07-29 19:33 ` [PATCH 1/5] KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests Sean Christopherson
  2025-07-29 19:33 ` [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation Sean Christopherson
@ 2025-07-29 19:33 ` Sean Christopherson
  2025-07-30  1:20   ` Chao Gao
  2025-07-29 19:33 ` [PATCH 4/5] KVM: selftests: Use for-loop to handle all successful SEV migrations Sean Christopherson
  2025-07-29 19:33 ` [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Sean Christopherson
  4 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-07-29 19:33 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini
  Cc: linux-arm-kernel, kvmarm, kvm, linux-kernel, Adrian Hunter,
	Vishal Annapurve, Xiaoyao Li, Rick Edgecombe, Nikolay Borisov

Relax the protection against interacting with a buggy KVM to only reject
ioctls if the VM is bugged, i.e. allow userspace to invoke ioctls if KVM
deliberately terminated the VM.  Drop kvm.vm_dead as there are no longer
any readers, and KVM shouldn't rely on vm_dead for functional correctness.
The only functional guarantees provided by kvm_vm_dead() come by way of
KVM_REQ_VM_DEAD, which ensures that vCPU won't re-enter the guest.

Practically speaking, this only affects x86, which uses kvm_vm_dead() to
prevent running a VM whose resources have been partially freed or has run
one or more of its vCPUs into an architecturally defined state.  In these
cases, there is no (known) danger to KVM, the goal is purely to prevent
entering the guest.

As evidenced by commit ecf371f8b02d ("KVM: SVM: Reject SEV{-ES} intra host
migration if vCPU creation is in-flight"), the restriction on invoking
ioctls only blocks _new_ ioctls.  I.e. KVM mustn't rely on blocking ioctls
for functional safety (whereas KVM_REQ_VM_DEAD is guaranteed to prevent
vCPUs from entering the guest).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/vgic/vgic-init.c                     |  2 +-
 include/linux/kvm_host.h                            |  2 --
 tools/testing/selftests/kvm/x86/sev_migrate_tests.c |  5 +----
 virt/kvm/kvm_main.c                                 | 10 +++++-----
 4 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index eb1205654ac8..c2033bae73b2 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -612,7 +612,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
 	mutex_unlock(&kvm->arch.config_lock);
 out_slots:
 	if (ret)
-		kvm_vm_dead(kvm);
+		kvm_vm_bugged(kvm);
 
 	mutex_unlock(&kvm->slots_lock);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 627054d27222..fa97d71577b5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -854,7 +854,6 @@ struct kvm {
 	u32 dirty_ring_size;
 	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
-	bool vm_dead;
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
@@ -894,7 +893,6 @@ struct kvm {
 
 static inline void kvm_vm_dead(struct kvm *kvm)
 {
-	kvm->vm_dead = true;
 	kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
 }
 
diff --git a/tools/testing/selftests/kvm/x86/sev_migrate_tests.c b/tools/testing/selftests/kvm/x86/sev_migrate_tests.c
index 0a6dfba3905b..0580bee5888e 100644
--- a/tools/testing/selftests/kvm/x86/sev_migrate_tests.c
+++ b/tools/testing/selftests/kvm/x86/sev_migrate_tests.c
@@ -87,10 +87,7 @@ static void test_sev_migrate_from(bool es)
 		sev_migrate_from(dst_vms[i], dst_vms[i - 1]);
 
 	/* Migrate the guest back to the original VM. */
-	ret = __sev_migrate_from(src_vm, dst_vms[NR_MIGRATE_TEST_VMS - 1]);
-	TEST_ASSERT(ret == -1 && errno == EIO,
-		    "VM that was migrated from should be dead. ret %d, errno: %d", ret,
-		    errno);
+	sev_migrate_from(src_vm, dst_vms[NR_MIGRATE_TEST_VMS - 1]);
 
 	kvm_vm_free(src_vm);
 	for (i = 0; i < NR_MIGRATE_TEST_VMS; ++i)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6c07dd423458..f1f69e10a371 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4408,7 +4408,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	struct kvm_fpu *fpu = NULL;
 	struct kvm_sregs *kvm_sregs = NULL;
 
-	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
+	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
 		return -EIO;
 
 	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
@@ -4651,7 +4651,7 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
 	void __user *argp = compat_ptr(arg);
 	int r;
 
-	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
+	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
 		return -EIO;
 
 	switch (ioctl) {
@@ -4717,7 +4717,7 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
 {
 	struct kvm_device *dev = filp->private_data;
 
-	if (dev->kvm->mm != current->mm || dev->kvm->vm_dead)
+	if (dev->kvm->mm != current->mm || dev->kvm->vm_bugged)
 		return -EIO;
 
 	switch (ioctl) {
@@ -5139,7 +5139,7 @@ static long kvm_vm_ioctl(struct file *filp,
 	void __user *argp = (void __user *)arg;
 	int r;
 
-	if (kvm->mm != current->mm || kvm->vm_dead)
+	if (kvm->mm != current->mm || kvm->vm_bugged)
 		return -EIO;
 	switch (ioctl) {
 	case KVM_CREATE_VCPU:
@@ -5403,7 +5403,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
 	struct kvm *kvm = filp->private_data;
 	int r;
 
-	if (kvm->mm != current->mm || kvm->vm_dead)
+	if (kvm->mm != current->mm || kvm->vm_bugged)
 		return -EIO;
 
 	r = kvm_arch_vm_compat_ioctl(filp, ioctl, arg);
-- 
2.50.1.552.g942d659e1b-goog



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/5] KVM: selftests: Use for-loop to handle all successful SEV migrations
  2025-07-29 19:33 [PATCH 0/5] KVM: Drop vm_dead, pivot on vm_bugged for -EIO Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-07-29 19:33 ` [PATCH 3/5] KVM: Reject ioctls only if the VM is bugged, not simply marked dead Sean Christopherson
@ 2025-07-29 19:33 ` Sean Christopherson
  2025-07-29 19:33 ` [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Sean Christopherson
  4 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-07-29 19:33 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini
  Cc: linux-arm-kernel, kvmarm, kvm, linux-kernel, Adrian Hunter,
	Vishal Annapurve, Xiaoyao Li, Rick Edgecombe, Nikolay Borisov

Use the main for-loop in the "SEV migrate from" testcase to handle all
successful migrations, as there is nothing inherently unique about the
original source VM beyond it needing to be created as an SEV VM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/x86/sev_migrate_tests.c     | 31 +++++++++----------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/sev_migrate_tests.c b/tools/testing/selftests/kvm/x86/sev_migrate_tests.c
index 0580bee5888e..b501c916edf5 100644
--- a/tools/testing/selftests/kvm/x86/sev_migrate_tests.c
+++ b/tools/testing/selftests/kvm/x86/sev_migrate_tests.c
@@ -14,7 +14,7 @@
 #include "kselftest.h"
 
 #define NR_MIGRATE_TEST_VCPUS 4
-#define NR_MIGRATE_TEST_VMS 3
+#define NR_MIGRATE_TEST_VMS 4
 #define NR_LOCK_TESTING_THREADS 3
 #define NR_LOCK_TESTING_ITERATIONS 10000
 
@@ -72,26 +72,23 @@ static void sev_migrate_from(struct kvm_vm *dst, struct kvm_vm *src)
 
 static void test_sev_migrate_from(bool es)
 {
-	struct kvm_vm *src_vm;
-	struct kvm_vm *dst_vms[NR_MIGRATE_TEST_VMS];
-	int i, ret;
+	struct kvm_vm *vms[NR_MIGRATE_TEST_VMS];
+	int i;
 
-	src_vm = sev_vm_create(es);
-	for (i = 0; i < NR_MIGRATE_TEST_VMS; ++i)
-		dst_vms[i] = aux_vm_create(true);
-
-	/* Initial migration from the src to the first dst. */
-	sev_migrate_from(dst_vms[0], src_vm);
-
-	for (i = 1; i < NR_MIGRATE_TEST_VMS; i++)
-		sev_migrate_from(dst_vms[i], dst_vms[i - 1]);
+	vms[0] = sev_vm_create(es);
+	for (i = 1; i < NR_MIGRATE_TEST_VMS; ++i)
+		vms[i] = aux_vm_create(true);
 
-	/* Migrate the guest back to the original VM. */
-	sev_migrate_from(src_vm, dst_vms[NR_MIGRATE_TEST_VMS - 1]);
+	/*
+	 * Migrate in N times, in a chain from the initial SEV VM to each "aux"
+	 * VM, and finally back to the original SEV VM.  KVM disallows KVM_RUN
+	 * on the source after migration, but all other ioctls should succeed.
+	 */
+	for (i = 0; i < NR_MIGRATE_TEST_VMS; i++)
+		sev_migrate_from(vms[(i + 1) % NR_MIGRATE_TEST_VMS], vms[i]);
 
-	kvm_vm_free(src_vm);
 	for (i = 0; i < NR_MIGRATE_TEST_VMS; ++i)
-		kvm_vm_free(dst_vms[i]);
+		kvm_vm_free(vms[i]);
 }
 
 struct locking_thread_input {
-- 
2.50.1.552.g942d659e1b-goog



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
  2025-07-29 19:33 [PATCH 0/5] KVM: Drop vm_dead, pivot on vm_bugged for -EIO Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-07-29 19:33 ` [PATCH 4/5] KVM: selftests: Use for-loop to handle all successful SEV migrations Sean Christopherson
@ 2025-07-29 19:33 ` Sean Christopherson
  2025-08-01 13:56   ` Adrian Hunter
  4 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-07-29 19:33 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini
  Cc: linux-arm-kernel, kvmarm, kvm, linux-kernel, Adrian Hunter,
	Vishal Annapurve, Xiaoyao Li, Rick Edgecombe, Nikolay Borisov

Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
which enables more efficient reclaim of private memory.

Private memory is removed from MMU/TDP when guest_memfds are closed.  If
the HKID has not been released, the TDX VM is still in the RUNNABLE state,
and so pages must be removed using "Dynamic Page Removal" procedure (refer
to the TDX Module Base spec) which involves a number of steps:
	Block further address translation
	Exit each VCPU
	Clear Secure EPT entry
	Flush/write-back/invalidate relevant caches

However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state,
where all TDX VM pages are effectively unmapped, so pages can be reclaimed
directly.

Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
reclaim time.  For example:

  VCPUs    Size (GB)    Before (secs)    After (secs)
      4           18		   72              24
     32          107		  517             134
     64          400		 5539             467

Add kvm_tdx_capabilities.supported_caps along with KVM_TDX_CAP_TERMINATE_VM
to advertise support to userspace.  Use a new field in kvm_tdx_capabilities
instead of adding yet another generic KVM_CAP to avoid bleeding TDX details
into common code (and #ifdefs), and so that userspace can query TDX
capabilities in one shot.  Enumerating capabilities as a mask of bits does
limit supported_caps to 64 capabilities, but in the unlikely event KVM
needs to enumerate more than 64 TDX capabilities, there are another 249
u64 entries reserved for future expansion.

To preserve the KVM_BUG_ON() sanity check that deals with HKID assignment,
track if a TD is terminated and assert that, when an S-EPT entry is
removed, either the TD has an assigned HKID or the TD was explicitly
terminated.

Link: https://lore.kernel.org/r/Z-V0qyTn2bXdrPF7@google.com
Link: https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/x86/intel-tdx.rst | 22 +++++++++++++-
 arch/x86/include/uapi/asm/kvm.h          |  7 ++++-
 arch/x86/kvm/vmx/tdx.c                   | 37 +++++++++++++++++++-----
 arch/x86/kvm/vmx/tdx.h                   |  1 +
 4 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
index 5efac62c92c7..bcfa97e0c9e7 100644
--- a/Documentation/virt/kvm/x86/intel-tdx.rst
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -38,6 +38,7 @@ ioctl with TDX specific sub-ioctl() commands.
           KVM_TDX_INIT_MEM_REGION,
           KVM_TDX_FINALIZE_VM,
           KVM_TDX_GET_CPUID,
+          KVM_TDX_TERMINATE_VM,
 
           KVM_TDX_CMD_NR_MAX,
   };
@@ -92,7 +93,10 @@ to be configured to the TDX guest.
         __u64 kernel_tdvmcallinfo_1_r12;
         __u64 user_tdvmcallinfo_1_r12;
 
-        __u64 reserved[250];
+        /* Misc capabilities enumerated via the KVM_TDX_CAP_* namespace. */
+        __u64 supported_caps;
+
+        __u64 reserved[249];
 
         /* Configurable CPUID bits for userspace */
         struct kvm_cpuid2 cpuid;
@@ -227,6 +231,22 @@ struct kvm_cpuid2.
 	  __u32 padding[3];
   };
 
+KVM_TDX_TERMINATE_VM
+--------------------
+:Capability: KVM_TDX_CAP_TERMINATE_VM
+:Type: vm ioctl
+:Returns: 0 on success, <0 on error
+
+Release Host Key ID (HKID) to allow more efficient reclaim of private memory.
+After this, the TD is no longer in a runnable state.
+
+Using KVM_TDX_TERMINATE_VM is optional.
+
+- id: KVM_TDX_TERMINATE_VM
+- flags: must be 0
+- data: must be 0
+- hw_error: must be 0
+
 KVM TDX creation flow
 =====================
 In addition to the standard KVM flow, new TDX ioctls need to be called.  The
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0f15d683817d..e019111e2150 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -940,6 +940,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_INIT_MEM_REGION,
 	KVM_TDX_FINALIZE_VM,
 	KVM_TDX_GET_CPUID,
+	KVM_TDX_TERMINATE_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -962,6 +963,8 @@ struct kvm_tdx_cmd {
 	__u64 hw_error;
 };
 
+#define KVM_TDX_CAP_TERMINATE_VM       _BITULL(0)
+
 struct kvm_tdx_capabilities {
 	__u64 supported_attrs;
 	__u64 supported_xfam;
@@ -971,7 +974,9 @@ struct kvm_tdx_capabilities {
 	__u64 kernel_tdvmcallinfo_1_r12;
 	__u64 user_tdvmcallinfo_1_r12;
 
-	__u64 reserved[250];
+	__u64 supported_caps;
+
+	__u64 reserved[249];
 
 	/* Configurable CPUID bits for userspace */
 	struct kvm_cpuid2 cpuid;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c2ef03f39c32..ae059daf1a20 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -188,6 +188,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
 	if (!caps->supported_xfam)
 		return -EIO;
 
+	caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM;
+
 	caps->cpuid.nent = td_conf->num_cpuid_config;
 
 	caps->user_tdvmcallinfo_1_r11 =
@@ -520,6 +522,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 		goto out;
 	}
 
+	write_lock(&kvm->mmu_lock);
 	for_each_online_cpu(i) {
 		if (packages_allocated &&
 		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -544,7 +547,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	} else {
 		tdx_hkid_free(kvm_tdx);
 	}
-
+	write_unlock(&kvm->mmu_lock);
 out:
 	mutex_unlock(&tdx_lock);
 	cpus_read_unlock();
@@ -1884,13 +1887,13 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
-	/*
-	 * HKID is released after all private pages have been removed, and set
-	 * before any might be populated. Warn if zapping is attempted when
-	 * there can't be anything populated in the private EPT.
-	 */
-	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+	if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
+		KVM_BUG_ON(!to_kvm_tdx(kvm)->vm_terminated, kvm);
+		ret = tdx_reclaim_page(page);
+		if (!ret)
+			tdx_unpin(kvm, page);
+		return ret;
+	}
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
@@ -2884,6 +2887,21 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return 0;
 }
 
+static int tdx_terminate_vm(struct kvm *kvm)
+{
+	if (kvm_trylock_all_vcpus(kvm))
+		return -EBUSY;
+
+	kvm_vm_dead(kvm);
+	to_kvm_tdx(kvm)->vm_terminated = true;
+
+	kvm_unlock_all_vcpus(kvm);
+
+	tdx_mmu_release_hkid(kvm);
+
+	return 0;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -2911,6 +2929,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_FINALIZE_VM:
 		r = tdx_td_finalize(kvm, &tdx_cmd);
 		break;
+	case KVM_TDX_TERMINATE_VM:
+		r = tdx_terminate_vm(kvm);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ca39a9391db1..0abe70aa1644 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -45,6 +45,7 @@ struct kvm_tdx {
 	 * Set/unset is protected with kvm->mmu_lock.
 	 */
 	bool wait_for_sept_zap;
+	bool vm_terminated;
 };
 
 /* TDX module vCPU states */
-- 
2.50.1.552.g942d659e1b-goog



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 19:33 ` [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation Sean Christopherson
@ 2025-07-29 22:27   ` Edgecombe, Rick P
  2025-07-29 22:54     ` Sean Christopherson
  2025-07-30  5:55     ` Yan Zhao
  2025-07-30  2:07   ` Xiaoyao Li
  1 sibling, 2 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2025-07-29 22:27 UTC (permalink / raw)
  To: maz@kernel.org, pbonzini@redhat.com, seanjc@google.com,
	oliver.upton@linux.dev
  Cc: Annapurve, Vishal, Li, Xiaoyao, kvmarm@lists.linux.dev,
	Hunter, Adrian, linux-arm-kernel@lists.infradead.org,
	kvm@vger.kernel.org, nik.borisov@suse.com,
	linux-kernel@vger.kernel.org

On Tue, 2025-07-29 at 12:33 -0700, Sean Christopherson wrote:
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3e0d4edee849..c2ef03f39c32 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>  	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
>  		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
> -			pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
> -				gpa, vcpu->vcpu_id);
> -			kvm_vm_dead(vcpu->kvm);
> -			return -EIO;
> +			kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true);
> +			return -EFAULT;
>  		}
>  		/*
>  		 * Always treat SEPT violations as write faults.  Ignore the

The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an
attempt to set the mirror EPT entry when it is already present. And the
unaccepted memory access will trigger an EPT violation for a mirror PTE that is
already set. I think this is a better solution irrespective of the vm_dead
changes.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

But hmm, tangentially related, but Yan do we have a similar problem with
KVM_PRE_FAULT_MEMORY after we started setting pre_fault_allowed during TD
finalization?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 22:27   ` Edgecombe, Rick P
@ 2025-07-29 22:54     ` Sean Christopherson
  2025-07-29 22:58       ` Edgecombe, Rick P
  2025-07-30  5:55     ` Yan Zhao
  1 sibling, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-07-29 22:54 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: maz@kernel.org, pbonzini@redhat.com, oliver.upton@linux.dev,
	Vishal Annapurve, Xiaoyao Li, kvmarm@lists.linux.dev,
	Adrian Hunter, linux-arm-kernel@lists.infradead.org,
	kvm@vger.kernel.org, nik.borisov@suse.com,
	linux-kernel@vger.kernel.org

On Tue, Jul 29, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-29 at 12:33 -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 3e0d4edee849..c2ef03f39c32 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >  
> >  	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
> >  		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
> > -			pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
> > -				gpa, vcpu->vcpu_id);
> > -			kvm_vm_dead(vcpu->kvm);
> > -			return -EIO;
> > +			kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true);
> > +			return -EFAULT;
> >  		}
> >  		/*
> >  		 * Always treat SEPT violations as write faults.  Ignore the
> 
> The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an
> attempt to set the mirror EPT entry when it is already present. And the
> unaccepted memory access will trigger an EPT violation for a mirror PTE that is
> already set. I think this is a better solution irrespective of the vm_dead
> changes.

In that case, this change will expose KVM to the KVM_BUG_ON(), because nothing
prevents userspace from re-running the vCPU.  Which KVM_BUG_ON() exactly gets
hit?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 22:54     ` Sean Christopherson
@ 2025-07-29 22:58       ` Edgecombe, Rick P
  2025-07-29 23:08         ` Sean Christopherson
  2025-07-30  5:45         ` Yan Zhao
  0 siblings, 2 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2025-07-29 22:58 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: linux-kernel@vger.kernel.org, oliver.upton@linux.dev,
	Annapurve, Vishal, Li, Xiaoyao, kvmarm@lists.linux.dev,
	Hunter, Adrian, maz@kernel.org,
	linux-arm-kernel@lists.infradead.org, pbonzini@redhat.com,
	nik.borisov@suse.com, kvm@vger.kernel.org

On Tue, 2025-07-29 at 15:54 -0700, Sean Christopherson wrote:
> > The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an
> > attempt to set the mirror EPT entry when it is already present. And the
> > unaccepted memory access will trigger an EPT violation for a mirror PTE that
> > is
> > already set. I think this is a better solution irrespective of the vm_dead
> > changes.
> 
> In that case, this change will expose KVM to the KVM_BUG_ON(), because nothing
> prevents userspace from re-running the vCPU. 

If userspace runs the vCPU again then an EPT violation gets triggered again,
which again gets kicked out to userspace. The new check will prevent it from
getting into the fault handler, right?

>  Which KVM_BUG_ON() exactly gets hit?

Should be:

static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t
sptep,
						 gfn_t gfn, u64 old_spte,
						 u64 new_spte, int level)
{
	bool was_present = is_shadow_present_pte(old_spte);
	bool is_present = is_shadow_present_pte(new_spte);
	bool is_leaf = is_present && is_last_spte(new_spte, level);
	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
	int ret = 0;

	KVM_BUG_ON(was_present, kvm);


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 22:58       ` Edgecombe, Rick P
@ 2025-07-29 23:08         ` Sean Christopherson
  2025-07-29 23:13           ` Edgecombe, Rick P
  2025-07-30  5:45         ` Yan Zhao
  1 sibling, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-07-29 23:08 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: linux-kernel@vger.kernel.org, oliver.upton@linux.dev,
	Vishal Annapurve, Xiaoyao Li, kvmarm@lists.linux.dev,
	Adrian Hunter, maz@kernel.org,
	linux-arm-kernel@lists.infradead.org, pbonzini@redhat.com,
	nik.borisov@suse.com, kvm@vger.kernel.org

On Tue, Jul 29, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-29 at 15:54 -0700, Sean Christopherson wrote:
> > > The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an
> > > attempt to set the mirror EPT entry when it is already present. And the
> > > unaccepted memory access will trigger an EPT violation for a mirror PTE
> > > that is already set. I think this is a better solution irrespective of
> > > the vm_dead changes.
> > 
> > In that case, this change will expose KVM to the KVM_BUG_ON(), because nothing
> > prevents userspace from re-running the vCPU. 
> 
> If userspace runs the vCPU again then an EPT violation gets triggered again,
> which again gets kicked out to userspace. The new check will prevent it from
> getting into the fault handler, right?

Yes?  But I'm confused about why you mentioned vm_dead, and why you're calling
this a "new check".  This effectively does two things: drops kvm_vm_dead() and
switches from EOI => EFAULT.  _If_ setting vm_dead was necessary, then we have a
problem.

I assume by "The vm_dead was added" you really mean "forcing an exit to userspace",
and that kvm_vm_dead()+EIO was a somewhat arbitrary way of forcing an exit?

> >  Which KVM_BUG_ON() exactly gets hit?
> 
> Should be:
> 
> static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t
> sptep,
> 						 gfn_t gfn, u64 old_spte,
> 						 u64 new_spte, int level)
> {
> 	bool was_present = is_shadow_present_pte(old_spte);
> 	bool is_present = is_shadow_present_pte(new_spte);
> 	bool is_leaf = is_present && is_last_spte(new_spte, level);
> 	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> 	int ret = 0;
> 
> 	KVM_BUG_ON(was_present, kvm);

Yeah, I don't see how that can be reach in this scenario. 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 23:08         ` Sean Christopherson
@ 2025-07-29 23:13           ` Edgecombe, Rick P
  0 siblings, 0 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2025-07-29 23:13 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, oliver.upton@linux.dev, Annapurve, Vishal,
	Li, Xiaoyao, linux-kernel@vger.kernel.org, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	Hunter, Adrian, nik.borisov@suse.com, pbonzini@redhat.com

On Tue, 2025-07-29 at 16:08 -0700, Sean Christopherson wrote:
> > If userspace runs the vCPU again then an EPT violation gets triggered again,
> > which again gets kicked out to userspace. The new check will prevent it from
> > getting into the fault handler, right?
> 
> Yes?  But I'm confused about why you mentioned vm_dead, and why you're calling
> this a "new check".  This effectively does two things: drops kvm_vm_dead() and
> switches from EOI => EFAULT.  _If_ setting vm_dead was necessary, then we have
> a
> problem.
> 
> I assume by "The vm_dead was added" you really mean "forcing an exit to
> userspace",
> and that kvm_vm_dead()+EIO was a somewhat arbitrary way of forcing an exit?

Sorry, yes vm_dead prevents an EPT violation loop but not the KVM_BUG_ON(). The
whole if clause prevents the KVM_BUG_ON(). Your patch prevents the ept violation
loop in a better way.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] KVM: Reject ioctls only if the VM is bugged, not simply marked dead
  2025-07-29 19:33 ` [PATCH 3/5] KVM: Reject ioctls only if the VM is bugged, not simply marked dead Sean Christopherson
@ 2025-07-30  1:20   ` Chao Gao
  0 siblings, 0 replies; 21+ messages in thread
From: Chao Gao @ 2025-07-30  1:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, Adrian Hunter, Vishal Annapurve,
	Xiaoyao Li, Rick Edgecombe, Nikolay Borisov

On Tue, Jul 29, 2025 at 12:33:38PM -0700, Sean Christopherson wrote:
>Relax the protection against interacting with a buggy KVM to only reject
>ioctls if the VM is bugged, i.e. allow userspace to invoke ioctls if KVM
>deliberately terminated the VM.  Drop kvm.vm_dead as there are no longer
>any readers, and KVM shouldn't rely on vm_dead for functional correctness.
>The only functional guarantees provided by kvm_vm_dead() come by way of
>KVM_REQ_VM_DEAD, which ensures that vCPU won't re-enter the guest.

If ioctls are allowed for dead VMs, would it be possible for userspace to
create a new vCPU and attempt to enter a dead VM? is this something KVM
should prevent?

>
>Practically speaking, this only affects x86, which uses kvm_vm_dead() to
>prevent running a VM whose resources have been partially freed or has run
>one or more of its vCPUs into an architecturally defined state.  In these
						  ^^^ undefined?

>cases, there is no (known) danger to KVM, the goal is purely to prevent
>entering the guest.
>
>As evidenced by commit ecf371f8b02d ("KVM: SVM: Reject SEV{-ES} intra host
>migration if vCPU creation is in-flight"), the restriction on invoking
>ioctls only blocks _new_ ioctls.  I.e. KVM mustn't rely on blocking ioctls
>for functional safety (whereas KVM_REQ_VM_DEAD is guaranteed to prevent
>vCPUs from entering the guest).


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 19:33 ` [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation Sean Christopherson
  2025-07-29 22:27   ` Edgecombe, Rick P
@ 2025-07-30  2:07   ` Xiaoyao Li
  2025-07-30  6:04     ` Yan Zhao
  1 sibling, 1 reply; 21+ messages in thread
From: Xiaoyao Li @ 2025-07-30  2:07 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Paolo Bonzini
  Cc: linux-arm-kernel, kvmarm, kvm, linux-kernel, Adrian Hunter,
	Vishal Annapurve, Rick Edgecombe, Nikolay Borisov

On 7/30/2025 3:33 AM, Sean Christopherson wrote:
> Exit to userspace with -EFAULT and a valid MEMORY_FAULT exit if a vCPU
> hits an unexpected pending S-EPT Violation instead of marking the VM dead.
> While it's unlikely the VM can continue on, whether or not to terminate
> the VM is not KVM's decision to make.
> 
> Set memory_fault.size to zero to communicate to userspace that reported
> fault is "bad", and to effectively terminate the VM if userspace blindly
> treats the exit as a conversion attempt (KVM_SET_MEMORY_ATTRIBUTES will
> fail with -EINVAL if the size is zero).

This sets a special contract on size zero.

I had a patch internally, which introduce a new exit type:

+               /* KVM_EXIT_GUEST_ERROR */
+               struct {
+  #define KVM_GUEST_ERROR_TDX_ACCESS_PENDING_PAGE      0
+                       __u32 error_type;
+                       __u32 ndata;
+                       __u64 data[16];
+               } guest_error;

how about it?

> Opportunistically delete the pr_warn(), which could be abused to spam the
> kernel log, and is largely useless outside of interact debug as it doesn't
> specify which VM encountered a failure.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3e0d4edee849..c2ef03f39c32 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>   
>   	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
>   		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
> -			pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
> -				gpa, vcpu->vcpu_id);
> -			kvm_vm_dead(vcpu->kvm);
> -			return -EIO;
> +			kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true);
> +			return -EFAULT;
>   		}
>   		/*
>   		 * Always treat SEPT violations as write faults.  Ignore the



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 22:58       ` Edgecombe, Rick P
  2025-07-29 23:08         ` Sean Christopherson
@ 2025-07-30  5:45         ` Yan Zhao
  1 sibling, 0 replies; 21+ messages in thread
From: Yan Zhao @ 2025-07-30  5:45 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, linux-kernel@vger.kernel.org,
	oliver.upton@linux.dev, Annapurve, Vishal, Li, Xiaoyao,
	kvmarm@lists.linux.dev, Hunter, Adrian, maz@kernel.org,
	linux-arm-kernel@lists.infradead.org, pbonzini@redhat.com,
	nik.borisov@suse.com, kvm@vger.kernel.org

On Tue, Jul 29, 2025 at 10:58:02PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2025-07-29 at 15:54 -0700, Sean Christopherson wrote:
> > > The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an
> > > attempt to set the mirror EPT entry when it is already present. And the
> > > unaccepted memory access will trigger an EPT violation for a mirror PTE that
> > > is
> > > already set. I think this is a better solution irrespective of the vm_dead
> > > changes.
> > 
> > In that case, this change will expose KVM to the KVM_BUG_ON(), because nothing
> > prevents userspace from re-running the vCPU. 
> 
> If userspace runs the vCPU again then an EPT violation gets triggered again,
> which again gets kicked out to userspace. The new check will prevent it from
> getting into the fault handler, right?
> 
> >  Which KVM_BUG_ON() exactly gets hit?
> 
> Should be:
> 
> static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t
> sptep,
> 						 gfn_t gfn, u64 old_spte,
> 						 u64 new_spte, int level)
> {
> 	bool was_present = is_shadow_present_pte(old_spte);
> 	bool is_present = is_shadow_present_pte(new_spte);
> 	bool is_leaf = is_present && is_last_spte(new_spte, level);
> 	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> 	int ret = 0;
> 
> 	KVM_BUG_ON(was_present, kvm);
Yes, this KVM_BUG_ON() could be triggered in an earlier code if
tdx_is_sept_violation_unexpected_pending() wasn't checked to prevent repeated
EPT faults.

But the KVM_BUG_ON() is no longer reachable after the commit cc7ed3358e41 ("KVM:
x86/mmu: Always set SPTE's dirty bit if it's created as writable").

With the current code, even without putting VM to dead, the worst case is the
endless invocation of EPT fault handler, returning RET_PF_SPURIOUS.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-29 22:27   ` Edgecombe, Rick P
  2025-07-29 22:54     ` Sean Christopherson
@ 2025-07-30  5:55     ` Yan Zhao
  2025-07-30 12:59       ` Edgecombe, Rick P
  1 sibling, 1 reply; 21+ messages in thread
From: Yan Zhao @ 2025-07-30  5:55 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: maz@kernel.org, pbonzini@redhat.com, seanjc@google.com,
	oliver.upton@linux.dev, Annapurve, Vishal, Li, Xiaoyao,
	kvmarm@lists.linux.dev, Hunter, Adrian,
	linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org

On Tue, Jul 29, 2025 at 10:27:34PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2025-07-29 at 12:33 -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 3e0d4edee849..c2ef03f39c32 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >  
> >  	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
> >  		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
> > -			pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
> > -				gpa, vcpu->vcpu_id);
> > -			kvm_vm_dead(vcpu->kvm);
> > -			return -EIO;
> > +			kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true);
> > +			return -EFAULT;
> >  		}
> >  		/*
> >  		 * Always treat SEPT violations as write faults.  Ignore the
> 
> The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an
> attempt to set the mirror EPT entry when it is already present. And the
> unaccepted memory access will trigger an EPT violation for a mirror PTE that is
> already set. I think this is a better solution irrespective of the vm_dead
> changes.
> 
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> But hmm, tangentially related, but Yan do we have a similar problem with
> KVM_PRE_FAULT_MEMORY after we started setting pre_fault_allowed during TD
> finalization?
Sean's commit 6385d01eec16 ("KVM: x86/mmu: Don't overwrite shadow-present MMU
SPTEs when prefaulting") should have prevented repeated invocation of
set_external_spte_present() with prefaulted entries.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-30  2:07   ` Xiaoyao Li
@ 2025-07-30  6:04     ` Yan Zhao
  0 siblings, 0 replies; 21+ messages in thread
From: Yan Zhao @ 2025-07-30  6:04 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Sean Christopherson, Marc Zyngier, Oliver Upton, Paolo Bonzini,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, Adrian Hunter,
	Vishal Annapurve, Rick Edgecombe, Nikolay Borisov

On Wed, Jul 30, 2025 at 10:07:36AM +0800, Xiaoyao Li wrote:
> On 7/30/2025 3:33 AM, Sean Christopherson wrote:
> > Exit to userspace with -EFAULT and a valid MEMORY_FAULT exit if a vCPU
> > hits an unexpected pending S-EPT Violation instead of marking the VM dead.
> > While it's unlikely the VM can continue on, whether or not to terminate
> > the VM is not KVM's decision to make.
> > 
> > Set memory_fault.size to zero to communicate to userspace that reported
> > fault is "bad", and to effectively terminate the VM if userspace blindly
> > treats the exit as a conversion attempt (KVM_SET_MEMORY_ATTRIBUTES will
> > fail with -EINVAL if the size is zero).
> 
> This sets a special contract on size zero.
+1.

Or would it be good to use pr_warn_once() to indicate that the VM termination
reason is "Guest access before accepting" if a new exit type is not specified?

This info is straightforward and helpful in cases when a TD is accidentally
killed.

> I had a patch internally, which introduce a new exit type:
> 
> +               /* KVM_EXIT_GUEST_ERROR */
> +               struct {
> +  #define KVM_GUEST_ERROR_TDX_ACCESS_PENDING_PAGE      0
> +                       __u32 error_type;
> +                       __u32 ndata;
> +                       __u64 data[16];
> +               } guest_error;
> 
> how about it?
> 
> > Opportunistically delete the pr_warn(), which could be abused to spam the
> > kernel log, and is largely useless outside of interact debug as it doesn't
> > specify which VM encountered a failure.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/vmx/tdx.c | 6 ++----
> >   1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 3e0d4edee849..c2ef03f39c32 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >   	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
> >   		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
> > -			pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
> > -				gpa, vcpu->vcpu_id);
> > -			kvm_vm_dead(vcpu->kvm);
> > -			return -EIO;
> > +			kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true);
> > +			return -EFAULT;
> >   		}
> >   		/*
> >   		 * Always treat SEPT violations as write faults.  Ignore the
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation
  2025-07-30  5:55     ` Yan Zhao
@ 2025-07-30 12:59       ` Edgecombe, Rick P
  0 siblings, 0 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2025-07-30 12:59 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: linux-kernel@vger.kernel.org, seanjc@google.com,
	oliver.upton@linux.dev, Annapurve, Vishal, Li, Xiaoyao,
	kvmarm@lists.linux.dev, Hunter, Adrian, maz@kernel.org,
	pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, nik.borisov@suse.com

On Wed, 2025-07-30 at 13:55 +0800, Yan Zhao wrote:
> > But hmm, tangentially related, but Yan do we have a similar problem with
> > KVM_PRE_FAULT_MEMORY after we started setting pre_fault_allowed during TD
> > finalization?
> Sean's commit 6385d01eec16 ("KVM: x86/mmu: Don't overwrite shadow-present MMU
> SPTEs when prefaulting") should have prevented repeated invocation of
> set_external_spte_present() with prefaulted entries.

Ok, great.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
  2025-07-29 19:33 ` [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Sean Christopherson
@ 2025-08-01 13:56   ` Adrian Hunter
  2025-08-01 16:44     ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Adrian Hunter @ 2025-08-01 13:56 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Paolo Bonzini
  Cc: linux-arm-kernel, kvmarm, kvm, linux-kernel, Vishal Annapurve,
	Xiaoyao Li, Rick Edgecombe, Nikolay Borisov, Zhao, Yan Y,
	Huang, Kai, binbin.wu@linux.intel.com

On 29/07/2025 22:33, Sean Christopherson wrote:
> +static int tdx_terminate_vm(struct kvm *kvm)
> +{
> +	if (kvm_trylock_all_vcpus(kvm))
> +		return -EBUSY;
> +
> +	kvm_vm_dead(kvm);
> +	to_kvm_tdx(kvm)->vm_terminated = true;
> +
> +	kvm_unlock_all_vcpus(kvm);
> +
> +	tdx_mmu_release_hkid(kvm);
> +
> +	return 0;
> +}

As I think I mentioned when removing vm_dead first came up,
I think we need more checks.  I spent some time going through
the code and came up with what is below:

First, we need to avoid TDX VCPU sub-IOCTLs from racing with
tdx_mmu_release_hkid().  But having any TDX sub-IOCTL run after
KVM_TDX_TERMINATE_VM raises questions of what might happen, so
it is much simpler to understand, if that is not possible.
There are 3 options:

1. Require that KVM_TDX_TERMINATE_VM is valid only if
kvm_tdx->state == TD_STATE_RUNNABLE.  Since currently all
the TDX sub-IOCTLs are for initialization, that would block
the opportunity for any to run after KVM_TDX_TERMINATE_VM.

2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl()

3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl()

[ Note cannot check is_hkid_assigned() because that is racy ]

Secondly, I suggest we avoid SEAMCALLs that will fail and
result in KVM_BUG_ON() if HKID has been released.

There are 2 groups of those: MMU-related and TDVPS_ACCESSORS.

For the MMU-related, the following 2 functions should return
an error immediately if vm_terminated:

	tdx_sept_link_private_spt()
	tdx_sept_set_private_spte()

For that not be racy, extra synchronization is needed so that
vm_terminated can be reliably checked when holding mmu lock
i.e.

static int tdx_terminate_vm(struct kvm *kvm)
{
	if (kvm_trylock_all_vcpus(kvm))
		return -EBUSY;

	kvm_vm_dead(kvm);
+
+       write_lock(&kvm->mmu_lock);
	to_kvm_tdx(kvm)->vm_terminated = true;
+       write_unlock(&kvm->mmu_lock);

	kvm_unlock_all_vcpus(kvm);

	tdx_mmu_release_hkid(kvm);

	return 0;
}

Finally, there are 2 TDVPS_ACCESSORS that need avoiding:

	tdx_load_mmu_pgd()
		skip td_vmcs_write64() if vm_terminated

	tdx_protected_apic_has_interrupt()
		skip td_state_non_arch_read64() if vm_terminated




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
  2025-08-01 13:56   ` Adrian Hunter
@ 2025-08-01 16:44     ` Sean Christopherson
  2025-08-03 17:41       ` Adrian Hunter
  2025-08-06  6:06       ` Chao Gao
  0 siblings, 2 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-08-01 16:44 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, Vishal Annapurve, Xiaoyao Li,
	Rick Edgecombe, Nikolay Borisov, Yan Y Zhao, Kai Huang,
	binbin.wu@linux.intel.com

+Chao

On Fri, Aug 01, 2025, Adrian Hunter wrote:
> On 29/07/2025 22:33, Sean Christopherson wrote:
> > +static int tdx_terminate_vm(struct kvm *kvm)
> > +{
> > +	if (kvm_trylock_all_vcpus(kvm))
> > +		return -EBUSY;
> > +
> > +	kvm_vm_dead(kvm);
> > +	to_kvm_tdx(kvm)->vm_terminated = true;
> > +
> > +	kvm_unlock_all_vcpus(kvm);
> > +
> > +	tdx_mmu_release_hkid(kvm);
> > +
> > +	return 0;
> > +}
> 
> As I think I mentioned when removing vm_dead first came up,
> I think we need more checks.  I spent some time going through
> the code and came up with what is below:
> 
> First, we need to avoid TDX VCPU sub-IOCTLs from racing with
> tdx_mmu_release_hkid().  But having any TDX sub-IOCTL run after
> KVM_TDX_TERMINATE_VM raises questions of what might happen, so
> it is much simpler to understand, if that is not possible.
> There are 3 options:
> 
> 1. Require that KVM_TDX_TERMINATE_VM is valid only if
> kvm_tdx->state == TD_STATE_RUNNABLE.  Since currently all
> the TDX sub-IOCTLs are for initialization, that would block
> the opportunity for any to run after KVM_TDX_TERMINATE_VM.
> 
> 2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl()
> 
> 3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl()
> 
> [ Note cannot check is_hkid_assigned() because that is racy ]
> 
> Secondly, I suggest we avoid SEAMCALLs that will fail and
> result in KVM_BUG_ON() if HKID has been released.
> 
> There are 2 groups of those: MMU-related and TDVPS_ACCESSORS.
> 
> For the MMU-related, the following 2 functions should return
> an error immediately if vm_terminated:
> 
> 	tdx_sept_link_private_spt()
> 	tdx_sept_set_private_spte()
> 
> For that not be racy, extra synchronization is needed so that
> vm_terminated can be reliably checked when holding mmu lock
> i.e.
> 
> static int tdx_terminate_vm(struct kvm *kvm)
> {
> 	if (kvm_trylock_all_vcpus(kvm))
> 		return -EBUSY;
> 
> 	kvm_vm_dead(kvm);
> +
> +       write_lock(&kvm->mmu_lock);
> 	to_kvm_tdx(kvm)->vm_terminated = true;
> +       write_unlock(&kvm->mmu_lock);
> 
> 	kvm_unlock_all_vcpus(kvm);
> 
> 	tdx_mmu_release_hkid(kvm);
> 
> 	return 0;
> }
> 
> Finally, there are 2 TDVPS_ACCESSORS that need avoiding:
> 
> 	tdx_load_mmu_pgd()
> 		skip td_vmcs_write64() if vm_terminated
> 
> 	tdx_protected_apic_has_interrupt()
> 		skip td_state_non_arch_read64() if vm_terminated

Oof.  And as Chao pointed out[*], removing the vm_dead check would allow creating
and running vCPUs in a dead VM, which is most definitely not desirable.  Squashing
the vCPU creation case is easy enough if we keep vm_dead but still generally allow
ioctls, and it's probably worth doing that no matter what (to plug the hole where
pending vCPU creations could succeed):

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d477a7fda0ae..941d2c32b7dc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 
        mutex_lock(&kvm->lock);
 
+       if (kvm->vm_dead) {
+               r = -EIO;
+               goto unlock_vcpu_destroy;
+       }
+
        if (kvm_get_vcpu_by_id(kvm, id)) {
                r = -EEXIST;
                goto unlock_vcpu_destroy;

And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD after acquiring
vcpu->mutex.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6c07dd423458..883077eee4ce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
 
        if (mutex_lock_killable(&vcpu->mutex))
                return -EINTR;
+
+       if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
+               r = -EIO;
+               goto out;
+       }
+
        switch (ioctl) {
        case KVM_RUN: {
                struct pid *oldpid;


That should address all TDVPS paths (I hope), and I _think_ would address all
MMU-related paths as well?  E.g. prefault requires a vCPU.

Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't great ABI
(understatement), but I think we need/want the above changes even if we keep the
general vm_dead restriction.  And given the extremely ad hoc behavior of taking
kvm->lock for VM ioctls, trying to enforce vm_dead for "all" VM ioctls seems like
a fool's errand.

So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is bugged, not
simply marked dead" (with a different shortlog+changelog), but keeping vm_dead
(and not introducing kvm_tdx.vm_terminated).

Thoughts?

[*] https://lore.kernel.org/all/aIlzeT+yFG2Tvb3%2F@intel.com


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
  2025-08-01 16:44     ` Sean Christopherson
@ 2025-08-03 17:41       ` Adrian Hunter
  2025-08-06  6:06       ` Chao Gao
  1 sibling, 0 replies; 21+ messages in thread
From: Adrian Hunter @ 2025-08-03 17:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, Vishal Annapurve, Xiaoyao Li,
	Rick Edgecombe, Nikolay Borisov, Yan Y Zhao, Kai Huang,
	binbin.wu@linux.intel.com

On 01/08/2025 19:44, Sean Christopherson wrote:
> +Chao
> 
> On Fri, Aug 01, 2025, Adrian Hunter wrote:
>> On 29/07/2025 22:33, Sean Christopherson wrote:
>>> +static int tdx_terminate_vm(struct kvm *kvm)
>>> +{
>>> +	if (kvm_trylock_all_vcpus(kvm))
>>> +		return -EBUSY;
>>> +
>>> +	kvm_vm_dead(kvm);
>>> +	to_kvm_tdx(kvm)->vm_terminated = true;
>>> +
>>> +	kvm_unlock_all_vcpus(kvm);
>>> +
>>> +	tdx_mmu_release_hkid(kvm);
>>> +
>>> +	return 0;
>>> +}
>>
>> As I think I mentioned when removing vm_dead first came up,
>> I think we need more checks.  I spent some time going through
>> the code and came up with what is below:
>>
>> First, we need to avoid TDX VCPU sub-IOCTLs from racing with
>> tdx_mmu_release_hkid().  But having any TDX sub-IOCTL run after
>> KVM_TDX_TERMINATE_VM raises questions of what might happen, so
>> it is much simpler to understand, if that is not possible.
>> There are 3 options:
>>
>> 1. Require that KVM_TDX_TERMINATE_VM is valid only if
>> kvm_tdx->state == TD_STATE_RUNNABLE.  Since currently all
>> the TDX sub-IOCTLs are for initialization, that would block
>> the opportunity for any to run after KVM_TDX_TERMINATE_VM.
>>
>> 2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl()
>>
>> 3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl()
>>
>> [ Note cannot check is_hkid_assigned() because that is racy ]
>>
>> Secondly, I suggest we avoid SEAMCALLs that will fail and
>> result in KVM_BUG_ON() if HKID has been released.
>>
>> There are 2 groups of those: MMU-related and TDVPS_ACCESSORS.
>>
>> For the MMU-related, the following 2 functions should return
>> an error immediately if vm_terminated:
>>
>> 	tdx_sept_link_private_spt()
>> 	tdx_sept_set_private_spte()
>>
>> For that not be racy, extra synchronization is needed so that
>> vm_terminated can be reliably checked when holding mmu lock
>> i.e.
>>
>> static int tdx_terminate_vm(struct kvm *kvm)
>> {
>> 	if (kvm_trylock_all_vcpus(kvm))
>> 		return -EBUSY;
>>
>> 	kvm_vm_dead(kvm);
>> +
>> +       write_lock(&kvm->mmu_lock);
>> 	to_kvm_tdx(kvm)->vm_terminated = true;
>> +       write_unlock(&kvm->mmu_lock);
>>
>> 	kvm_unlock_all_vcpus(kvm);
>>
>> 	tdx_mmu_release_hkid(kvm);
>>
>> 	return 0;
>> }
>>
>> Finally, there are 2 TDVPS_ACCESSORS that need avoiding:
>>
>> 	tdx_load_mmu_pgd()
>> 		skip td_vmcs_write64() if vm_terminated
>>
>> 	tdx_protected_apic_has_interrupt()
>> 		skip td_state_non_arch_read64() if vm_terminated
> 
> Oof.  And as Chao pointed out[*], removing the vm_dead check would allow creating
> and running vCPUs in a dead VM, which is most definitely not desirable.  Squashing
> the vCPU creation case is easy enough if we keep vm_dead but still generally allow
> ioctls, and it's probably worth doing that no matter what (to plug the hole where
> pending vCPU creations could succeed):
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d477a7fda0ae..941d2c32b7dc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
>  
>         mutex_lock(&kvm->lock);
>  
> +       if (kvm->vm_dead) {
> +               r = -EIO;
> +               goto unlock_vcpu_destroy;
> +       }
> +
>         if (kvm_get_vcpu_by_id(kvm, id)) {
>                 r = -EEXIST;
>                 goto unlock_vcpu_destroy;
> 
> And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD after acquiring
> vcpu->mutex.
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6c07dd423458..883077eee4ce 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  
>         if (mutex_lock_killable(&vcpu->mutex))
>                 return -EINTR;
> +
> +       if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
> +               r = -EIO;
> +               goto out;
> +       }
> +
>         switch (ioctl) {
>         case KVM_RUN: {
>                 struct pid *oldpid;
> 
> 
> That should address all TDVPS paths (I hope), and I _think_ would address all
> MMU-related paths as well?  E.g. prefault requires a vCPU.
> 
> Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't great ABI
> (understatement), but I think we need/want the above changes even if we keep the
> general vm_dead restriction.  And given the extremely ad hoc behavior of taking
> kvm->lock for VM ioctls, trying to enforce vm_dead for "all" VM ioctls seems like
> a fool's errand.
> 
> So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is bugged, not
> simply marked dead" (with a different shortlog+changelog), but keeping vm_dead
> (and not introducing kvm_tdx.vm_terminated).
> 
> Thoughts?

That covers the cases I listed, so it is fine by me.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
  2025-08-01 16:44     ` Sean Christopherson
  2025-08-03 17:41       ` Adrian Hunter
@ 2025-08-06  6:06       ` Chao Gao
  1 sibling, 0 replies; 21+ messages in thread
From: Chao Gao @ 2025-08-06  6:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Adrian Hunter, Marc Zyngier, Oliver Upton, Paolo Bonzini,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, Vishal Annapurve,
	Xiaoyao Li, Rick Edgecombe, Nikolay Borisov, Yan Y Zhao,
	Kai Huang, binbin.wu@linux.intel.com

>Oof.  And as Chao pointed out[*], removing the vm_dead check would allow creating
>and running vCPUs in a dead VM, which is most definitely not desirable.  Squashing
>the vCPU creation case is easy enough if we keep vm_dead but still generally allow
>ioctls, and it's probably worth doing that no matter what (to plug the hole where
>pending vCPU creations could succeed):
>
>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>index d477a7fda0ae..941d2c32b7dc 100644
>--- a/virt/kvm/kvm_main.c
>+++ b/virt/kvm/kvm_main.c
>@@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
> 
>        mutex_lock(&kvm->lock);
> 
>+       if (kvm->vm_dead) {
>+               r = -EIO;
>+               goto unlock_vcpu_destroy;
>+       }
>+

yes. this addresses my concern.

>        if (kvm_get_vcpu_by_id(kvm, id)) {
>                r = -EEXIST;
>                goto unlock_vcpu_destroy;
>
>And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD after acquiring
>vcpu->mutex.
>
>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>index 6c07dd423458..883077eee4ce 100644
>--- a/virt/kvm/kvm_main.c
>+++ b/virt/kvm/kvm_main.c
>@@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
> 
>        if (mutex_lock_killable(&vcpu->mutex))
>                return -EINTR;
>+
>+       if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
>+               r = -EIO;
>+               goto out;
>+       }
>+
>        switch (ioctl) {
>        case KVM_RUN: {
>                struct pid *oldpid;
>
>
>That should address all TDVPS paths (I hope), and I _think_ would address all
>MMU-related paths as well?  E.g. prefault requires a vCPU.
>
>Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't great ABI
>(understatement), but I think we need/want the above changes even if we keep the
>general vm_dead restriction.  And given the extremely ad hoc behavior of taking
>kvm->lock for VM ioctls, trying to enforce vm_dead for "all" VM ioctls seems like
>a fool's errand.
>
>So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is bugged, not
>simply marked dead" (with a different shortlog+changelog), but keeping vm_dead
>(and not introducing kvm_tdx.vm_terminated).

Sounds good to me.

With kvm_tdx.vm_terminated removed, we should consider adding a comment above
the is_hkid_assigned() check in tdx_sept_remove_private_spte() to clarify that
!is_hkid_assigned() indicates the guest has been terminated, allowing private
pages to be reclaimed directly without zapping.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-08-06  6:09 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-29 19:33 [PATCH 0/5] KVM: Drop vm_dead, pivot on vm_bugged for -EIO Sean Christopherson
2025-07-29 19:33 ` [PATCH 1/5] KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests Sean Christopherson
2025-07-29 19:33 ` [PATCH 2/5] KVM: TDX: Exit with MEMORY_FAULT on unexpected pending S-EPT Violation Sean Christopherson
2025-07-29 22:27   ` Edgecombe, Rick P
2025-07-29 22:54     ` Sean Christopherson
2025-07-29 22:58       ` Edgecombe, Rick P
2025-07-29 23:08         ` Sean Christopherson
2025-07-29 23:13           ` Edgecombe, Rick P
2025-07-30  5:45         ` Yan Zhao
2025-07-30  5:55     ` Yan Zhao
2025-07-30 12:59       ` Edgecombe, Rick P
2025-07-30  2:07   ` Xiaoyao Li
2025-07-30  6:04     ` Yan Zhao
2025-07-29 19:33 ` [PATCH 3/5] KVM: Reject ioctls only if the VM is bugged, not simply marked dead Sean Christopherson
2025-07-30  1:20   ` Chao Gao
2025-07-29 19:33 ` [PATCH 4/5] KVM: selftests: Use for-loop to handle all successful SEV migrations Sean Christopherson
2025-07-29 19:33 ` [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM Sean Christopherson
2025-08-01 13:56   ` Adrian Hunter
2025-08-01 16:44     ` Sean Christopherson
2025-08-03 17:41       ` Adrian Hunter
2025-08-06  6:06       ` Chao Gao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).