* [PATCH 0/2] Introduce RET_PF_RETRY_INVALID_SLOT
@ 2025-05-19 2:36 Yan Zhao
2025-05-19 2:37 ` [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot Yan Zhao
2025-05-19 2:38 ` [PATCH 2/2] KVM: selftests: Test prefault memory with concurrent memslot removal Yan Zhao
0 siblings, 2 replies; 18+ messages in thread
From: Yan Zhao @ 2025-05-19 2:36 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: reinette.chatre, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao
Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
kvm_mmu_do_page_fault() that a fault retry is caused by an invalid memslot.
This helps prevent deadlock when a memslot is removed during pre-faulting
or local retry of faulting private pages in TDX.
Take pre-faulting as an example. Removing a memslot during the ioctl
KVM_PRE_FAULT_MEMORY on x86 can lead to a kernel deadlock.
"slot deleting" thread "pre-fault" thread
----------------------------- ----------------------
srcu_read_lock();
(A)
invalid_slot->flags |= KVM_MEMSLOT_INVALID
rcu_assign_pointer();
kvm_tdp_map_page();
(B)
do {
r = kvm_mmu_do_page_fault();
(C) synchronize_srcu_expedited();
} while (r == RET_PF_RETRY);
(D) srcu_read_unlock();
The deadlock occurs because (C) is waiting for (D) to complete, but (B)
repeatedly encounters an invalid slot and retries before (C) can finish,
thus preventing (D) from being executed.
(A deadlock can also occur if the srcu_read_lock() in the pre-fault thread
is invoked after (A) and before (C).)
The local retry code in TDX's EPT violation handler faces a similar issue,
where deadlock can occur if an invalid slot is found when faulting a
private GFN.
To resolve the deadlock, modify kvm_mmu_do_page_fault() to return
RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
invalid memslot. This change enables the pre-fault thread or TDX's EPT
violation handler to exit the loop and release the srcu lock.
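With that change, the retry loop in kvm_tdp_map_page() no longer spins on an
invalid slot. Roughly (a simplified sketch of the behavior after patch 1, not
a verbatim excerpt):

	do {
		...
		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
	} while (r == RET_PF_RETRY);	/* RET_PF_RETRY_INVALID_SLOT breaks the loop */

	switch (r) {
	...
	case RET_PF_EMULATE:
	case RET_PF_RETRY_INVALID_SLOT:
		return -ENOENT;		/* pre-fault path bails out and releases srcu */
	...
	}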
There are some alternative solutions to address the deadlock:
1) Return -EAGAIN for an invalid slot.
   Cons: This approach involves an uAPI change, requiring userspace to
         ignore error -EAGAIN and re-run the vCPU. While QEMU already
         ignores -EAGAIN, selftests may need updates.
2) Call kvm_mmu_prepare_memory_fault_exit() and return -EFAULT for an
   invalid slot.
   Cons: - kvm_mmu_prepare_memory_fault_exit() is not recognized by
           userspace for normal VMs.
         - Returning -EFAULT when a memslot is still invalid (i.e., not
           completely removed) may confuse userspace.
3) Release the srcu lock before cond_resched() and re-acquire it afterward
   in kvm_tdp_map_page() and tdx_handle_ept_violation() (see the sketch
   after this list).
   Cons: - It allows each kvm_vcpu_pre_fault_memory() to pre-fault memory
           with different memslot layouts.
         - It requires tdx_gmem_post_populate() to acquire the srcu lock
           before invoking kvm_tdp_map_page(), which is redundant since
           tdx_gmem_post_populate() already holds the kvm->slots_lock.
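For reference, a minimal sketch of what alternative 3 would look like inside
the retry loop of kvm_tdp_map_page() (illustrative only, not part of this
series; it assumes the existing kvm_vcpu_srcu_read_lock()/
kvm_vcpu_srcu_read_unlock() helpers and elides the signal and VM-dead checks):

	do {
		cond_resched();
		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
		if (r == RET_PF_RETRY) {
			/* Drop srcu so a concurrent memslot update can finish. */
			kvm_vcpu_srcu_read_unlock(vcpu);
			kvm_vcpu_srcu_read_lock(vcpu);
		}
	} while (r == RET_PF_RETRY);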
Patch 1 opts to introduce RET_PF_RETRY_INVALID_SLOT, which prevents endless
retries in kvm_tdp_map_page() and tdx_handle_ept_violation() while avoiding
exiting to userspace.
Patch 2 provides a selftest that reproduces the deadlock when patch 1 is
absent and verifies that pre-faulting completes correctly when the faulting
range is in a memslot that is being removed.
Note: The selftest pre_fault_memory_test relies on commit 20a6cff3b283
("KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()") to run
the second ioctl KVM_PRE_FAULT_MEMORY successfully after a memory slot is
removed during the first ioctl KVM_PRE_FAULT_MEMORY.
Base commit:
kvm-x86/next or
kvm/next + commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete
roots in kvm_mmu_reload()")
Yan Zhao (2):
KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid
slot
KVM: selftests: Test prefault memory with concurrent memslot removal
arch/x86/kvm/mmu/mmu.c | 3 +-
arch/x86/kvm/mmu/mmu_internal.h | 3 +
.../selftests/kvm/pre_fault_memory_test.c | 82 +++++++++++++++----
3 files changed, 72 insertions(+), 16 deletions(-)
base-commit: e59ae112a7c2e0f58d9f5905de299fe206f84af0
--
2.43.2
* [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 2:36 [PATCH 0/2] Introduce RET_PF_RETRY_INVALID_SLOT Yan Zhao
@ 2025-05-19 2:37 ` Yan Zhao
2025-05-19 13:33 ` Sean Christopherson
2025-05-19 2:38 ` [PATCH 2/2] KVM: selftests: Test prefault memory with concurrent memslot removal Yan Zhao
1 sibling, 1 reply; 18+ messages in thread
From: Yan Zhao @ 2025-05-19 2:37 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: reinette.chatre, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao
Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
This helps prevent deadlocks when a memslot is removed during pre-faulting
GPAs in the memslot or local retry of faulting private pages in TDX.
Take pre-faulting as an example.
During ioctl KVM_PRE_FAULT_MEMORY, kvm->srcu is acquired around the
pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
if the return value is RET_PF_RETRY.
If a memslot is deleted during the ioctl KVM_PRE_FAULT_MEMORY, after
kvm_invalidate_memslot() marks a slot as invalid and makes it visible via
rcu_assign_pointer() in kvm_swap_active_memslots(), kvm_mmu_do_page_fault()
may encounter an invalid slot and return RET_PF_RETRY. Consequently,
kvm_tdp_map_page() will then retry without releasing the srcu lock.
Meanwhile, synchronize_srcu_expedited() in kvm_swap_active_memslots() is
blocked, waiting for kvm_vcpu_pre_fault_memory() to release the srcu lock,
leading to a deadlock.
"slot deleting" thread "prefault" thread
----------------------------- ----------------------
srcu_read_lock();
(A)
invalid_slot->flags |= KVM_MEMSLOT_INVALID;
rcu_assign_pointer();
kvm_tdp_map_page();
(B)
do {
r = kvm_mmu_do_page_fault();
(C) synchronize_srcu_expedited();
} while (r == RET_PF_RETRY);
(D) srcu_read_unlock();
As shown in diagram, (C) is waiting for (D). However, (B) continuously
finds an invalid slot before (C) completes, causing (B) to retry and
preventing (D) from being invoked.
The local retry code in TDX's EPT violation handler faces a similar issue,
where a deadlock can occur when faulting a private GFN in a slot that is
concurrently being removed.
To resolve the deadlock, introduce a new return value
RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
tdx_handle_ept_violation(), allowing the srcu to be released and enabling
slot removal to proceed.
As all callers of kvm_tdp_map_page(), i.e.
kvm_arch_vcpu_pre_fault_memory() and tdx_gmem_post_populate(), are in the
pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
and return -ENOENT from kvm_tdp_map_page(), so that userspace is made aware
of the slot removal.
Returning RET_PF_RETRY_INVALID_SLOT from kvm_mmu_do_page_fault() does not
affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as their
callers either only check whether the return value is > 0 to re-enter the
vCPU for a retry, or do not check the return value at all.
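For illustration, the caller pattern being relied on is roughly (a simplified
sketch, not an exact excerpt of either call site):

	r = kvm_mmu_page_fault(vcpu, cr2_or_gpa, error_code, insn, insn_len);
	if (r < 0)
		return r;
	/*
	 * Any positive value, including one that originated from
	 * RET_PF_RETRY_INVALID_SLOT, simply re-enters the vCPU so the
	 * fault is retried on the next guest access.
	 */
	return r;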
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
arch/x86/kvm/mmu/mmu.c | 3 ++-
arch/x86/kvm/mmu/mmu_internal.h | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..3331e1e1aa69 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4599,7 +4599,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
* be zapped before KVM inserts a new MMIO SPTE for the gfn.
*/
if (slot->flags & KVM_MEMSLOT_INVALID)
- return RET_PF_RETRY;
+ return RET_PF_RETRY_INVALID_SLOT;
if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
/*
@@ -4879,6 +4879,7 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
return 0;
case RET_PF_EMULATE:
+ case RET_PF_RETRY_INVALID_SLOT:
return -ENOENT;
case RET_PF_RETRY:
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..1aa14a32225e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -311,6 +311,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
* RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
* RET_PF_FIXED: The faulting entry has been fixed.
* RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
+ * RET_PF_RETRY_INVALID_SLOT: Let CPU fault again on the address due to slot
+ * with flag KVM_MEMSLOT_INVALID.
*
* Any names added to this enum should be exported to userspace for use in
* tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
@@ -326,6 +328,7 @@ enum {
RET_PF_INVALID,
RET_PF_FIXED,
RET_PF_SPURIOUS,
+ RET_PF_RETRY_INVALID_SLOT,
};
/*
--
2.43.2
* [PATCH 2/2] KVM: selftests: Test prefault memory with concurrent memslot removal
2025-05-19 2:36 [PATCH 0/2] Introduce RET_PF_RETRY_INVALID_SLOT Yan Zhao
2025-05-19 2:37 ` [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot Yan Zhao
@ 2025-05-19 2:38 ` Yan Zhao
1 sibling, 0 replies; 18+ messages in thread
From: Yan Zhao @ 2025-05-19 2:38 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: reinette.chatre, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao
Add a new test case in pre_fault_memory_test to run vm_mem_region_delete()
concurrently with ioctl KVM_PRE_FAULT_MEMORY. Both of them should complete.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
.../selftests/kvm/pre_fault_memory_test.c | 82 +++++++++++++++----
1 file changed, 67 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index 0350a8896a2f..c82dfc033a7b 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -10,12 +10,16 @@
#include <test_util.h>
#include <kvm_util.h>
#include <processor.h>
+#include <pthread.h>
/* Arbitrarily chosen values */
#define TEST_SIZE (SZ_2M + PAGE_SIZE)
#define TEST_NPAGES (TEST_SIZE / PAGE_SIZE)
#define TEST_SLOT 10
+static bool prefault_ready;
+static bool delete_thread_ready;
+
static void guest_code(uint64_t base_gpa)
{
volatile uint64_t val __used;
@@ -30,16 +34,41 @@ static void guest_code(uint64_t base_gpa)
GUEST_DONE();
}
-static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
- u64 left)
+static void *remove_slot_worker(void *data)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
+
+ WRITE_ONCE(delete_thread_ready, true);
+
+ while (!READ_ONCE(prefault_ready))
+ cpu_relax();
+
+ vm_mem_region_delete(vcpu->vm, TEST_SLOT);
+
+ WRITE_ONCE(delete_thread_ready, false);
+ return NULL;
+}
+
+static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
+ u64 size, u64 left, bool private, bool remove_slot)
{
struct kvm_pre_fault_memory range = {
- .gpa = gpa,
+ .gpa = base_gpa + offset,
.size = size,
.flags = 0,
};
u64 prev;
int ret, save_errno;
+ pthread_t remove_thread;
+
+ if (remove_slot) {
+ pthread_create(&remove_thread, NULL, remove_slot_worker, vcpu);
+
+ while (!READ_ONCE(delete_thread_ready))
+ cpu_relax();
+
+ WRITE_ONCE(prefault_ready, true);
+ }
do {
prev = range.size;
@@ -51,16 +80,35 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
ret < 0 ? "failure" : "success");
} while (ret >= 0 ? range.size : save_errno == EINTR);
- TEST_ASSERT(range.size == left,
- "Completed with %lld bytes left, expected %" PRId64,
- range.size, left);
-
- if (left == 0)
- __TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
- else
- /* No memory slot causes RET_PF_EMULATE. it results in -ENOENT. */
- __TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == ENOENT,
+ if (remove_slot) {
+ /*
+ * ENOENT is expected if slot removal is performed earlier or
+ * during KVM_PRE_FAULT_MEMORY;
+ * On rare condition, ret could be 0 if KVM_PRE_FAULT_MEMORY
+ * completes earlier than slot removal.
+ */
+ __TEST_ASSERT_VM_VCPU_IOCTL((ret && save_errno == ENOENT) || !ret,
"KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
+
+ pthread_join(remove_thread, NULL);
+ WRITE_ONCE(prefault_ready, false);
+
+ vm_userspace_mem_region_add(vcpu->vm, VM_MEM_SRC_ANONYMOUS,
+ base_gpa, TEST_SLOT, TEST_NPAGES,
+ private ? KVM_MEM_GUEST_MEMFD : 0);
+ } else {
+ TEST_ASSERT(range.size == left,
+ "Completed with %lld bytes left, expected %" PRId64,
+ range.size, left);
+
+ if (left == 0)
+ __TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_PRE_FAULT_MEMORY",
+ ret, vcpu->vm);
+ else
+ /* No memory slot causes RET_PF_EMULATE. it results in -ENOENT. */
+ __TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == ENOENT,
+ "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
+ }
}
static void __test_pre_fault_memory(unsigned long vm_type, bool private)
@@ -97,9 +145,13 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private)
if (private)
vm_mem_set_private(vm, guest_test_phys_mem, TEST_SIZE);
- pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, 0);
- pre_fault_memory(vcpu, guest_test_phys_mem + SZ_2M, PAGE_SIZE * 2, PAGE_SIZE);
- pre_fault_memory(vcpu, guest_test_phys_mem + TEST_SIZE, PAGE_SIZE, PAGE_SIZE);
+
+ pre_fault_memory(vcpu, guest_test_phys_mem, 0, SZ_2M, 0, private, true);
+ pre_fault_memory(vcpu, guest_test_phys_mem, 0, SZ_2M, 0, private, false);
+ pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, PAGE_SIZE * 2, PAGE_SIZE,
+ private, false);
+ pre_fault_memory(vcpu, guest_test_phys_mem, TEST_SIZE, PAGE_SIZE, PAGE_SIZE,
+ private, false);
vcpu_args_set(vcpu, 1, guest_test_virt_mem);
vcpu_run(vcpu);
--
2.43.2
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 2:37 ` [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot Yan Zhao
@ 2025-05-19 13:33 ` Sean Christopherson
2025-05-19 15:05 ` Reinette Chatre
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Sean Christopherson @ 2025-05-19 13:33 UTC (permalink / raw)
To: Yan Zhao; +Cc: pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm
On Mon, May 19, 2025, Yan Zhao wrote:
> Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
> kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
> This helps prevent deadlocks when a memslot is removed during pre-faulting
> GPAs in the memslot or local retry of faulting private pages in TDX.
>
> Take pre-faulting as an example.
>
> During ioctl KVM_PRE_FAULT_MEMORY, kvm->srcu is acquired around the
> pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
> further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
> if the return value is RET_PF_RETRY.
>
> If a memslot is deleted during the ioctl KVM_PRE_FAULT_MEMORY, after
> kvm_invalidate_memslot() marks a slot as invalid and makes it visible via
> rcu_assign_pointer() in kvm_swap_active_memslots(), kvm_mmu_do_page_fault()
> may encounter an invalid slot and return RET_PF_RETRY. Consequently,
> kvm_tdp_map_page() will then retry without releasing the srcu lock.
> Meanwhile, synchronize_srcu_expedited() in kvm_swap_active_memslots() is
> blocked, waiting for kvm_vcpu_pre_fault_memory() to release the srcu lock,
> leading to a deadlock.
Probably worth calling out that KVM will respond to signals, i.e. there's no risk
to the host kernel.
> "slot deleting" thread "prefault" thread
> ----------------------------- ----------------------
> srcu_read_lock();
> (A)
> invalid_slot->flags |= KVM_MEMSLOT_INVALID;
> rcu_assign_pointer();
>
> kvm_tdp_map_page();
> (B)
> do {
> r = kvm_mmu_do_page_fault();
>
> (C) synchronize_srcu_expedited();
>
> } while (r == RET_PF_RETRY);
>
> (D) srcu_read_unlock();
>
> As shown in diagram, (C) is waiting for (D). However, (B) continuously
> finds an invalid slot before (C) completes, causing (B) to retry and
> preventing (D) from being invoked.
>
> The local retry code in TDX's EPT violation handler faces a similar issue,
> where a deadlock can occur when faulting a private GFN in a slot that is
> concurrently being removed.
>
> To resolve the deadlock, introduce a new return value
> RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
> RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
> invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
> tdx_handle_ept_violation(), allowing the srcu to be released and enabling
> slot removal to proceed.
>
> As all callers of kvm_tdp_map_page(), i.e.,
> kvm_arch_vcpu_pre_fault_memory() or tdx_gmem_post_populate(), are in
> pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
> to return -ENOENT in kvm_tdp_map_page() to enable userspace to be aware of
> the slot removal.
Userspace should already be "aware" of the slot removal.
> Returning RET_PF_RETRY_INVALID_SLOT in kvm_mmu_do_page_fault() does not
> affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as their
> callers either only check if the return value > 0 to re-enter vCPU for
> retry or do not check return value.
>
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
kicking vCPUs out of KVM?
Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
KVM can simply drop and reacquire SRCU in the relevant paths.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..ceab756052eb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4866,7 +4866,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
return -EIO;
cond_resched();
+
r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+ if (r == RET_PF_RETRY) {
+ kvm_vcpu_srcu_read_unlock(vcpu);
+ kvm_vcpu_srcu_read_lock(vcpu);
+ }
} while (r == RET_PF_RETRY);
if (r < 0)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..e29966ce3ab7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1920,6 +1920,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
break;
}
+ kvm_vcpu_srcu_read_unlock(vcpu);
+ kvm_vcpu_srcu_read_lock(vcpu);
+
cond_resched();
}
return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b24db92e98f3..21a3fa7476dd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
struct kvm_pre_fault_memory *range)
{
- int idx;
long r;
u64 full_size;
@@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
return -EINVAL;
vcpu_load(vcpu);
- idx = srcu_read_lock(&vcpu->kvm->srcu);
+ kvm_vcpu_srcu_read_lock(vcpu);
full_size = range->size;
do {
@@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
cond_resched();
} while (range->size);
- srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ kvm_vcpu_srcu_read_unlock(vcpu);
vcpu_put(vcpu);
/* Return success if at least one page was mapped successfully. */
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 13:33 ` Sean Christopherson
@ 2025-05-19 15:05 ` Reinette Chatre
2025-05-19 15:22 ` Edgecombe, Rick P
2025-05-19 16:12 ` Edgecombe, Rick P
2025-05-20 5:27 ` Yan Zhao
2 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2025-05-19 15:05 UTC (permalink / raw)
To: Sean Christopherson, Yan Zhao
Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm
Hi Sean,
On 5/19/25 6:33 AM, Sean Christopherson wrote:
> On Mon, May 19, 2025, Yan Zhao wrote:
>> Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
>> kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
>> This helps prevent deadlocks when a memslot is removed during pre-faulting
>> GPAs in the memslot or local retry of faulting private pages in TDX.
>>
>> Take pre-faulting as an example.
>>
>> During ioctl KVM_PRE_FAULT_MEMORY, kvm->srcu is acquired around the
>> pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
>> further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
>> if the return value is RET_PF_RETRY.
>>
>> If a memslot is deleted during the ioctl KVM_PRE_FAULT_MEMORY, after
>> kvm_invalidate_memslot() marks a slot as invalid and makes it visible via
>> rcu_assign_pointer() in kvm_swap_active_memslots(), kvm_mmu_do_page_fault()
>> may encounter an invalid slot and return RET_PF_RETRY. Consequently,
>> kvm_tdp_map_page() will then retry without releasing the srcu lock.
>> Meanwhile, synchronize_srcu_expedited() in kvm_swap_active_memslots() is
>> blocked, waiting for kvm_vcpu_pre_fault_memory() to release the srcu lock,
>> leading to a deadlock.
>
> Probably worth calling out that KVM will respond to signals, i.e. there's no risk
> to the host kernel.
>
>> "slot deleting" thread "prefault" thread
>> ----------------------------- ----------------------
>> srcu_read_lock();
>> (A)
>> invalid_slot->flags |= KVM_MEMSLOT_INVALID;
>> rcu_assign_pointer();
>>
>> kvm_tdp_map_page();
>> (B)
>> do {
>> r = kvm_mmu_do_page_fault();
>>
>> (C) synchronize_srcu_expedited();
>>
>> } while (r == RET_PF_RETRY);
>>
>> (D) srcu_read_unlock();
>>
>> As shown in diagram, (C) is waiting for (D). However, (B) continuously
>> finds an invalid slot before (C) completes, causing (B) to retry and
>> preventing (D) from being invoked.
>>
>> The local retry code in TDX's EPT violation handler faces a similar issue,
>> where a deadlock can occur when faulting a private GFN in a slot that is
>> concurrently being removed.
>>
>> To resolve the deadlock, introduce a new return value
>> RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
>> RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
>> invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
>> tdx_handle_ept_violation(), allowing the srcu to be released and enabling
>> slot removal to proceed.
>>
>> As all callers of kvm_tdp_map_page(), i.e.,
>> kvm_arch_vcpu_pre_fault_memory() or tdx_gmem_post_populate(), are in
>> pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
>> to return -ENOENT in kvm_tdp_map_page() to enable userspace to be aware of
>> the slot removal.
>
> Userspace should already be "aware" of the slot removal.
>
>> Returning RET_PF_RETRY_INVALID_SLOT in kvm_mmu_do_page_fault() does not
>> affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as their
>> callers either only check if the return value > 0 to re-enter vCPU for
>> retry or do not check return value.
>>
>> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
>
> Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> kicking vCPUs out of KVM?
No, this was not hit by a real VMM. This was hit by a TDX MMU stress test (built
on top of [1]) that is still under development.
Reinette
[1] https://lore.kernel.org/lkml/20250414214801.2693294-1-sagis@google.com/
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 15:05 ` Reinette Chatre
@ 2025-05-19 15:22 ` Edgecombe, Rick P
2025-05-19 15:53 ` Sean Christopherson
0 siblings, 1 reply; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-05-19 15:22 UTC (permalink / raw)
To: Chatre, Reinette, seanjc@google.com, Zhao, Yan Y
Cc: kvm@vger.kernel.org, pbonzini@redhat.com,
linux-kernel@vger.kernel.org
On Mon, 2025-05-19 at 08:05 -0700, Reinette Chatre wrote:
> > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot
> > without
> > kicking vCPUs out of KVM?
>
> No, this was not hit by a real VMM. This was hit by a TDX MMU stress test
> (built on top of [1]) that is still under development.
Yea, the context is that this TDX MMU stress test has grown more and more
stressful. Mostly it has found TDX module issues. But recently it added this
case which turned out to be a general issue. The TDX specific MMU stress test is
not ready yet, so Yan added the case to the general test and fixed it for both
VM types.
For TDX, since it's a pretty edge case and nothing catastrophic happens, I'd
prefer to not rush a fix into the TDX PR.
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 15:22 ` Edgecombe, Rick P
@ 2025-05-19 15:53 ` Sean Christopherson
2025-05-19 16:17 ` Edgecombe, Rick P
0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2025-05-19 15:53 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Reinette Chatre, Yan Y Zhao, kvm@vger.kernel.org,
pbonzini@redhat.com, linux-kernel@vger.kernel.org
On Mon, May 19, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-05-19 at 08:05 -0700, Reinette Chatre wrote:
> > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot
> > > without kicking vCPUs out of KVM?
> >
> > No, this was not hit by a real VMM. This was hit by a TDX MMU stress test
> > (built on top of [1]) that is still under development.
>
> Yea, the context is that this TDX MMU stress test has grown more and more
> stressful. Mostly it has found TDX module issues. But recently it added this
> case which turned out to be a general issue. The TDX specific MMU stress test is
> not ready yet, so Yan added the case to the general test and fixed it for both
> VM types.
>
> For TDX, since it's an pretty edge case and nothing catastrophic happens, I'd
> prefer to not rush a fix into the TDX PR.
Yeah, and I'd prefer not to bleed these details into userspace (or into KVM in
general), hence my question about whether or not a "real" VMM hit this.
Thanks!
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 13:33 ` Sean Christopherson
2025-05-19 15:05 ` Reinette Chatre
@ 2025-05-19 16:12 ` Edgecombe, Rick P
2025-05-19 17:06 ` Sean Christopherson
2025-05-20 5:27 ` Yan Zhao
2 siblings, 1 reply; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-05-19 16:12 UTC (permalink / raw)
To: seanjc@google.com, Zhao, Yan Y
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Chatre, Reinette,
linux-kernel@vger.kernel.org
On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> kicking vCPUs out of KVM?
>
> Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> KVM can simply drop and reacquire SRCU in the relevant paths.
During the initial debugging and kicking around stage, this is the first
direction we looked. But kvm_gmem_populate() doesn't have srcu locked, so then
kvm_tdp_map_page() tries to unlock without it being held (although that version
didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
came up with the version in this series, which we held review on for the list:
> However, upon further consideration, I am reluctant to implement this fix for
> the following reasons:
> - kvm_gmem_populate() already holds the kvm->slots_lock.
> - While retrying with srcu unlock and lock can workaround the
> KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> and tdx_handle_ept_violation() faulting with different memslot layouts.
I'm not sure why the second one is really a problem. For the first one I think
that path could just take the srcu lock in the proper order with
kvm->slots_lock? I need to stare at these locking rules each time, so low
quality suggestion. But that is the context.
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 15:53 ` Sean Christopherson
@ 2025-05-19 16:17 ` Edgecombe, Rick P
0 siblings, 0 replies; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-05-19 16:17 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, Chatre, Reinette, pbonzini@redhat.com,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Mon, 2025-05-19 at 08:53 -0700, Sean Christopherson wrote:
> Yeah, and I'd prefer not to bleed these details into userspace (or into KVM in
> general)
>
Makes sense.
> , hence my question about whether or not a "real" VMM hit this.
None that I know of.
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 16:12 ` Edgecombe, Rick P
@ 2025-05-19 17:06 ` Sean Christopherson
2025-05-19 17:49 ` Reinette Chatre
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Sean Christopherson @ 2025-05-19 17:06 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Yan Y Zhao, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Mon, May 19, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > kicking vCPUs out of KVM?
> >
> > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > KVM can simply drop and reacquire SRCU in the relevant paths.
>
> During the initial debugging and kicking around stage, this is the first
> direction we looked. But kvm_gmem_populate() doesn't have scru locked, so then
> kvm_tdp_map_page() tries to unlock without it being held. (although that version
> didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> came up with the version in this series, which we held review on for the list:
Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
> > However, upon further consideration, I am reluctant to implement this fix for
Which fix?
> > the following reasons:
> > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > - While retrying with srcu unlock and lock can workaround the
> > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > and tdx_handle_ept_violation() faulting with different memslot layouts.
This behavior has existed since pretty much the beginning of KVM time. TDX is the
oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
Arguably, _TDX_ is buggy by not providing this behavior.
> I'm not sure why the second one is really a problem. For the first one I think
> that path could just take the scru lock in the proper order with kvm-
> >slots_lock?
Acquiring SRCU inside slots_lock should be fine. The reverse order would be
problematic, as KVM synchronizes SRCU while holding slots_lock.
That said, I don't love the idea of grabbing SRCU, because it's so obviously a
hack. What about something like this?
---
arch/x86/kvm/mmu.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 49 +++++++++++++++++++++++++++---------------
arch/x86/kvm/vmx/tdx.c | 7 ++++--
virt/kvm/kvm_main.c | 5 ++---
4 files changed, 41 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..0fc68f0fe80e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -259,6 +259,8 @@ extern bool tdp_mmu_enabled;
bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+ u8 *level);
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..4f16fe95173c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4851,24 +4851,15 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
{
int r;
- /*
- * Restrict to TDP page fault, since that's the only case where the MMU
- * is indexed by GPA.
- */
- if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
- return -EOPNOTSUPP;
+ if (signal_pending(current))
+ return -EINTR;
- do {
- if (signal_pending(current))
- return -EINTR;
+ if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+ return -EIO;
- if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
- return -EIO;
-
- cond_resched();
- r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
- } while (r == RET_PF_RETRY);
+ cond_resched();
+ r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
if (r < 0)
return r;
@@ -4878,10 +4869,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
case RET_PF_WRITE_PROTECTED:
return 0;
+ case RET_PF_RETRY:
+ return -EAGAIN;
+
case RET_PF_EMULATE:
return -ENOENT;
- case RET_PF_RETRY:
case RET_PF_CONTINUE:
case RET_PF_INVALID:
default:
@@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
}
EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
+int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
+{
+ int r;
+
+ /*
+ * Restrict to TDP page fault, since that's the only case where the MMU
+ * is indexed by GPA.
+ */
+ if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
+ return -EOPNOTSUPP;
+
+ for (;;) {
+ r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
+ if (r != -EAGAIN)
+ break;
+
+ /* Comment goes here. */
+ kvm_vcpu_srcu_read_unlock(vcpu);
+ kvm_vcpu_srcu_read_lock(vcpu);
+ }
+}
+
long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
struct kvm_pre_fault_memory *range)
{
@@ -4918,7 +4933,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
* Shadow paging uses GVA for kvm page fault, so restrict to
* two-dimensional paging.
*/
- r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
+ r = kvm_tdp_prefault_page(vcpu, range->gpa, error_code, &level);
if (r < 0)
return r;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..1a232562080d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3075,8 +3075,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
if (ret != 1)
return -ENOMEM;
- ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
- if (ret < 0)
+ do {
+ ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+ while (ret == -EAGAIN);
+
+ if (ret)
goto out;
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b24db92e98f3..21a3fa7476dd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
struct kvm_pre_fault_memory *range)
{
- int idx;
long r;
u64 full_size;
@@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
return -EINVAL;
vcpu_load(vcpu);
- idx = srcu_read_lock(&vcpu->kvm->srcu);
+ kvm_vcpu_srcu_read_lock(vcpu);
full_size = range->size;
do {
@@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
cond_resched();
} while (range->size);
- srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ kvm_vcpu_srcu_read_unlock(vcpu);
vcpu_put(vcpu);
/* Return success if at least one page was mapped successfully. */
base-commit: 12ca5c63556bbfcd77fe890fcdd1cd1adfb31fdd
--
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 17:06 ` Sean Christopherson
@ 2025-05-19 17:49 ` Reinette Chatre
2025-05-19 20:14 ` Edgecombe, Rick P
2025-05-20 5:33 ` Yan Zhao
2 siblings, 0 replies; 18+ messages in thread
From: Reinette Chatre @ 2025-05-19 17:49 UTC (permalink / raw)
To: Sean Christopherson, Rick P Edgecombe
Cc: Yan Y Zhao, kvm@vger.kernel.org, pbonzini@redhat.com,
linux-kernel@vger.kernel.org
Hi Sean,
On 5/19/25 10:06 AM, Sean Christopherson wrote:
> On Mon, May 19, 2025, Rick P Edgecombe wrote:
>> On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
>>> Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
>>> kicking vCPUs out of KVM?
>>>
>>> Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
>>> KVM can simply drop and reacquire SRCU in the relevant paths.
>>
>> During the initial debugging and kicking around stage, this is the first
>> direction we looked. But kvm_gmem_populate() doesn't have scru locked, so then
>> kvm_tdp_map_page() tries to unlock without it being held. (although that version
>> didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
>> came up with the version in this series, which we held review on for the list:
>
> Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
>
>>> However, upon further consideration, I am reluctant to implement this fix for
>
> Which fix?
>
>>> the following reasons:
>>> - kvm_gmem_populate() already holds the kvm->slots_lock.
>>> - While retrying with srcu unlock and lock can workaround the
>>> KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
>>> and tdx_handle_ept_violation() faulting with different memslot layouts.
>
> This behavior has existed since pretty much the beginning of KVM time. TDX is the
> oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
>
> Arguably, _TDX_ is buggy by not providing this behavior.
>
>> I'm not sure why the second one is really a problem. For the first one I think
>> that path could just take the scru lock in the proper order with kvm-
>>> slots_lock?
>
> Acquiring SRCU inside slots_lock should be fine. The reserve order would be
> problematic, as KVM synchronizes SRCU while holding slots_lock.
>
> That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> hack. What about something like this?
>
> ---
> arch/x86/kvm/mmu.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 49 +++++++++++++++++++++++++++---------------
> arch/x86/kvm/vmx/tdx.c | 7 ++++--
> virt/kvm/kvm_main.c | 5 ++---
> 4 files changed, 41 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..0fc68f0fe80e 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -259,6 +259,8 @@ extern bool tdp_mmu_enabled;
>
> bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> + u8 *level);
>
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..4f16fe95173c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4851,24 +4851,15 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> {
> int r;
>
> - /*
> - * Restrict to TDP page fault, since that's the only case where the MMU
> - * is indexed by GPA.
> - */
> - if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> - return -EOPNOTSUPP;
> + if (signal_pending(current))
> + return -EINTR;
>
> - do {
> - if (signal_pending(current))
> - return -EINTR;
> + if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> + return -EIO;
>
> - if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> - return -EIO;
> -
> - cond_resched();
> - r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> - } while (r == RET_PF_RETRY);
> + cond_resched();
>
> + r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> if (r < 0)
> return r;
>
> @@ -4878,10 +4869,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> case RET_PF_WRITE_PROTECTED:
> return 0;
>
> + case RET_PF_RETRY:
> + return -EAGAIN;
> +
> case RET_PF_EMULATE:
> return -ENOENT;
>
> - case RET_PF_RETRY:
> case RET_PF_CONTINUE:
> case RET_PF_INVALID:
> default:
> @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> }
> EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> +{
> + int r;
> +
> + /*
> + * Restrict to TDP page fault, since that's the only case where the MMU
> + * is indexed by GPA.
> + */
> + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> + return -EOPNOTSUPP;
> +
> + for (;;) {
> + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> + if (r != -EAGAIN)
> + break;
> +
> + /* Comment goes here. */
> + kvm_vcpu_srcu_read_unlock(vcpu);
> + kvm_vcpu_srcu_read_lock(vcpu);
> + }
added "return r" here.
> +}
> +
> long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> {
> @@ -4918,7 +4933,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> * Shadow paging uses GVA for kvm page fault, so restrict to
> * two-dimensional paging.
> */
> - r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
> + r = kvm_tdp_prefault_page(vcpu, range->gpa, error_code, &level);
> if (r < 0)
> return r;
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..1a232562080d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3075,8 +3075,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> if (ret != 1)
> return -ENOMEM;
>
> - ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> - if (ret < 0)
> + do {
> + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> + while (ret == -EAGAIN);
added closing '}' here
> +
> + if (ret)
> goto out;
>
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b24db92e98f3..21a3fa7476dd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> {
> - int idx;
> long r;
> u64 full_size;
>
> @@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> return -EINVAL;
>
> vcpu_load(vcpu);
> - idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>
> full_size = range->size;
> do {
> @@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> cond_resched();
> } while (range->size);
>
> - srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
> vcpu_put(vcpu);
>
> /* Return success if at least one page was mapped successfully. */
>
> base-commit: 12ca5c63556bbfcd77fe890fcdd1cd1adfb31fdd
> --
Just reporting on the testing side ...
I just made two small changes to the patch as indicated. With these changes pre_fault_memory_test
(with changes from patch #2) hangs (with KVM still responding to signals). fwiw,
pre_fault_memory_test also hangs with [1].
I also tried out the TDX stress tests that you do not (yet) have access to. The stress tests
passes with [1] but hangs with this change.
Reinette
[1] https://lore.kernel.org/lkml/aCsy-m_esVjy8Pey@google.com/
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 17:06 ` Sean Christopherson
2025-05-19 17:49 ` Reinette Chatre
@ 2025-05-19 20:14 ` Edgecombe, Rick P
2025-05-20 5:33 ` Yan Zhao
2 siblings, 0 replies; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-05-19 20:14 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Chatre, Reinette,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Mon, 2025-05-19 at 10:06 -0700, Sean Christopherson wrote:
> On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > kicking vCPUs out of KVM?
> > >
> > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > KVM can simply drop and reacquire SRCU in the relevant paths.
> >
> > During the initial debugging and kicking around stage, this is the first
> > direction we looked. But kvm_gmem_populate() doesn't have scru locked, so then
> > kvm_tdp_map_page() tries to unlock without it being held. (although that version
> > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > came up with the version in this series, which we held review on for the list:
>
> Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
>
> > > However, upon further consideration, I am reluctant to implement this fix for
>
> Which fix?
The one considered when debugging the issue. It was pretty much exactly like you
first suggested, but with the srcu lock taken in tdx_gmem_post_populate(). Since
it was pretty much the same I just shared Yan's comment on it.
In case it looks like internal code review, here is some more history:
1. Reinette reports bug from internal test, wondering if it's valid userspace
behavior
2. I suggest srcu root cause
3. Yan provides specific diff (pretty much what you suggested) for Reinette to
test, who finds the post_populate case generates a warning
4. Yan looks at fixing up post_populate case, but decides she doesn't like it
(the quoted blurb) and develops the alternative in this series
> > > the following reasons:
> > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > - While retrying with srcu unlock and lock can workaround the
> > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > and tdx_handle_ept_violation() faulting with different memslot layouts.
>
> This behavior has existed since pretty much the beginning of KVM time.
>
The non-tdx issues today are related to the pre-fault memory stuff, which
doesn't enter the guest for a different reason.
> TDX is the
> oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
>
> Arguably, _TDX_ is buggy by not providing this behavior.
Long term the zero step issue needs to be resolved. So yea we should avoid
building around it for anything not-critical (like this). But I don't see why
prefault is not in the same category of oddness.
>
> > I'm not sure why the second one is really a problem. For the first one I think
> > that path could just take the scru lock in the proper order with kvm-
> > > slots_lock?
>
> Acquiring SRCU inside slots_lock should be fine. The reserve order would be
> problematic, as KVM synchronizes SRCU while holding slots_lock.
>
> That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> hack. What about something like this?
As Reinette noticed, it doesn't pass the included test. It looks like
synchronize_srcu_expedited() is not waiting any longer, but there is some other
RET_PF_RETRY loop not caused by KVM_MEMSLOT_INVALID. Must be some side effect.
Will debug further.
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 13:33 ` Sean Christopherson
2025-05-19 15:05 ` Reinette Chatre
2025-05-19 16:12 ` Edgecombe, Rick P
@ 2025-05-20 5:27 ` Yan Zhao
2 siblings, 0 replies; 18+ messages in thread
From: Yan Zhao @ 2025-05-20 5:27 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm
On Mon, May 19, 2025 at 06:33:14AM -0700, Sean Christopherson wrote:
> On Mon, May 19, 2025, Yan Zhao wrote:
> > Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
> > kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
> > This helps prevent deadlocks when a memslot is removed during pre-faulting
> > GPAs in the memslot or local retry of faulting private pages in TDX.
> >
> > Take pre-faulting as an example.
> >
> > During ioctl KVM_PRE_FAULT_MEMORY, kvm->srcu is acquired around the
> > pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
> > further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
> > if the return value is RET_PF_RETRY.
> >
> > If a memslot is deleted during the ioctl KVM_PRE_FAULT_MEMORY, after
> > kvm_invalidate_memslot() marks a slot as invalid and makes it visible via
> > rcu_assign_pointer() in kvm_swap_active_memslots(), kvm_mmu_do_page_fault()
> > may encounter an invalid slot and return RET_PF_RETRY. Consequently,
> > kvm_tdp_map_page() will then retry without releasing the srcu lock.
> > Meanwhile, synchronize_srcu_expedited() in kvm_swap_active_memslots() is
> > blocked, waiting for kvm_vcpu_pre_fault_memory() to release the srcu lock,
> > leading to a deadlock.
>
> Probably worth calling out that KVM will respond to signals, i.e. there's no risk
> to the host kernel.
Yes. The user can stop at any time by killing the process.
No risk to the host kernel.
>
> > "slot deleting" thread "prefault" thread
> > ----------------------------- ----------------------
> > srcu_read_lock();
> > (A)
> > invalid_slot->flags |= KVM_MEMSLOT_INVALID;
> > rcu_assign_pointer();
> >
> > kvm_tdp_map_page();
> > (B)
> > do {
> > r = kvm_mmu_do_page_fault();
> >
> > (C) synchronize_srcu_expedited();
> >
> > } while (r == RET_PF_RETRY);
> >
> > (D) srcu_read_unlock();
> >
> > As shown in diagram, (C) is waiting for (D). However, (B) continuously
> > finds an invalid slot before (C) completes, causing (B) to retry and
> > preventing (D) from being invoked.
> >
> > The local retry code in TDX's EPT violation handler faces a similar issue,
> > where a deadlock can occur when faulting a private GFN in a slot that is
> > concurrently being removed.
> >
> > To resolve the deadlock, introduce a new return value
> > RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
> > RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
> > invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
> > tdx_handle_ept_violation(), allowing the srcu to be released and enabling
> > slot removal to proceed.
> >
> > As all callers of kvm_tdp_map_page(), i.e.,
> > kvm_arch_vcpu_pre_fault_memory() or tdx_gmem_post_populate(), are in
> > pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
> > to return -ENOENT in kvm_tdp_map_page() to enable userspace to be aware of
> > the slot removal.
>
> Userspace should already be "aware" of the slot removal.
Hmm, I mean let userspace know there's an error due to faulting into a slot
under removal.
> > Returning RET_PF_RETRY_INVALID_SLOT in kvm_mmu_do_page_fault() does not
> > affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as their
> > callers either only check if the return value > 0 to re-enter vCPU for
> > retry or do not check return value.
> >
> > Reported-by: Reinette Chatre <reinette.chatre@intel.com>
>
> Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> kicking vCPUs out of KVM?
Not a real VMM. It's solely for stress testing (or maybe better call it negative
testing).
For TDX, now the only memslot removal case is for memory hot-unplug, which
notifies guest in advance.
> Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> KVM can simply drop and reacquire SRCU in the relevant paths.
Yes, we can switch to SRCU if you prefer that path.
Previously, I thought it's redundant to acquire SRCU in kvm_gmem_populate() as it
already holds slots_lock. I also assumed you would prefer ioctl
KVM_PRE_FAULT_MEMORY to fault under a single memslot layout, because
kvm_vcpu_pre_fault_memory() acquires SRCU for the entire range.
Otherwise, could we acquire the SRCU for each kvm_mmu_do_page_fault()?
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..15e98202868a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4859,6 +4859,14 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
return -EOPNOTSUPP;
do {
+ /*
+ * reload is efficient when called repeatedly, so we can do it on
+ * every iteration.
+ */
+ r = kvm_mmu_reload(vcpu);
+ if (unlikely(r))
+ return r;
+
if (signal_pending(current))
return -EINTR;
@@ -4866,7 +4874,10 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
return -EIO;
cond_resched();
+
+ kvm_vcpu_srcu_read_lock(vcpu);
r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+ kvm_vcpu_srcu_read_unlock(vcpu);
} while (r == RET_PF_RETRY);
if (r < 0)
@@ -4902,14 +4913,6 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
if (!vcpu->kvm->arch.pre_fault_allowed)
return -EOPNOTSUPP;
- /*
- * reload is efficient when called repeatedly, so we can do it on
- * every iteration.
- */
- r = kvm_mmu_reload(vcpu);
- if (r)
- return r;
-
if (kvm_arch_has_private_mem(vcpu->kvm) &&
kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
error_code |= PFERR_PRIVATE_ACCESS;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..b1f20c7fd17d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1920,7 +1920,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
break;
}
+ kvm_vcpu_srcu_read_unlock(vcpu);
cond_resched();
+ kvm_vcpu_srcu_read_lock(vcpu);
}
return ret;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b24db92e98f3..c502105905af 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
struct kvm_pre_fault_memory *range)
{
- int idx;
long r;
u64 full_size;
@@ -4279,7 +4278,6 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
return -EINVAL;
vcpu_load(vcpu);
- idx = srcu_read_lock(&vcpu->kvm->srcu);
full_size = range->size;
do {
@@ -4300,7 +4298,6 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
cond_resched();
} while (range->size);
- srcu_read_unlock(&vcpu->kvm->srcu, idx);
vcpu_put(vcpu);
/* Return success if at least one page was mapped successfully. */
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 17:06 ` Sean Christopherson
2025-05-19 17:49 ` Reinette Chatre
2025-05-19 20:14 ` Edgecombe, Rick P
@ 2025-05-20 5:33 ` Yan Zhao
2025-05-20 16:13 ` Sean Christopherson
2 siblings, 1 reply; 18+ messages in thread
From: Yan Zhao @ 2025-05-20 5:33 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > kicking vCPUs out of KVM?
> > >
> > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > KVM can simply drop and reacquire SRCU in the relevant paths.
> >
> > During the initial debugging and kicking around stage, this is the first
> > direction we looked. But kvm_gmem_populate() doesn't have scru locked, so then
> > kvm_tdp_map_page() tries to unlock without it being held. (although that version
> > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > came up with the version in this series, which we held review on for the list:
>
> Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
>
> > > However, upon further consideration, I am reluctant to implement this fix for
>
> Which fix?
>
> > > the following reasons:
> > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > - While retrying with srcu unlock and lock can workaround the
> > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > and tdx_handle_ept_violation() faulting with different memslot layouts.
>
> This behavior has existed since pretty much the beginning of KVM time. TDX is the
> oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
>
> Arguably, _TDX_ is buggy by not providing this behavior.
>
> > I'm not sure why the second one is really a problem. For the first one I think
> > that path could just take the SRCU lock in the proper order with
> > kvm->slots_lock?
>
> Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> problematic, as KVM synchronizes SRCU while holding slots_lock.
>
> That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> hack. What about something like this?
So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
Generally I think it's good, except that it missed a kvm_mmu_reload() (please
refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in
tdx_handle_ept_violation() path (So, Reinette reported it failed the TDX stress
tests at [1]).
As this is not a bug hit by a real VMM, could I play with this fix for a while and
come back to you later?
>
> ---
> arch/x86/kvm/mmu.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 49 +++++++++++++++++++++++++++---------------
> arch/x86/kvm/vmx/tdx.c | 7 ++++--
> virt/kvm/kvm_main.c | 5 ++---
> 4 files changed, 41 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..0fc68f0fe80e 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -259,6 +259,8 @@ extern bool tdp_mmu_enabled;
>
> bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> + u8 *level);
>
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..4f16fe95173c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4851,24 +4851,15 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> {
> int r;
>
> - /*
> - * Restrict to TDP page fault, since that's the only case where the MMU
> - * is indexed by GPA.
> - */
> - if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> - return -EOPNOTSUPP;
> + if (signal_pending(current))
> + return -EINTR;
>
> - do {
> - if (signal_pending(current))
> - return -EINTR;
> + if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> + return -EIO;
>
> - if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> - return -EIO;
> -
> - cond_resched();
> - r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> - } while (r == RET_PF_RETRY);
> + cond_resched();
>
> + r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> if (r < 0)
> return r;
>
> @@ -4878,10 +4869,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> case RET_PF_WRITE_PROTECTED:
> return 0;
>
> + case RET_PF_RETRY:
> + return -EAGAIN;
> +
> case RET_PF_EMULATE:
> return -ENOENT;
>
> - case RET_PF_RETRY:
> case RET_PF_CONTINUE:
> case RET_PF_INVALID:
> default:
> @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> }
> EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> +{
> + int r;
> +
> + /*
> + * Restrict to TDP page fault, since that's the only case where the MMU
> + * is indexed by GPA.
> + */
> + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> + return -EOPNOTSUPP;
> +
> + for (;;) {
> + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> + if (r != -EAGAIN)
> + break;
> +
> + /* Comment goes here. */
> + kvm_vcpu_srcu_read_unlock(vcpu);
> + kvm_vcpu_srcu_read_lock(vcpu);
For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
the memslot removal succeeds after releasing the SRCU, then the old root is
stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
from being always true.
[1] https://lore.kernel.org/all/dcdb3551-d95c-4724-84ee-d0a6611ca1bf@intel.com/
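A minimal sketch of what I mean, purely for illustration (built on the quoted
kvm_tdp_prefault_page() loop; the kvm_mmu_reload() call is the piece I think
is missing):

	for (;;) {
		r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
		if (r != -EAGAIN)
			break;

		/* Let the memslot update (SRCU synchronization) complete. */
		kvm_vcpu_srcu_read_unlock(vcpu);
		kvm_vcpu_srcu_read_lock(vcpu);

		/*
		 * The completed update may have obsoleted the current root;
		 * reload it so is_page_fault_stale() doesn't fail every retry.
		 */
		r = kvm_mmu_reload(vcpu);
		if (r)
			break;
	}
	return r;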
> + }
> +}
> +
> long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> {
> @@ -4918,7 +4933,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> * Shadow paging uses GVA for kvm page fault, so restrict to
> * two-dimensional paging.
> */
> - r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
> + r = kvm_tdp_prefault_page(vcpu, range->gpa, error_code, &level);
> if (r < 0)
> return r;
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..1a232562080d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3075,8 +3075,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> if (ret != 1)
> return -ENOMEM;
>
> - ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> - if (ret < 0)
> + do {
> + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> + } while (ret == -EAGAIN);
> +
> + if (ret)
> goto out;
>
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b24db92e98f3..21a3fa7476dd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> {
> - int idx;
> long r;
> u64 full_size;
>
> @@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> return -EINVAL;
>
> vcpu_load(vcpu);
> - idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>
> full_size = range->size;
> do {
> @@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> cond_resched();
> } while (range->size);
>
> - srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
> vcpu_put(vcpu);
>
> /* Return success if at least one page was mapped successfully. */
>
> base-commit: 12ca5c63556bbfcd77fe890fcdd1cd1adfb31fdd
> --
>
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-20 5:33 ` Yan Zhao
@ 2025-05-20 16:13 ` Sean Christopherson
2025-05-21 1:45 ` Yan Zhao
0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2025-05-20 16:13 UTC (permalink / raw)
To: Yan Zhao
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Tue, May 20, 2025, Yan Zhao wrote:
> On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> > On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > > kicking vCPUs out of KVM?
> > > >
> > > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > > KVM can simply drop and reacquire SRCU in the relevant paths.
> > >
> > > During the initial debugging and kicking around stage, this is the first
> > > direction we looked. But kvm_gmem_populate() doesn't have SRCU locked, so then
> > > kvm_tdp_map_page() tries to unlock without it being held. (although that version
> > > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > > came up with the version in this series, which we held review on for the list:
> >
> > Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
> >
> > > > However, upon further consideration, I am reluctant to implement this fix for
> >
> > Which fix?
> >
> > > > the following reasons:
> > > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > > - While retrying with srcu unlock and lock can workaround the
> > > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > > and tdx_handle_ept_violation() faulting with different memslot layouts.
> >
> > This behavior has existed since pretty much the beginning of KVM time. TDX is the
> > oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> > RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> > RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
> >
> > Arguably, _TDX_ is buggy by not providing this behavior.
> >
> > > I'm not sure why the second one is really a problem. For the first one I think
> > > that path could just take the SRCU lock in the proper order with
> > > kvm->slots_lock?
> >
> > > Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> > problematic, as KVM synchronizes SRCU while holding slots_lock.
> >
> > That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> > hack. What about something like this?
> So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
> Yes, ideally. Acquiring SRCU wouldn't be the end of the world, but I don't love
the idea of taking a lock just so that the lock can be conditionally dropped in
a common flow. It's not a deal breaker (I wouldn't be surprised if there's at
least one path in KVM that acquires SRCU purely because of such behavior), but
separating kvm_tdp_prefault_page() from kvm_tdp_map_page()
> Generally I think it's good, except that it missed a kvm_mmu_reload() (please
> refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in
> tdx_handle_ept_violation() path (So, Reinette reported it failed the TDX stress
> tests at [1]).
> > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > }
> > EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> >
> > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > +{
> > + int r;
> > +
> > + /*
> > + * Restrict to TDP page fault, since that's the only case where the MMU
> > + * is indexed by GPA.
> > + */
> > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > + return -EOPNOTSUPP;
> > +
> > + for (;;) {
> > + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > + if (r != -EAGAIN)
> > + break;
> > +
> > + /* Comment goes here. */
> > + kvm_vcpu_srcu_read_unlock(vcpu);
> > + kvm_vcpu_srcu_read_lock(vcpu);
> For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> the memslot removal succeeds after releasing the SRCU, then the old root is
> stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> from being always true.
That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
otherwise kvm_mmu_reload() will do nothing.
Thinking about this scenario more, I don't mind punting this problem to userspace
for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
breaking userspace, should provide consistent behavior regardless of VM type, and
KVM needs the "complex" code irrespective of this particular scenario.
I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
because then KVM is effectively providing the same overall behavior as KVM_RUN,
just without slightly different roles and responsibilities between KVM and
userspace. And -ENOENT is also flat out wrong for the case where a memslot is
being moved, but the new base+size still contains the to-be-faulted GPA.
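A made-up example of that case (hypothetical numbers and a hypothetical
"backing" buffer, using the KVM_SET_USER_MEMORY_REGION2 layout):

	/* Slot originally at GPA 0x100000 is moved to 0x200000, same 4MiB size. */
	struct kvm_userspace_memory_region2 region = {
		.slot            = 1,
		.guest_phys_addr = 0x200000,	/* was 0x100000 before the move */
		.memory_size     = 0x400000,
		.userspace_addr  = (__u64)backing,
	};

	/*
	 * A prefault of GPA 0x300000 falls inside both the old and the new
	 * range, so "no memslot" (-ENOENT) would be the wrong answer.
	 */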
I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
encountered during prefault (as identified by fault->prefetch).
For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
i.e. punting to userspace is not a viable option. But that path also has options
that aren't available to prefaulting. E.g. it could (and probably should) break
early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
called out, the zero-step mess really needs to be solved in a more robust fashion.
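A rough sketch of the "break on any pending request" idea, for illustration
only (separate from the prefault patch below):

	while (1) {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);

		if (ret != RET_PF_RETRY || !local_retry)
			break;

		/* Bail on any pending request instead of special casing KVM_REQ_VM_DEAD. */
		if (kvm_request_pending(vcpu))
			break;

		cond_resched();
	}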
So this?
---
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 20 May 2025 07:55:32 -0700
Subject: [PATCH] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves
memslot during prefault
Return -EAGAIN if userspace attempts to delete or move a memslot while also
prefaulting memory for that same memslot, i.e. force userspace to retry
instead of trying to handle the scenario entirely within KVM. Unlike
KVM_RUN, which needs to handle the scenario entirely within KVM because
userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
return -EAGAIN without breaking userspace as this scenario can't have ever
worked (and there's no sane use case for prefaulting to a memslot that's
being deleted/moved).
And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
forward progress. E.g. to handle such a scenario, KVM would need to drop
and reacquire SRCU to break the deadlock between the memslot update
(synchronizes SRCU) and the prefault (waits for the memslot update to
complete).
However, dropping SRCU creates more problems, as completing the memslot
update will bump the memslot generation, which in turn will invalidate the
MMU root. To handle that, prefaulting would need to handle pending
KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
mapping each individual page.
I.e. to fully handle this scenario, prefaulting would eventually need to
look a lot like vcpu_enter_guest(). Given that there's no reasonable use
case and practically zero risk of breaking userspace, punt the problem to
userspace and avoid adding unnecessary complexity to the prefault path.
Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
held for the entire duration of populate(), i.e. any memslot modifications
will be fully serialized against TDX's flavor of prefaulting.
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
Debugged-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a284dce227a0..7ae56a3c7607 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4595,10 +4595,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
/*
* Retry the page fault if the gfn hit a memslot that is being deleted
* or moved. This ensures any existing SPTEs for the old memslot will
- * be zapped before KVM inserts a new MMIO SPTE for the gfn.
+ * be zapped before KVM inserts a new MMIO SPTE for the gfn. Punt the
+ * error to userspace if this is a prefault, as KVM's prefaulting ABI
+ * doesn't need to provide the same forward progress guarantees as KVM_RUN.
*/
- if (slot->flags & KVM_MEMSLOT_INVALID)
+ if (slot->flags & KVM_MEMSLOT_INVALID) {
+ if (fault->prefetch)
+ return -EAGAIN;
+
return RET_PF_RETRY;
+ }
if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
/*
base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
--
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-20 16:13 ` Sean Christopherson
@ 2025-05-21 1:45 ` Yan Zhao
2025-05-21 15:45 ` Sean Christopherson
0 siblings, 1 reply; 18+ messages in thread
From: Yan Zhao @ 2025-05-21 1:45 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> On Tue, May 20, 2025, Yan Zhao wrote:
> > On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> > > On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > > > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > > > kicking vCPUs out of KVM?
> > > > >
> > > > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > > > KVM can simply drop and reacquire SRCU in the relevant paths.
> > > >
> > > > During the initial debugging and kicking around stage, this is the first
> > > > direction we looked. But kvm_gmem_populate() doesn't have SRCU locked, so then
> > > > kvm_tdp_map_page() tries to unlock without it being held. (although that version
> > > > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > > > came up with the version in this series, which we held review on for the list:
> > >
> > > Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
> > >
> > > > > However, upon further consideration, I am reluctant to implement this fix for
> > >
> > > Which fix?
> > >
> > > > > the following reasons:
> > > > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > > > - While retrying with srcu unlock and lock can workaround the
> > > > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > > > and tdx_handle_ept_violation() faulting with different memslot layouts.
> > >
> > > This behavior has existed since pretty much the beginning of KVM time. TDX is the
> > > oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> > > RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> > > RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
> > >
> > > Arguably, _TDX_ is buggy by not providing this behavior.
> > >
> > > > I'm not sure why the second one is really a problem. For the first one I think
> > > > that path could just take the SRCU lock in the proper order with
> > > > kvm->slots_lock?
> > >
> > > Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> > > problematic, as KVM synchronizes SRCU while holding slots_lock.
> > >
> > > That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> > > hack. What about something like this?
> > So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
>
> Yes, ideally. Acquiring SRCU wouldn't be the end of the world, but I don't love
> the idea of taking a lock just so that the lock can be conditionally dropped in
> a common flow. It's not a deal breaker (I wouldn't be surprised if there's at
> least one path in KVM that acquires SRCU purely because of such behavior), but
> separating kvm_tdp_prefault_page() from kvm_tdp_map_page()
>
> > Generally I think it's good, except that it missed a kvm_mmu_reload() (please
> > refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in
> > tdx_handle_ept_violation() path (So, Reinette reported it failed the TDX stress
> > tests at [1]).
>
> > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > >
> > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > +{
> > > + int r;
> > > +
> > > + /*
> > > + * Restrict to TDP page fault, since that's the only case where the MMU
> > > + * is indexed by GPA.
> > > + */
> > > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > + return -EOPNOTSUPP;
> > > +
> > > + for (;;) {
> > > + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > + if (r != -EAGAIN)
> > > + break;
> > > +
> > > + /* Comment goes here. */
> > > + kvm_vcpu_srcu_read_unlock(vcpu);
> > > + kvm_vcpu_srcu_read_lock(vcpu);
> > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > the memslot removal succeeds after releasing the SRCU, then the old root is
> > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > from being always true.
>
> That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> otherwise kvm_mmu_reload() will do nothing.
In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
kvm_mmu_reload().
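Roughly (a from-memory sketch of what kvm_mmu_reload() looks like after that
commit, not a verbatim copy):

	static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
	{
		if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
			kvm_mmu_free_obsolete_roots(vcpu);

		if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
			return 0;

		return kvm_mmu_load(vcpu);
	}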
> Thinking about this scenario more, I don't mind punting this problem to userspace
> for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
> breaking userspace, should provide consistent behavior regardless of VM type, and
> KVM needs the "complex" code irrespective of this particular scenario.
>
> I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> because then KVM is effectively providing the same overall behavior as KVM_RUN,
> just without slightly different roles and responsibilities between KVM and
> userspace. And -ENOENT is also flat out wrong for the case where a memslot is
> being moved, but the new base+size still contains the to-be-faulted GPA.
>
> I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> encountered during prefault (as identified by fault->prefetch).
>
> For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> i.e. punting to userspace is not a viable option. But that path also has options
> that aren't available to prefaulting. E.g. it could (and probably should) break
> early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
Hmm, for TDX, there's no request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS for slot
removal. (see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control
memslot zap behavior").
> would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
> called out, the zero-step mess really needs to be solved in a more robust fashion.
>
> So this?
Looks good to me for non-TDX side.
For TDX, could we provide below fix based on your change?
For a private fault, -EFAULT will be returned to userspace on the retry anyway
once the slot is completely removed, unlike non-private faults, which go down
the emulate path after the retry.
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4602,6 +4602,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
if (fault->prefetch)
return -EAGAIN;
+ if (fault->is_private) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
return RET_PF_RETRY;
}
And would you mind if I included your patch in my next version? I can update the
related selftests as well.
Thanks
Yan
> ---
> From: Sean Christopherson <seanjc@google.com>
> Date: Tue, 20 May 2025 07:55:32 -0700
> Subject: [PATCH] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves
> memslot during prefault
>
> Return -EAGAIN if userspace attempts to delete or move a memslot while also
> prefaulting memory for that same memslot, i.e. force userspace to retry
> instead of trying to handle the scenario entirely within KVM. Unlike
> KVM_RUN, which needs to handle the scenario entirely within KVM because
> userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
> return -EAGAIN without breaking userspace as this scenario can't have ever
> worked (and there's no sane use case for prefaulting to a memslot that's
> being deleted/moved).
>
> And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
> forward progress. E.g. to handle such a scenario, KVM would need to drop
> and reacquire SRCU to break the deadlock between the memslot update
> (synchronizes SRCU) and the prefault (waits for the memslot update to
> complete).
>
> However, dropping SRCU creates more problems, as completing the memslot
> update will bump the memslot generation, which in turn will invalidate the
> MMU root. To handle that, prefaulting would need to handle pending
> KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
> mapping each individual page.
>
> I.e. to fully handle this scenario, prefaulting would eventually need to
> look a lot like vcpu_enter_guest(). Given that there's no reasonable use
> case and practically zero risk of breaking userspace, punt the problem to
> userspace and avoid adding unnecessary complexity to the prefault path.
>
> Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
> held for the entire duration of populate(), i.e. any memslot modifications
> will be fully serialized against TDX's flavor of prefaulting.
>
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
> Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
> Debugged-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a284dce227a0..7ae56a3c7607 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4595,10 +4595,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
> /*
> * Retry the page fault if the gfn hit a memslot that is being deleted
> * or moved. This ensures any existing SPTEs for the old memslot will
> - * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> + * be zapped before KVM inserts a new MMIO SPTE for the gfn. Punt the
> + * error to userspace if this is a prefault, as KVM's prefaulting ABI
> + * doesn't need to provide the same forward progress guarantees as KVM_RUN.
> */
> - if (slot->flags & KVM_MEMSLOT_INVALID)
> + if (slot->flags & KVM_MEMSLOT_INVALID) {
> + if (fault->prefetch)
> + return -EAGAIN;
> +
> return RET_PF_RETRY;
> + }
>
> if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
> /*
>
> base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
> --
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-21 1:45 ` Yan Zhao
@ 2025-05-21 15:45 ` Sean Christopherson
2025-05-22 0:40 ` Yan Zhao
0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2025-05-21 15:45 UTC (permalink / raw)
To: Yan Zhao
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Wed, May 21, 2025, Yan Zhao wrote:
> On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> > > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > > }
> > > > EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > > >
> > > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > > +{
> > > > + int r;
> > > > +
> > > > + /*
> > > > + * Restrict to TDP page fault, since that's the only case where the MMU
> > > > + * is indexed by GPA.
> > > > + */
> > > > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > > + return -EOPNOTSUPP;
> > > > +
> > > > + for (;;) {
> > > > + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > > + if (r != -EAGAIN)
> > > > + break;
> > > > +
> > > > + /* Comment goes here. */
> > > > + kvm_vcpu_srcu_read_unlock(vcpu);
> > > > + kvm_vcpu_srcu_read_lock(vcpu);
> > > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > > the memslot removal succeeds after releasing the SRCU, then the old root is
> > > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > > from being always true.
> >
> > That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> > otherwise kvm_mmu_reload() will do nothing.
> In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
> kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
> kvm_mmu_reload().
Oh, right! I completely forgot about that. Hmm, that reduces the complexity a
little bit, but I'm still leaning towards punting -EAGAIN to userspace.
> > Thinking about this scenario more, I don't mind punting this problem to userspace
> > for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> > because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> > to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
> > breaking userspace, should provide consistent behavior regardless of VM type, and
> > KVM needs the "complex" code irrespective of this particular scenario.
> >
> > I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> > because then KVM is effectively providing the same overall behavior as KVM_RUN,
> > just without slightly different roles and responsibilities between KVM and
> > userspace. And -ENOENT is also flat out wrong for the case where a memslot is
> > being moved, but the new base+size still contains the to-be-faulted GPA.
> >
> > I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> > into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> > encountered during prefault (as identified by fault->prefetch).
> >
> > For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> > i.e. punting to userspace is not a viable option. But that path also has options
> > that aren't available to prefaulting. E.g. it could (and probably should) break
> > early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
> Hmm, for TDX, there's no request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS for slot
> removal. (see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control
> memslot zap behavior").
>
> > would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
> > called out, the zero-step mess really needs to be solved in a more robust fashion.
> >
> > So this?
> Looks good to me for non-TDX side.
>
> For TDX, could we provide below fix based on your change?
Hmm, I'd prefer not to, mainly because I don't want to special case things even
more in the common MMU code, e.g. I don't want to bleed the "no memslot == exit"
logic into multiple locations. And very strictly speaking, a memory fault exit
isn't guaranteed, as userspace could set a new memory region before the vCPU
retries the fault.
Returning -EAGAIN isn't an option because that would break userspace (e.g. our
VMM doesn't handle EAGAIN and supports SNP), and documenting the behavior would
be weird. For KVM_PRE_FAULT_MEMORY, KVM's documentation can simply state that
EAGAIN is returned when KVM encounters temporary resource contention and that userspace
should simply try again. It's an ABI change, but it's for a nascent ioctl() and a
scenario that won't be hit in practice, so I'm confident we can make the change
without breaking userspace.
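E.g. a userspace sketch (illustrative only; "vcpu_fd", "gpa" and "size" are
placeholders):

	struct kvm_pre_fault_memory range = {
		.gpa  = gpa,
		.size = size,
	};

	/* Retry KVM_PRE_FAULT_MEMORY whenever KVM reports transient contention. */
	do {
		ret = ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
	} while (ret < 0 && errno == EAGAIN);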
And again, this is an unfortunate side effect of zero-step; there's no such
restriction for SNP, and ideally the TDX zero-step pain will be solved and this
would also go away for TDX too, so I'm hesitant to bake this behavior into KVM's
ABI.
My best idea is to special case this in tdx_handle_ept_violation(). It's also
very gross, but at least the nastiness is limited to the zero-step mitigation
mess, and is co-located with the code that doesn't actually play nice with
RET_PF_RETRY. E.g.
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..ca47d08ae112 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1907,6 +1907,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
* handle retries locally in their EPT violation handlers.
*/
while (1) {
+ struct kvm_memory_slot *slot;
+
ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
if (ret != RET_PF_RETRY || !local_retry)
@@ -1920,6 +1922,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
break;
}
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
+ if (slot && slot->flags & KVM_MEMSLOT_INVALID)
+ break;
+
cond_resched();
}
return ret;
> For a private fault, -EFAULT will be returned to userspace on the retry anyway
> once the slot is completely removed, unlike non-private faults, which go down
> the emulate path after the retry.
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4602,6 +4602,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
> if (fault->prefetch)
> return -EAGAIN;
>
> + if (fault->is_private) {
> + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> + return -EFAULT;
> + }
> +
> return RET_PF_RETRY;
> }
>
>
> And would you mind if I included your patch in my next version? I can update the
> related selftests as well.
Yes, please do!
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-21 15:45 ` Sean Christopherson
@ 2025-05-22 0:40 ` Yan Zhao
0 siblings, 0 replies; 18+ messages in thread
From: Yan Zhao @ 2025-05-22 0:40 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Wed, May 21, 2025 at 08:45:21AM -0700, Sean Christopherson wrote:
> On Wed, May 21, 2025, Yan Zhao wrote:
> > On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> > > > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > > > >
> > > > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > > > +{
> > > > > + int r;
> > > > > +
> > > > > + /*
> > > > > + * Restrict to TDP page fault, since that's the only case where the MMU
> > > > > + * is indexed by GPA.
> > > > > + */
> > > > > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > > > + return -EOPNOTSUPP;
> > > > > +
> > > > > + for (;;) {
> > > > > + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > > > + if (r != -EAGAIN)
> > > > > + break;
> > > > > +
> > > > > + /* Comment goes here. */
> > > > > + kvm_vcpu_srcu_read_unlock(vcpu);
> > > > > + kvm_vcpu_srcu_read_lock(vcpu);
> > > > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > > > the memslot removal succeeds after releasing the SRCU, then the old root is
> > > > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > > > from being always true.
> > >
> > > That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> > > otherwise kvm_mmu_reload() will do nothing.
> > In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
> > kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
> > kvm_mmu_reload().
>
> Oh, right! I completely forgot about that. Hmm, that reduces the complexity a
> little bit, but I'm still leaning towards punting -EAGAIN to userspace.
>
> > > Thinking about this scenario more, I don't mind punting this problem to userspace
> > > for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> > > because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> > > to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
> > > breaking userspace, should provide consistent behavior regardless of VM type, and
> > > KVM needs the "complex" code irrespective of this particular scenario.
> > >
> > > I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> > > because then KVM is effectively providing the same overall behavior as KVM_RUN,
> > > just without slightly different roles and responsibilities between KVM and
> > > userspace. And -ENOENT is also flat out wrong for the case where a memslot is
> > > being moved, but the new base+size still contains the to-be-faulted GPA.
> > >
> > > I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> > > into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> > > encountered during prefault (as identified by fault->prefetch).
> > >
> > > For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> > > i.e. punting to userspace is not a viable option. But that path also has options
> > > that aren't available to prefaulting. E.g. it could (and probably should) break
> > > early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
> > Hmm, for TDX, there's no request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS for slot
> > removal. (see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control
> > memslot zap behavior").
> >
> > > would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
> > > called out, the zero-step mess really needs to be solved in a more robust fashion.
> > >
> > > So this?
> > Looks good to me for non-TDX side.
> >
> > For TDX, could we provide below fix based on your change?
>
> Hmm, I'd prefer not to, mainly because I don't want to special case things even
> more in the common MMU code, e.g. I don't want to bleed the "no memslot == exit"
> logic into multiple locations. And very strictly speaking, a memory fault exit
> isn't guaranteed, as userspace could set a new memory region before the vCPU
> retries the fault.
>
> Returning -EAGAIN isn't an option because that would break userspace (e.g. our
> VMM doesn't handle EAGAIN and supports SNP), and documenting the behavior would
> be weird. For KVM_PRE_FAULT_MEMORY, KVM's documentation can simply state that
> EAGAIN is returned KVM encounters temporary resource contention and that userspace
> should simply try again. It's an ABI change, but for a nascent ioctl() and a
> scenario that won't be hit in practice, so I'm confident we can make the change
> without breaking userspace.
>
> And again, this is an unfortunate side effect of zero-step; there's no such
> restriction for SNP, and ideally the TDX zero-step pain will be solved and this
> would also go away for TDX too, so I'm hesitant to bake this behavior into KVM's
> ABI.
>
> My best idea is to special case this in tdx_handle_ept_violation(). It's also
> very gross, but at least the nastiness is limited to the zero-step mitigation
> mess, and is co-located with the code that doesn't actually play nice with
> RET_PF_RETRY. E.g.
Thank you, Sean.
We'll go down this path.
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..ca47d08ae112 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1907,6 +1907,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> * handle retries locally in their EPT violation handlers.
> */
> while (1) {
> + struct kvm_memory_slot *slot;
> +
> ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
>
> if (ret != RET_PF_RETRY || !local_retry)
> @@ -1920,6 +1922,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> break;
> }
>
> + slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> + if (slot && slot->flags & KVM_MEMSLOT_INVALID)
> + break;
> +
> cond_resched();
> }
> return ret;
>
...
> > And would you mind if I included your patch in my next version? I can update the
> > related selftests as well.
>
> Yes, please do!
Thanks.
Thread overview: 18+ messages
2025-05-19 2:36 [PATCH 0/2] Introduce RET_PF_RETRY_INVALID_SLOT Yan Zhao
2025-05-19 2:37 ` [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot Yan Zhao
2025-05-19 13:33 ` Sean Christopherson
2025-05-19 15:05 ` Reinette Chatre
2025-05-19 15:22 ` Edgecombe, Rick P
2025-05-19 15:53 ` Sean Christopherson
2025-05-19 16:17 ` Edgecombe, Rick P
2025-05-19 16:12 ` Edgecombe, Rick P
2025-05-19 17:06 ` Sean Christopherson
2025-05-19 17:49 ` Reinette Chatre
2025-05-19 20:14 ` Edgecombe, Rick P
2025-05-20 5:33 ` Yan Zhao
2025-05-20 16:13 ` Sean Christopherson
2025-05-21 1:45 ` Yan Zhao
2025-05-21 15:45 ` Sean Christopherson
2025-05-22 0:40 ` Yan Zhao
2025-05-20 5:27 ` Yan Zhao
2025-05-19 2:38 ` [PATCH 2/2] KVM: selftests: Test prefault memory with concurrent memslot removal Yan Zhao