From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: pbonzini@redhat.com, reinette.chatre@intel.com,
rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org
Subject: Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
Date: Mon, 19 May 2025 06:33:14 -0700 [thread overview]
Message-ID: <aCsy-m_esVjy8Pey@google.com> (raw)
In-Reply-To: <20250519023737.30360-1-yan.y.zhao@intel.com>
On Mon, May 19, 2025, Yan Zhao wrote:
> Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
> kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
> This helps prevent deadlocks when a memslot is removed during pre-faulting
> GPAs in the memslot or local retry of faulting private pages in TDX.
>
> Take pre-faulting as an example.
>
> During ioctl KVM_PRE_FAULT_MEMORY, kvm->srcu is acquired around the
> pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
> further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
> if the return value is RET_PF_RETRY.
>
> If a memslot is deleted during the ioctl KVM_PRE_FAULT_MEMORY, after
> kvm_invalidate_memslot() marks a slot as invalid and makes it visible via
> rcu_assign_pointer() in kvm_swap_active_memslots(), kvm_mmu_do_page_fault()
> may encounter an invalid slot and return RET_PF_RETRY. Consequently,
> kvm_tdp_map_page() will then retry without releasing the srcu lock.
> Meanwhile, synchronize_srcu_expedited() in kvm_swap_active_memslots() is
> blocked, waiting for kvm_vcpu_pre_fault_memory() to release the srcu lock,
> leading to a deadlock.
Probably worth calling out that KVM will respond to signals, i.e. there's no risk
to the host kernel.
> "slot deleting" thread "prefault" thread
> ----------------------------- ----------------------
> srcu_read_lock();
> (A)
> invalid_slot->flags |= KVM_MEMSLOT_INVALID;
> rcu_assign_pointer();
>
> kvm_tdp_map_page();
> (B)
> do {
> r = kvm_mmu_do_page_fault();
>
> (C) synchronize_srcu_expedited();
>
> } while (r == RET_PF_RETRY);
>
> (D) srcu_read_unlock();
>
> As shown in diagram, (C) is waiting for (D). However, (B) continuously
> finds an invalid slot before (C) completes, causing (B) to retry and
> preventing (D) from being invoked.
>
> The local retry code in TDX's EPT violation handler faces a similar issue,
> where a deadlock can occur when faulting a private GFN in a slot that is
> concurrently being removed.
>
> To resolve the deadlock, introduce a new return value
> RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
> RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
> invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
> tdx_handle_ept_violation(), allowing the srcu to be released and enabling
> slot removal to proceed.
>
> As all callers of kvm_tdp_map_page(), i.e.,
> kvm_arch_vcpu_pre_fault_memory() or tdx_gmem_post_populate(), are in
> pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
> to return -ENOENT in kvm_tdp_map_page() to enable userspace to be aware of
> the slot removal.
Userspace should already be "aware" of the slot removal.
> Returning RET_PF_RETRY_INVALID_SLOT in kvm_mmu_do_page_fault() does not
> affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as their
> callers either only check if the return value > 0 to re-enter vCPU for
> retry or do not check return value.
>
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
kicking vCPUs out of KVM?
Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
KVM can simply drop and reacquire SRCU in the relevant paths.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..ceab756052eb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4866,7 +4866,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
return -EIO;
cond_resched();
+
r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+ if (r == RET_PF_RETRY) {
+ kvm_vcpu_srcu_read_unlock(vcpu);
+ kvm_vcpu_srcu_read_lock(vcpu);
+ }
} while (r == RET_PF_RETRY);
if (r < 0)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..e29966ce3ab7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1920,6 +1920,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
break;
}
+ kvm_vcpu_srcu_read_unlock(vcpu);
+ kvm_vcpu_srcu_read_lock(vcpu);
+
cond_resched();
}
return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b24db92e98f3..21a3fa7476dd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
struct kvm_pre_fault_memory *range)
{
- int idx;
long r;
u64 full_size;
@@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
return -EINVAL;
vcpu_load(vcpu);
- idx = srcu_read_lock(&vcpu->kvm->srcu);
+ kvm_vcpu_srcu_read_lock(vcpu);
full_size = range->size;
do {
@@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
cond_resched();
} while (range->size);
- srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ kvm_vcpu_srcu_read_unlock(vcpu);
vcpu_put(vcpu);
/* Return success if at least one page was mapped successfully. */
next prev parent reply other threads:[~2025-05-19 13:33 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-19 2:36 [PATCH 0/2] Introduce RET_PF_RETRY_INVALID_SLOT Yan Zhao
2025-05-19 2:37 ` [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot Yan Zhao
2025-05-19 13:33 ` Sean Christopherson [this message]
2025-05-19 15:05 ` Reinette Chatre
2025-05-19 15:22 ` Edgecombe, Rick P
2025-05-19 15:53 ` Sean Christopherson
2025-05-19 16:17 ` Edgecombe, Rick P
2025-05-19 16:12 ` Edgecombe, Rick P
2025-05-19 17:06 ` Sean Christopherson
2025-05-19 17:49 ` Reinette Chatre
2025-05-19 20:14 ` Edgecombe, Rick P
2025-05-20 5:33 ` Yan Zhao
2025-05-20 16:13 ` Sean Christopherson
2025-05-21 1:45 ` Yan Zhao
2025-05-21 15:45 ` Sean Christopherson
2025-05-22 0:40 ` Yan Zhao
2025-05-20 5:27 ` Yan Zhao
2025-05-19 2:38 ` [PATCH 2/2] KVM: selftests: Test prefault memory with concurrent memslot removal Yan Zhao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aCsy-m_esVjy8Pey@google.com \
--to=seanjc@google.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=reinette.chatre@intel.com \
--cc=rick.p.edgecombe@intel.com \
--cc=yan.y.zhao@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.