* [PATCH v2 0/3] KVM: Fix deadlock for invalid memslots
  From: Yan Zhao @ 2025-08-22 7:03 UTC
  To: pbonzini, seanjc
  Cc: reinette.chatre, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Hi,

This series addresses the deadlock encountered with invalid memslots during
prefaulting or TDX private EPT violations.

Patches 1-2 are the new fixes from Sean: patch 1 covers the prefaulting case,
patch 2 the TDX private EPT violation case. Patch 3 updates the selftest for
prefaulting.

The ioctl KVM_PRE_FAULT_MEMORY is now expected to return EAGAIN instead of
ENOENT when prefaulting GFNs in an invalid memslot.

The TDX-specific selftest is not included in this series, though it has
passed locally.

v2:
- Use Sean's suggested fixes (patches 1-2).
- Update the selftest for the prefault case accordingly.
- Code base: kvm-x86-next-2025.08.21

v1: https://lore.kernel.org/all/20250519023613.30329-1-yan.y.zhao@intel.com

Sean Christopherson (2):
  KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves memslot during
    prefault
  KVM: TDX: Do not retry locally when the retry is caused by invalid
    memslot

Yan Zhao (1):
  KVM: selftests: Test prefault memory during concurrent memslot removal

 arch/x86/kvm/mmu/mmu.c                        | 10 +-
 arch/x86/kvm/vmx/tdx.c                        | 11 +++
 .../selftests/kvm/pre_fault_memory_test.c     | 94 +++++++++++++++----
 virt/kvm/kvm_main.c                           |  1 +
 4 files changed, 98 insertions(+), 18 deletions(-)

--
2.43.2
* [PATCH v2 1/3] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves memslot during prefault
  From: Yan Zhao @ 2025-08-22 7:03 UTC
  To: pbonzini, seanjc
  Cc: reinette.chatre, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

From: Sean Christopherson <seanjc@google.com>

Return -EAGAIN if userspace attempts to delete or move a memslot while also
prefaulting memory for that same memslot, i.e. force userspace to retry
instead of trying to handle the scenario entirely within KVM. Unlike
KVM_RUN, which needs to handle the scenario entirely within KVM because
userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
return -EAGAIN without breaking userspace as this scenario can't have ever
worked (and there's no sane use case for prefaulting to a memslot that's
being deleted/moved).

And also unlike KVM_RUN, the prefault path doesn't naturally gaurantee
forward progress. E.g. to handle such a scenario, KVM would need to drop
and reacquire SRCU to break the deadlock between the memslot update
(synchronizes SRCU) and the prefault (waits for the memslot update to
complete).

However, dropping SRCU creates more problems, as completing the memslot
update will bump the memslot generation, which in turn will invalidate the
MMU root. To handle that, prefaulting would need to handle pending
KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
mapping each individual.

I.e. to fully handle this scenario, prefaulting would eventually need to
look a lot like vcpu_enter_guest(). Given that there's no reasonable use
case and practically zero risk of breaking userspace, punt the problem to
userspace and avoid adding unnecessary complexity to the prefualt path.

Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
held for the entire duration of populate(), i.e. any memslot modifications
will be fully serialized against TDX's flavor of prefaulting.

Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
Debugged-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 92ff15969a36..f31fad33c423 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4653,10 +4653,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
 	 * or moved. This ensures any existing SPTEs for the old memslot will
-	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
+	 * be zapped before KVM inserts a new MMIO SPTE for the gfn. Punt the
+	 * error to userspace if this is a prefault, as KVM's prefaulting ABI
+	 * doesn't need provide the same forward progress guarantees as KVM_RUN.
 	 */
-	if (slot->flags & KVM_MEMSLOT_INVALID)
+	if (slot->flags & KVM_MEMSLOT_INVALID) {
+		if (fault->prefetch)
+			return -EAGAIN;
+
+		return RET_PF_RETRY;
+	}

 	if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
 		/*
--
2.43.2
* Re: [PATCH v2 1/3] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves memslot during prefault
  From: Binbin Wu @ 2025-09-09 2:46 UTC
  To: Yan Zhao, seanjc
  Cc: pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On 8/22/2025 3:03 PM, Yan Zhao wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Return -EAGAIN if userspace attempts to delete or move a memslot while also
> prefaulting memory for that same memslot, i.e. force userspace to retry
> instead of trying to handle the scenario entirely within KVM. Unlike
> KVM_RUN, which needs to handle the scenario entirely within KVM because
> userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
> return -EAGAIN without breaking userspace as this scenario can't have ever
> worked (and there's no sane use case for prefaulting to a memslot that's
> being deleted/moved).
>
> And also unlike KVM_RUN, the prefault path doesn't naturally gaurantee

gaurantee -> guarantee

> forward progress. E.g. to handle such a scenario, KVM would need to drop
> and reacquire SRCU to break the deadlock between the memslot update
> (synchronizes SRCU) and the prefault (waits for the memslot update to
> complete).
>
> However, dropping SRCU creates more problems, as completing the memslot
> update will bump the memslot generation, which in turn will invalidate the
> MMU root. To handle that, prefaulting would need to handle pending
> KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
> mapping each individual.
>
> I.e. to fully handle this scenario, prefaulting would eventually need to
> look a lot like vcpu_enter_guest(). Given that there's no reasonable use
> case and practically zero risk of breaking userspace, punt the problem to
> userspace and avoid adding unnecessary complexity to the prefualt path.

prefualt -> prefault

>
> Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
> held for the entire duration of populate(), i.e. any memslot modifications
> will be fully serialized against TDX's flavor of prefaulting.
>
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
> Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
> Debugged-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Two typos above. Otherwise,

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>  arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 92ff15969a36..f31fad33c423 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4653,10 +4653,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
>  	 * or moved. This ensures any existing SPTEs for the old memslot will
> -	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> +	 * be zapped before KVM inserts a new MMIO SPTE for the gfn. Punt the
> +	 * error to userspace if this is a prefault, as KVM's prefaulting ABI
> +	 * doesn't need provide the same forward progress guarantees as KVM_RUN.
>  	 */
> -	if (slot->flags & KVM_MEMSLOT_INVALID)
> +	if (slot->flags & KVM_MEMSLOT_INVALID) {
> +		if (fault->prefetch)
> +			return -EAGAIN;
> +
> +		return RET_PF_RETRY;
> +	}
>
>  	if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
>  		/*
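For VMM authors following along: with patch 1 applied, KVM_PRE_FAULT_MEMORY
can fail with EAGAIN while a memslot covering the range is being deleted or
moved, so the ioctl's caller needs to handle it alongside EINTR. A minimal
userspace sketch, assuming a vCPU fd and that the conflicting memslot update
eventually completes (the helper name and structure are illustrative, not
part of KVM's ABI):

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /*
     * Hypothetical helper (not KVM ABI): prefault [gpa, gpa + size) on a
     * vCPU fd, retrying transient errors.  KVM advances range.gpa and
     * shrinks range.size as it makes progress, so each retry resumes where
     * the previous attempt stopped.  The loop never re-issues the ioctl
     * with size == 0, which KVM rejects with EINVAL.
     */
    static int prefault_range(int vcpu_fd, __u64 gpa, __u64 size)
    {
            struct kvm_pre_fault_memory range = {
                    .gpa = gpa,
                    .size = size,
            };

            while (range.size) {
                    if (ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range) < 0) {
                            /* EAGAIN: a memslot in the range is being deleted/moved. */
                            if (errno == EAGAIN || errno == EINTR)
                                    continue;
                            return -errno;  /* e.g. ENOENT: no memslot backs range.gpa */
                    }
            }
            return 0;
    }

A real VMM would likely bound the EAGAIN retries or coordinate with whichever
thread owns the memslot update, since spinning here while the slot stays
invalid only burns CPU.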
* [PATCH v2 2/3] KVM: TDX: Do not retry locally when the retry is caused by invalid memslot
  From: Yan Zhao @ 2025-08-22 7:05 UTC
  To: pbonzini, seanjc
  Cc: reinette.chatre, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

From: Sean Christopherson <seanjc@google.com>

Avoid local retries within the TDX EPT violation handler if a retry is
triggered by faulting in an invalid memslot, indicating that the memslot is
undergoing a removal process.

This prevents the slot removal process from being blocked while waiting for
the VMExit handler to release the SRCU lock.

Opportunistically, export symbol kvm_vcpu_gfn_to_memslot() to allow for
per-vCPU acceleration of gfn_to_memslot translation.

[Yan: Wrote patch log, comment, fixed a minor error, function export]

Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 11 +++++++++++
 virt/kvm/kvm_main.c    |  1 +
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6784aaaced87..de2c4bb36069 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1992,6 +1992,11 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	 * blocked by TDs, false positives are inevitable i.e., KVM may re-enter
 	 * the guest even if the IRQ/NMI can't be delivered.
 	 *
+	 * Breaking out of the local retries if a retry is caused by faulting
+	 * in an invalid memslot (indicating the slot is under removal), so that
+	 * the slot removal will not be blocked due to waiting for releasing
+	 * SRCU lock in the VMExit handler.
+	 *
 	 * Note: even without breaking out of local retries, zero-step
 	 * mitigation may still occur due to
 	 * - invoking of TDH.VP.ENTER after KVM_EXIT_MEMORY_FAULT,
@@ -2002,6 +2007,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	 * handle retries locally in their EPT violation handlers.
 	 */
 	while (1) {
+		struct kvm_memory_slot *slot;
+
 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);

 		if (ret != RET_PF_RETRY || !local_retry)
@@ -2015,6 +2022,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 			break;
 		}

+		slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
+		if (slot && slot->flags & KVM_MEMSLOT_INVALID)
+			break;
+
 		cond_resched();
 	}
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6c07dd423458..f769d1dccc21 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2661,6 +2661,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn

 	return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);

 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 {
--
2.43.2
* Re: [PATCH v2 2/3] KVM: TDX: Do not retry locally when the retry is caused by invalid memslot
  From: Binbin Wu @ 2025-09-09 3:29 UTC
  To: Yan Zhao, seanjc
  Cc: pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On 8/22/2025 3:05 PM, Yan Zhao wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Avoid local retries within the TDX EPT violation handler if a retry is
> triggered by faulting in an invalid memslot, indicating that the memslot is
> undergoing a removal process.
>
> This prevents the slot removal process from being blocked while waiting for
> the VMExit handler to release the SRCU lock.
>
> Opportunistically, export symbol kvm_vcpu_gfn_to_memslot() to allow for
> per-vCPU acceleration of gfn_to_memslot translation.
>
> [Yan: Wrote patch log, comment, fixed a minor error, function export]
>
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
> Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 11 +++++++++++
>  virt/kvm/kvm_main.c    |  1 +
>  2 files changed, 12 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 6784aaaced87..de2c4bb36069 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1992,6 +1992,11 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>  	 * blocked by TDs, false positives are inevitable i.e., KVM may re-enter
>  	 * the guest even if the IRQ/NMI can't be delivered.
>  	 *
> +	 * Breaking out of the local retries if a retry is caused by faulting
> +	 * in an invalid memslot (indicating the slot is under removal), so that
> +	 * the slot removal will not be blocked due to waiting for releasing
> +	 * SRCU lock in the VMExit handler.
> +	 *
>  	 * Note: even without breaking out of local retries, zero-step
>  	 * mitigation may still occur due to
>  	 * - invoking of TDH.VP.ENTER after KVM_EXIT_MEMORY_FAULT,
> @@ -2002,6 +2007,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>  	 * handle retries locally in their EPT violation handlers.
>  	 */
>  	while (1) {
> +		struct kvm_memory_slot *slot;
> +
>  		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
>
>  		if (ret != RET_PF_RETRY || !local_retry)
> @@ -2015,6 +2022,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>  			break;
>  		}
>
> +		slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
> +		if (slot && slot->flags & KVM_MEMSLOT_INVALID)

The slot couldn't be NULL here, right?
So the checking for slot is to avoid de-referencing a NULL pointer in case of bug?

> +			break;
> +
>  		cond_resched();
>  	}
>  	return ret;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6c07dd423458..f769d1dccc21 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2661,6 +2661,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
>
>  	return NULL;
>  }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
>
>  bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
>  {
* Re: [PATCH v2 2/3] KVM: TDX: Do not retry locally when the retry is caused by invalid memslot
  From: Sean Christopherson @ 2025-09-09 14:18 UTC
  To: Binbin Wu
  Cc: Yan Zhao, pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On Tue, Sep 09, 2025, Binbin Wu wrote:
> On 8/22/2025 3:05 PM, Yan Zhao wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 6784aaaced87..de2c4bb36069 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1992,6 +1992,11 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >  	 * blocked by TDs, false positives are inevitable i.e., KVM may re-enter
> >  	 * the guest even if the IRQ/NMI can't be delivered.
> >  	 *
> > +	 * Breaking out of the local retries if a retry is caused by faulting
> > +	 * in an invalid memslot (indicating the slot is under removal), so that
> > +	 * the slot removal will not be blocked due to waiting for releasing
> > +	 * SRCU lock in the VMExit handler.
> > +	 *
> >  	 * Note: even without breaking out of local retries, zero-step
> >  	 * mitigation may still occur due to
> >  	 * - invoking of TDH.VP.ENTER after KVM_EXIT_MEMORY_FAULT,
> > @@ -2002,6 +2007,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >  	 * handle retries locally in their EPT violation handlers.
> >  	 */
> >  	while (1) {
> > +		struct kvm_memory_slot *slot;
> > +
> >  		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> >
> >  		if (ret != RET_PF_RETRY || !local_retry)
> > @@ -2015,6 +2022,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >  			break;
> >  		}
> >
> > +		slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
> > +		if (slot && slot->flags & KVM_MEMSLOT_INVALID)
>
> The slot couldn't be NULL here, right?

Uh, hmm.  It could be NULL.  If the memslot deletion starts concurrently with
the S-EPT violation, then the memslot could be transitioned to INVALID
(prepared for deletion) prior to the vCPU acquiring SRCU after the VM-Exit.
Memslot deletion could then assign to kvm->memslots with a NULL memslot.

  vCPU                              DELETE
  S-EPT Violation
                                    Set KVM_MEMSLOT_INVALID
                                    synchronize_srcu_expedited()
  Acquire SRCU
  __vmx_handle_ept_violation()
  RET_PF_RETRY due to INVALID
                                    Set memslot NULL
  kvm_vcpu_gfn_to_memslot()
* Re: [PATCH v2 2/3] KVM: TDX: Do not retry locally when the retry is caused by invalid memslot
  From: Binbin Wu @ 2025-09-10 2:02 UTC
  To: Sean Christopherson
  Cc: Yan Zhao, pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On 9/9/2025 10:18 PM, Sean Christopherson wrote:
> On Tue, Sep 09, 2025, Binbin Wu wrote:
>> On 8/22/2025 3:05 PM, Yan Zhao wrote:
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index 6784aaaced87..de2c4bb36069 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -1992,6 +1992,11 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>>>  	 * blocked by TDs, false positives are inevitable i.e., KVM may re-enter
>>>  	 * the guest even if the IRQ/NMI can't be delivered.
>>>  	 *
>>> +	 * Breaking out of the local retries if a retry is caused by faulting
>>> +	 * in an invalid memslot (indicating the slot is under removal), so that
>>> +	 * the slot removal will not be blocked due to waiting for releasing
>>> +	 * SRCU lock in the VMExit handler.
>>> +	 *
>>>  	 * Note: even without breaking out of local retries, zero-step
>>>  	 * mitigation may still occur due to
>>>  	 * - invoking of TDH.VP.ENTER after KVM_EXIT_MEMORY_FAULT,
>>> @@ -2002,6 +2007,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>>>  	 * handle retries locally in their EPT violation handlers.
>>>  	 */
>>>  	while (1) {
>>> +		struct kvm_memory_slot *slot;
>>> +
>>>  		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
>>>
>>>  		if (ret != RET_PF_RETRY || !local_retry)
>>> @@ -2015,6 +2022,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>>>  			break;
>>>  		}
>>>
>>> +		slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
>>> +		if (slot && slot->flags & KVM_MEMSLOT_INVALID)
>>
>> The slot couldn't be NULL here, right?
>
> Uh, hmm.  It could be NULL.  If the memslot deletion starts concurrently with
> the S-EPT violation, then the memslot could be transitioned to INVALID
> (prepared for deletion) prior to the vCPU acquiring SRCU after the VM-Exit.
> Memslot deletion could then assign to kvm->memslots with a NULL memslot.
>
>   vCPU                              DELETE
>   S-EPT Violation
>                                     Set KVM_MEMSLOT_INVALID
>                                     synchronize_srcu_expedited()
>   Acquire SRCU
>   __vmx_handle_ept_violation()
>   RET_PF_RETRY due to INVALID
>                                     Set memslot NULL
>   kvm_vcpu_gfn_to_memslot()

Got it, thanks!
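The exchange above leans on KVM's memslot lifetime rules, which are worth
spelling out since both this patch and the prefault fix exist to avoid
holding SRCU across a memslot update. A sketch of the read-side pattern
(illustrative fragment only, not code from this series; in
tdx_handle_ept_violation() the vCPU already holds SRCU across the retry
loop):

    /*
     * Illustrative fragment: the SRCU read-side discipline for memslot
     * lookups.  Deletion first marks the slot KVM_MEMSLOT_INVALID, then
     * waits for existing readers via synchronize_srcu(), and only later
     * publishes a memslots array without the slot -- so a reader that
     * acquires SRCU after deletion has begun may observe an INVALID slot
     * or no slot at all (NULL), depending on how far deletion has
     * progressed.
     */
    int idx = srcu_read_lock(&vcpu->kvm->srcu);
    struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);

    if (!slot || slot->flags & KVM_MEMSLOT_INVALID) {
            /* Slot is gone or going away; bail so deletion can finish. */
    } else {
            /* @slot may be dereferenced until srcu_read_unlock(). */
    }
    srcu_read_unlock(&vcpu->kvm->srcu, idx);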
* [PATCH v2 3/3] KVM: selftests: Test prefault memory during concurrent memslot removal
  From: Yan Zhao @ 2025-08-22 7:05 UTC
  To: pbonzini, seanjc
  Cc: reinette.chatre, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Test prefault memory during concurrent memslot removal.

Add a new param "remove_slot" to pre_fault_memory() to indicate testing
concurrent memslot removal. When "remove_slot" is set:

Create a remove_thread which deletes the test slot concurrently while the
main thread is executing ioctl KVM_PRE_FAULT_MEMORY on the test slot memory
range. Introduce variables "delete_thread_ready" and "prefault_ready" to
synchronize the slot removal and ioctl KVM_PRE_FAULT_MEMORY.

When the concurrency is achieved, ioctl KVM_PRE_FAULT_MEMORY should return
the error EAGAIN. Otherwise, the ioctl should succeed as in cases where
remove_slot is not set.

Retry ioctl KVM_PRE_FAULT_MEMORY upon receiving EAGAIN. Since the memslot
should have been successfully removed during the retry, EFAULT or ENOENT
should be returned depending on whether the prefault is for private or
shared memory.

Split the existing "gpa" parameter in pre_fault_memory() into "base_gpa"
and "offset" to facilitate adding the test slot back to "base_gpa" after
the test concludes, ensuring that subsequent tests are not affected.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 .../selftests/kvm/pre_fault_memory_test.c | 94 +++++++++++++++----
 1 file changed, 78 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index 0350a8896a2f..56e65feb4c8c 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -10,12 +10,16 @@
 #include <test_util.h>
 #include <kvm_util.h>
 #include <processor.h>
+#include <pthread.h>

 /* Arbitrarily chosen values */
 #define TEST_SIZE	(SZ_2M + PAGE_SIZE)
 #define TEST_NPAGES	(TEST_SIZE / PAGE_SIZE)
 #define TEST_SLOT	10

+static bool prefault_ready;
+static bool delete_thread_ready;
+
 static void guest_code(uint64_t base_gpa)
 {
 	volatile uint64_t val __used;
@@ -30,17 +34,47 @@ static void guest_code(uint64_t base_gpa)
 	GUEST_DONE();
 }

-static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
-			     u64 left)
+static void *remove_slot_worker(void *data)
+{
+	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
+
+	WRITE_ONCE(delete_thread_ready, true);
+
+	while (!READ_ONCE(prefault_ready))
+		cpu_relax();
+
+	vm_mem_region_delete(vcpu->vm, TEST_SLOT);
+
+	WRITE_ONCE(delete_thread_ready, false);
+	return NULL;
+}
+
+static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
+			     u64 size, u64 left, bool private, bool remove_slot)
 {
 	struct kvm_pre_fault_memory range = {
-		.gpa = gpa,
+		.gpa = base_gpa + offset,
 		.size = size,
 		.flags = 0,
 	};
-	u64 prev;
+	pthread_t remove_thread;
+	bool remove_hit = false;
 	int ret, save_errno;
+	u64 prev;

+	if (remove_slot) {
+		pthread_create(&remove_thread, NULL, remove_slot_worker, vcpu);
+
+		while (!READ_ONCE(delete_thread_ready))
+			cpu_relax();
+
+		WRITE_ONCE(prefault_ready, true);
+	}
+
+	/*
+	 * EAGAIN may be returned if slot removal is performed during
+	 * KVM_PRE_FAULT_MEMORY.
+	 */
 	do {
 		prev = range.size;
 		ret = __vcpu_ioctl(vcpu, KVM_PRE_FAULT_MEMORY, &range);
@@ -49,18 +83,42 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
 			    "%sexpecting range.size to change on %s",
 			    ret < 0 ? "not " : "",
 			    ret < 0 ? "failure" : "success");
-	} while (ret >= 0 ? range.size : save_errno == EINTR);

-	TEST_ASSERT(range.size == left,
-		    "Completed with %lld bytes left, expected %" PRId64,
-		    range.size, left);
+		if (remove_slot && ret < 0 && save_errno == EAGAIN)
+			remove_hit = true;
+
+	} while (ret >= 0 ? range.size : ((save_errno == EINTR) || (save_errno == EAGAIN)));

-	if (left == 0)
-		__TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
-	else
-		/* No memory slot causes RET_PF_EMULATE. it results in -ENOENT. */
-		__TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == ENOENT,
+	if (remove_slot) {
+		pthread_join(remove_thread, NULL);
+		WRITE_ONCE(prefault_ready, false);
+
+		vm_userspace_mem_region_add(vcpu->vm, VM_MEM_SRC_ANONYMOUS,
+					    base_gpa, TEST_SLOT, TEST_NPAGES,
+					    private ? KVM_MEM_GUEST_MEMFD : 0);
+	}
+
+	if (remove_hit) {
+		/*
+		 * Prefault within a removed memory slot range returns
+		 * - EFAULT for private memory or
+		 * - ENOENT for shared memory (due to RET_PF_EMULATE).
+		 */
+		__TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == (private ? EFAULT : ENOENT),
 					    "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
+	} else {
+		TEST_ASSERT(range.size == left,
+			    "Completed with %lld bytes left, expected %" PRId64,
+			    range.size, left);
+
+		if (left == 0)
+			__TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_PRE_FAULT_MEMORY",
+						    ret, vcpu->vm);
+		else
+			/* No memory slot causes RET_PF_EMULATE. it results in -ENOENT. */
+			__TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == ENOENT,
+						    "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
+	}
 }

 static void __test_pre_fault_memory(unsigned long vm_type, bool private)
@@ -97,9 +155,13 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private)
 	if (private)
 		vm_mem_set_private(vm, guest_test_phys_mem, TEST_SIZE);
-	pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, 0);
-	pre_fault_memory(vcpu, guest_test_phys_mem + SZ_2M, PAGE_SIZE * 2, PAGE_SIZE);
-	pre_fault_memory(vcpu, guest_test_phys_mem + TEST_SIZE, PAGE_SIZE, PAGE_SIZE);
+
+	pre_fault_memory(vcpu, guest_test_phys_mem, 0, SZ_2M, 0, private, true);
+	pre_fault_memory(vcpu, guest_test_phys_mem, 0, SZ_2M, 0, private, false);
+	pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, PAGE_SIZE * 2, PAGE_SIZE,
+			 private, false);
+	pre_fault_memory(vcpu, guest_test_phys_mem, TEST_SIZE, PAGE_SIZE, PAGE_SIZE,
+			 private, false);

 	vcpu_args_set(vcpu, 1, guest_test_virt_mem);
 	vcpu_run(vcpu);
--
2.43.2
* Re: [PATCH v2 3/3] KVM: selftests: Test prefault memory during concurrent memslot removal
  From: Sean Christopherson @ 2025-09-08 23:47 UTC
  To: Yan Zhao
  Cc: pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On Fri, Aug 22, 2025, Yan Zhao wrote:
> .../selftests/kvm/pre_fault_memory_test.c | 94 +++++++++++++++----
> 1 file changed, 78 insertions(+), 16 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
> index 0350a8896a2f..56e65feb4c8c 100644
> --- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
> +++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
> @@ -10,12 +10,16 @@
>  #include <test_util.h>
>  #include <kvm_util.h>
>  #include <processor.h>
> +#include <pthread.h>
>
>  /* Arbitrarily chosen values */
>  #define TEST_SIZE	(SZ_2M + PAGE_SIZE)
>  #define TEST_NPAGES	(TEST_SIZE / PAGE_SIZE)
>  #define TEST_SLOT	10
>
> +static bool prefault_ready;
> +static bool delete_thread_ready;
> +
>  static void guest_code(uint64_t base_gpa)
>  {
>  	volatile uint64_t val __used;
> @@ -30,17 +34,47 @@ static void guest_code(uint64_t base_gpa)
>  	GUEST_DONE();
>  }
>
> -static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
> -			     u64 left)
> +static void *remove_slot_worker(void *data)
> +{
> +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> +
> +	WRITE_ONCE(delete_thread_ready, true);
> +
> +	while (!READ_ONCE(prefault_ready))
> +		cpu_relax();
> +
> +	vm_mem_region_delete(vcpu->vm, TEST_SLOT);
> +
> +	WRITE_ONCE(delete_thread_ready, false);

Rather than use global variables, which necessitates these "dances" to get
things back to the initial state, use an on-stack structure to communicate
(and obviously make sure the structure is initialized :-D).

> +	return NULL;
> +}
> +
> +static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
> +			     u64 size, u64 left, bool private, bool remove_slot)
>  {
>  	struct kvm_pre_fault_memory range = {
> -		.gpa = gpa,
> +		.gpa = base_gpa + offset,
>  		.size = size,
>  		.flags = 0,
>  	};
> -	u64 prev;
> +	pthread_t remove_thread;
> +	bool remove_hit = false;
>  	int ret, save_errno;
> +	u64 prev;
>
> +	if (remove_slot) {

I don't see any reason to make the slot removal conditional. There are three
things we're interested in testing (so far):

 1. Success
 2. ENOENT due to no memslot
 3. EAGAIN due to INVALID memslot

#1 and #2 are mutually exclusive, or rather easier to test via separate
testcases (because writing to non-existent memory is trivial). But for #3, I
don't see a reason to make it mutually exclusive with #1 _or_ #2.

As written, it's always mutually exclusive with #2 because otherwise it would
be difficult (impossible?) to determine if KVM exited on the "right" address.
But the only reason that's true is because the test recreates the slot *after*
prefaulting, and _that_ makes #3 _conditionally_ mutually exclusive with #1,
i.e. the test doesn't validate success if the INVALID memslot race is hit.

Rather than make everything mutually exclusive, just restore the memslot and
retry prefaulting. That also gives us easy bonus coverage that doing
KVM_PRE_FAULT_MEMORY on memory that has already been faulted in is idempotent,
i.e. that KVM_PRE_FAULT_MEMORY succeeds if it already succeeded (and nothing
nuked the mappings in the interim).

If the memslot is restored and the loop retries, then #3 becomes a
complimentary and orthogonal testcase to #1 and #2.

This? (with an opportunistic s/left/expected_left that confused me; I thought
"left" meant how many bytes were left to prefault, but it actually means how
many bytes are expected to be left when failure occurs).

---
 .../selftests/kvm/pre_fault_memory_test.c | 122 +++++++++++++++---
 1 file changed, 105 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index 0350a8896a2f..2dbabf4b0b15 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -10,6 +10,7 @@
 #include <test_util.h>
 #include <kvm_util.h>
 #include <processor.h>
+#include <pthread.h>

 /* Arbitrarily chosen values */
 #define TEST_SIZE	(SZ_2M + PAGE_SIZE)
@@ -30,18 +31,66 @@ static void guest_code(uint64_t base_gpa)
 	GUEST_DONE();
 }

-static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
-			     u64 left)
+struct slot_worker_data {
+	struct kvm_vm *vm;
+	u64 gpa;
+	uint32_t flags;
+	bool worker_ready;
+	bool prefault_ready;
+	bool recreate_slot;
+};
+
+static void *delete_slot_worker(void *__data)
+{
+	struct slot_worker_data *data = __data;
+	struct kvm_vm *vm = data->vm;
+
+	WRITE_ONCE(data->worker_ready, true);
+
+	while (!READ_ONCE(data->prefault_ready))
+		cpu_relax();
+
+	vm_mem_region_delete(vm, TEST_SLOT);
+
+	while (!READ_ONCE(data->recreate_slot))
+		cpu_relax();
+
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, data->gpa,
+				    TEST_SLOT, TEST_NPAGES, data->flags);
+
+	return NULL;
+}
+
+static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
+			     u64 size, u64 expected_left, bool private)
 {
 	struct kvm_pre_fault_memory range = {
-		.gpa = gpa,
+		.gpa = base_gpa + offset,
 		.size = size,
 		.flags = 0,
 	};
-	u64 prev;
+	struct slot_worker_data data = {
+		.vm = vcpu->vm,
+		.gpa = base_gpa,
+		.flags = private ? KVM_MEM_GUEST_MEMFD : 0,
+	};
+	bool slot_recreated = false;
+	pthread_t slot_worker;
 	int ret, save_errno;
+	u64 prev;

-	do {
+	/*
+	 * Concurrently delete (and recreate) the slot to test KVM's handling
+	 * of a racing memslot deletion with prefaulting.
+	 */
+	pthread_create(&slot_worker, NULL, delete_slot_worker, &data);
+
+	while (!READ_ONCE(data.worker_ready))
+		cpu_relax();
+
+	WRITE_ONCE(data.prefault_ready, true);
+
+	for (;;) {
 		prev = range.size;
 		ret = __vcpu_ioctl(vcpu, KVM_PRE_FAULT_MEMORY, &range);
 		save_errno = errno;
@@ -49,18 +98,56 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
 			    "%sexpecting range.size to change on %s",
 			    ret < 0 ? "not " : "",
 			    ret < 0 ? "failure" : "success");
-	} while (ret >= 0 ? range.size : save_errno == EINTR);

-	TEST_ASSERT(range.size == left,
-		    "Completed with %lld bytes left, expected %" PRId64,
-		    range.size, left);
+		/*
+		 * Immediately retry prefaulting if KVM was interrupted by an
+		 * unrelated signal/event.
+		 */
+		if (ret < 0 && save_errno == EINTR)
+			continue;

-	if (left == 0)
-		__TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
+		/*
+		 * Tell the worker to recreate the slot in order to complete
+		 * prefaulting (if prefault didn't already succeed before the
+		 * slot was deleted) and/or to prepare for the next testcase.
+		 * Wait for the worker to exit so that the next invocation of
+		 * prefaulting is guaranteed to complete (assuming no KVM bugs).
+		 * Always retry prefaulting to simply the retry logic.  Either
+		 * prefaulting already succeeded, in which case retrying should
+		 * also succeed, or retry is needed to get a stable result.
+		 */
+		if (!slot_recreated) {
+			WRITE_ONCE(data.recreate_slot, true);
+			pthread_join(slot_worker, NULL);
+			slot_recreated = true;
+			continue;
+		}
+
+		/*
+		 * All done if there are no remaining bytes to prefault, or if
+		 * prefaulting failed (EINTR was handled above, and EAGAIN due
+		 * to prefaulting a memslot that's being actively deleted should
+		 * be impossible since the memslot has already been recreated).
+		 */
+		if (!range.size || ret < 0)
+			break;
+	}
+
+	TEST_ASSERT(range.size == expected_left,
+		    "Completed with %llu bytes left, expected %lu",
+		    range.size, expected_left);
+
+	/*
+	 * Assert success if prefaulting the entire range should succeed, i.e.
+	 * complete with no bytes remaining.  Otherwise prefaulting should have
+	 * failed due to ENOENT (due to RET_PF_EMULATE for emulated MMIO when
+	 * no memslot exists).
+	 */
+	if (!expected_left)
+		TEST_ASSERT_VM_VCPU_IOCTL(!ret, KVM_PRE_FAULT_MEMORY, ret, vcpu->vm);
 	else
-		/* No memory slot causes RET_PF_EMULATE. it results in -ENOENT. */
-		__TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == ENOENT,
-					    "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
+		TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == ENOENT,
+					  KVM_PRE_FAULT_MEMORY, ret, vcpu->vm);
 }

 static void __test_pre_fault_memory(unsigned long vm_type, bool private)
@@ -97,9 +184,10 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private)
 	if (private)
 		vm_mem_set_private(vm, guest_test_phys_mem, TEST_SIZE);
-	pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, 0);
-	pre_fault_memory(vcpu, guest_test_phys_mem + SZ_2M, PAGE_SIZE * 2, PAGE_SIZE);
-	pre_fault_memory(vcpu, guest_test_phys_mem + TEST_SIZE, PAGE_SIZE, PAGE_SIZE);
+
+	pre_fault_memory(vcpu, guest_test_phys_mem, 0, SZ_2M, 0, private);
+	pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, PAGE_SIZE * 2, PAGE_SIZE, private);
+	pre_fault_memory(vcpu, guest_test_phys_mem, TEST_SIZE, PAGE_SIZE, PAGE_SIZE, private);

 	vcpu_args_set(vcpu, 1, guest_test_virt_mem);
 	vcpu_run(vcpu);

base-commit: ecbcc2461839e848970468b44db32282e5059925
--
* Re: [PATCH v2 3/3] KVM: selftests: Test prefault memory during concurrent memslot removal
  From: Yan Zhao @ 2025-09-15 8:21 UTC
  To: Sean Christopherson
  Cc: pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On Mon, Sep 08, 2025 at 04:47:23PM -0700, Sean Christopherson wrote:
> On Fri, Aug 22, 2025, Yan Zhao wrote:
> > .../selftests/kvm/pre_fault_memory_test.c | 94 +++++++++++++++----
> > 1 file changed, 78 insertions(+), 16 deletions(-)
> >
> > diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
> > index 0350a8896a2f..56e65feb4c8c 100644
> > --- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
> > +++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
> > @@ -10,12 +10,16 @@
> >  #include <test_util.h>
> >  #include <kvm_util.h>
> >  #include <processor.h>
> > +#include <pthread.h>
> >
> >  /* Arbitrarily chosen values */
> >  #define TEST_SIZE	(SZ_2M + PAGE_SIZE)
> >  #define TEST_NPAGES	(TEST_SIZE / PAGE_SIZE)
> >  #define TEST_SLOT	10
> >
> > +static bool prefault_ready;
> > +static bool delete_thread_ready;
> > +
> >  static void guest_code(uint64_t base_gpa)
> >  {
> >  	volatile uint64_t val __used;
> > @@ -30,17 +34,47 @@ static void guest_code(uint64_t base_gpa)
> >  	GUEST_DONE();
> >  }
> >
> > -static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
> > -			     u64 left)
> > +static void *remove_slot_worker(void *data)
> > +{
> > +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> > +
> > +	WRITE_ONCE(delete_thread_ready, true);
> > +
> > +	while (!READ_ONCE(prefault_ready))
> > +		cpu_relax();
> > +
> > +	vm_mem_region_delete(vcpu->vm, TEST_SLOT);
> > +
> > +	WRITE_ONCE(delete_thread_ready, false);
>
> Rather than use global variables, which necessitates these "dances" to get
> things back to the initial state, use an on-stack structure to communicate
> (and obviously make sure the structure is initialized :-D).

Sorry for the late reply. Indeed, this makes the code more elegant!

> > +	return NULL;
> > +}
> > +
> > +static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
> > +			     u64 size, u64 left, bool private, bool remove_slot)
> >  {
> >  	struct kvm_pre_fault_memory range = {
> > -		.gpa = gpa,
> > +		.gpa = base_gpa + offset,
> >  		.size = size,
> >  		.flags = 0,
> >  	};
> > -	u64 prev;
> > +	pthread_t remove_thread;
> > +	bool remove_hit = false;
> >  	int ret, save_errno;
> > +	u64 prev;
> >
> > +	if (remove_slot) {
>
> I don't see any reason to make the slot removal conditional. There are three
> things we're interested in testing (so far):
>
>  1. Success
>  2. ENOENT due to no memslot
>  3. EAGAIN due to INVALID memslot
>
> #1 and #2 are mutually exclusive, or rather easier to test via separate
> testcases (because writing to non-existent memory is trivial). But for #3, I
> don't see a reason to make it mutually exclusive with #1 _or_ #2.
>
> As written, it's always mutually exclusive with #2 because otherwise it would
> be difficult (impossible?) to determine if KVM exited on the "right" address.
> But the only reason that's true is because the test recreates the slot *after*
> prefaulting, and _that_ makes #3 _conditionally_ mutually exclusive with #1,
> i.e. the test doesn't validate success if the INVALID memslot race is hit.
>
> Rather than make everything mutually exclusive, just restore the memslot and
> retry prefaulting. That also gives us easy bonus coverage that doing
> KVM_PRE_FAULT_MEMORY on memory that has already been faulted in is idempotent,
> i.e. that KVM_PRE_FAULT_MEMORY succeeds if it already succeeded (and nothing
> nuked the mappings in the interim).

That's a good idea.

> If the memslot is restored and the loop retries, then #3 becomes a
> complimentary and orthogonal testcase to #1 and #2.
>
> This? (with an opportunistic s/left/expected_left that confused me; I thought
> "left" meant how many bytes were left to prefault, but it actually means how
> many bytes are expected to be left when failure occurs).

Looks good to me, except for a minor bug.

> +		if (!slot_recreated) {
> +			WRITE_ONCE(data.recreate_slot, true);
> +			pthread_join(slot_worker, NULL);
> +			slot_recreated = true;
> +			continue;

If delete_slot_worker() invokes vm_mem_region_delete() slowly enough due to
scheduling delays, the return value from __vcpu_ioctl() could be 0 with
range.size being 0 at this point.

What about checking range.size before continuing?

@@ -120,7 +126,8 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
 			WRITE_ONCE(data.recreate_slot, true);
 			pthread_join(slot_worker, NULL);
 			slot_recreated = true;
-			continue;
+			if (range.size)
+				continue;
 		}

Otherwise, the next __vcpu_ioctl() would return -1 with errno == EINVAL, which
will break the assertion below.

> +	/*
> +	 * Assert success if prefaulting the entire range should succeed, i.e.
> +	 * complete with no bytes remaining.  Otherwise prefaulting should have
> +	 * failed due to ENOENT (due to RET_PF_EMULATE for emulated MMIO when
> +	 * no memslot exists).
> +	 */
> +	if (!expected_left)
> +		TEST_ASSERT_VM_VCPU_IOCTL(!ret, KVM_PRE_FAULT_MEMORY, ret, vcpu->vm);
* Re: [PATCH v2 3/3] KVM: selftests: Test prefault memory during concurrent memslot removal
  From: Sean Christopherson @ 2025-09-16 0:14 UTC
  To: Yan Zhao
  Cc: pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On Mon, Sep 15, 2025, Yan Zhao wrote:
> On Mon, Sep 08, 2025 at 04:47:23PM -0700, Sean Christopherson wrote:
> > On Fri, Aug 22, 2025, Yan Zhao wrote:
> > +		if (!slot_recreated) {
> > +			WRITE_ONCE(data.recreate_slot, true);
> > +			pthread_join(slot_worker, NULL);
> > +			slot_recreated = true;
> > +			continue;
> If delete_slot_worker() invokes vm_mem_region_delete() slowly enough due to
> scheduling delays, the return value from __vcpu_ioctl() could be 0 with
> range.size being 0 at this point.
>
> What about checking range.size before continuing?
>
> @@ -120,7 +126,8 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
>  			WRITE_ONCE(data.recreate_slot, true);
>  			pthread_join(slot_worker, NULL);
>  			slot_recreated = true;
> -			continue;
> +			if (range.size)
> +				continue;
>  		}
>
> Otherwise, the next __vcpu_ioctl() would return -1 with errno == EINVAL, which
> will break the assertion below.

Drat, I missed that kvm_vcpu_pre_fault_memory() returns -EINVAL on a size of
'0' (see the wrong comment snippet "Either prefaulting already succeeded, in
which case retrying should also succeed, or retry is needed to get a stable
result").

I'll circle back to this tomorrow. IIRC, there was a reason I didn't want to
check range.size in that path, but for the life of me I can't remember why :-/
* Re: [PATCH v2 3/3] KVM: selftests: Test prefault memory during concurrent memslot removal
  From: Sean Christopherson @ 2025-09-24 17:09 UTC
  To: Yan Zhao
  Cc: pbonzini, reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On Mon, Sep 15, 2025, Sean Christopherson wrote:
> On Mon, Sep 15, 2025, Yan Zhao wrote:
> > On Mon, Sep 08, 2025 at 04:47:23PM -0700, Sean Christopherson wrote:
> > > On Fri, Aug 22, 2025, Yan Zhao wrote:
> > > +		if (!slot_recreated) {
> > > +			WRITE_ONCE(data.recreate_slot, true);
> > > +			pthread_join(slot_worker, NULL);
> > > +			slot_recreated = true;
> > > +			continue;
> > If delete_slot_worker() invokes vm_mem_region_delete() slowly enough due to
> > scheduling delays, the return value from __vcpu_ioctl() could be 0 with
> > range.size being 0 at this point.
> >
> > What about checking range.size before continuing?
> >
> > @@ -120,7 +126,8 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
> >  			WRITE_ONCE(data.recreate_slot, true);
> >  			pthread_join(slot_worker, NULL);
> >  			slot_recreated = true;
> > -			continue;
> > +			if (range.size)
> > +				continue;
> >  		}
> >
> > Otherwise, the next __vcpu_ioctl() would return -1 with errno == EINVAL, which
> > will break the assertion below.
>
> Drat, I missed that kvm_vcpu_pre_fault_memory() returns -EINVAL on a size of
> '0' (see the wrong comment snippet "Either prefaulting already succeeded, in
> which case retrying should also succeed, or retry is needed to get a stable
> result").
>
> I'll circle back to this tomorrow. IIRC, there was a reason I didn't want to
> check range.size in that path, but for the life of me I can't remember why :-/

I'm 99% certain I was worried about false passes, but after working through
the possible scenarios, I don't see any way for bailing on !range.size to
result in missing a KVM bug. So I'll post a formal patch with the below
squashed in. Thanks much!

diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index 2dbabf4b0b15..f04768c1d2e4 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -112,15 +112,24 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
 		 * slot was deleted) and/or to prepare for the next testcase.
 		 * Wait for the worker to exit so that the next invocation of
-		 * prefaulting is guaranteed to complete (assuming no KVM bugs).
-		 * Always retry prefaulting to simply the retry logic.  Either
-		 * prefaulting already succeeded, in which case retrying should
-		 * also succeed, or retry is needed to get a stable result.
+		 * prefaulting is guaranteed to complete (assuming no KVM bugs).
 		 */
 		if (!slot_recreated) {
 			WRITE_ONCE(data.recreate_slot, true);
 			pthread_join(slot_worker, NULL);
 			slot_recreated = true;
-			continue;
+
+			/*
+			 * Retry prefaulting to get a stable result, i.e. to
+			 * avoid seeing random EAGAIN failures.  Don't retry if
+			 * prefaulting already succeeded, as KVM disallows
+			 * prefaulting with size=0, i.e. blindly retrying would
+			 * result in test failures due to EINVAL.  KVM should
+			 * always return success if all bytes are prefaulted,
+			 * i.e. there is no need to guard against EAGAIN being
+			 * returned.
+			 */
+			if (range.size)
+				continue;
 		}

 		/*
* Re: [PATCH v2 0/3] KVM: Fix deadlock for invalid memslots
  From: Sean Christopherson @ 2025-09-16 0:25 UTC
  To: Sean Christopherson, pbonzini, Yan Zhao
  Cc: reinette.chatre, rick.p.edgecombe, linux-kernel, kvm

On Fri, 22 Aug 2025 15:03:04 +0800, Yan Zhao wrote:
> This series addresses the deadlock issue encountered with invalid memslots
> during prefaulting or TDX private EPT violations.
>
> Patches 1-2 are the new fixes from Sean.
> Patch 1 is for the prefaulting case,
> patch 2 for the TDX private EPT violation case.
>
> [...]

Applied 1 and 2 to kvm-x86 mmu. I'll post a proper patch for the selftest
once I figure out what I intended with respect to range.size. Thanks!

[1/3] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves memslot during prefault
      https://github.com/kvm-x86/linux/commit/3ccbf6f47098
[2/3] KVM: TDX: Do not retry locally when the retry is caused by invalid memslot
      https://github.com/kvm-x86/linux/commit/2bc2694fe20b

--
https://github.com/kvm-x86/linux/tree/next