* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 17:06 ` Sean Christopherson
@ 2025-05-19 17:49 ` Reinette Chatre
2025-05-19 20:14 ` Edgecombe, Rick P
2025-05-20 5:33 ` Yan Zhao
From: Reinette Chatre @ 2025-05-19 17:49 UTC (permalink / raw)
To: Sean Christopherson, Rick P Edgecombe
Cc: Yan Y Zhao, kvm@vger.kernel.org, pbonzini@redhat.com,
linux-kernel@vger.kernel.org
Hi Sean,
On 5/19/25 10:06 AM, Sean Christopherson wrote:
> On Mon, May 19, 2025, Rick P Edgecombe wrote:
>> On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
>>> Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
>>> kicking vCPUs out of KVM?
>>>
>>> Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
>>> KVM can simply drop and reacquire SRCU in the relevant paths.
>>
>> During the initial debugging and kicking around stage, this is the first
>> direction we looked. But kvm_gmem_populate() doesn't have srcu locked, so then
>> kvm_tdp_map_page() tries to unlock without it being held. (although that version
>> didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
>> came up with the version in this series, which we held review on for the list:
>
> Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
>
>>> However, upon further consideration, I am reluctant to implement this fix for
>
> Which fix?
>
>>> the following reasons:
>>> - kvm_gmem_populate() already holds the kvm->slots_lock.
>>> - While retrying with srcu unlock and lock can workaround the
>>> KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
>>> and tdx_handle_ept_violation() faulting with different memslot layouts.
>
> This behavior has existed since pretty much the beginning of KVM time. TDX is the
> oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
>
> Arguably, _TDX_ is buggy by not providing this behavior.
>
>> I'm not sure why the second one is really a problem. For the first one I think
>> that path could just take the srcu lock in the proper order with
>> kvm->slots_lock?
>
> Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> problematic, as KVM synchronizes SRCU while holding slots_lock.
>
> That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> hack. What about something like this?
>
> ---
> arch/x86/kvm/mmu.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 49 +++++++++++++++++++++++++++---------------
> arch/x86/kvm/vmx/tdx.c | 7 ++++--
> virt/kvm/kvm_main.c | 5 ++---
> 4 files changed, 41 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..0fc68f0fe80e 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -259,6 +259,8 @@ extern bool tdp_mmu_enabled;
>
> bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> + u8 *level);
>
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..4f16fe95173c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4851,24 +4851,15 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> {
> int r;
>
> - /*
> - * Restrict to TDP page fault, since that's the only case where the MMU
> - * is indexed by GPA.
> - */
> - if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> - return -EOPNOTSUPP;
> + if (signal_pending(current))
> + return -EINTR;
>
> - do {
> - if (signal_pending(current))
> - return -EINTR;
> + if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> + return -EIO;
>
> - if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> - return -EIO;
> -
> - cond_resched();
> - r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> - } while (r == RET_PF_RETRY);
> + cond_resched();
>
> + r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> if (r < 0)
> return r;
>
> @@ -4878,10 +4869,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> case RET_PF_WRITE_PROTECTED:
> return 0;
>
> + case RET_PF_RETRY:
> + return -EAGAIN;
> +
> case RET_PF_EMULATE:
> return -ENOENT;
>
> - case RET_PF_RETRY:
> case RET_PF_CONTINUE:
> case RET_PF_INVALID:
> default:
> @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> }
> EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> +{
> + int r;
> +
> + /*
> + * Restrict to TDP page fault, since that's the only case where the MMU
> + * is indexed by GPA.
> + */
> + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> + return -EOPNOTSUPP;
> +
> + for (;;) {
> + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> + if (r != -EAGAIN)
> + break;
> +
> + /* Comment goes here. */
> + kvm_vcpu_srcu_read_unlock(vcpu);
> + kvm_vcpu_srcu_read_lock(vcpu);
> + }
added "return r" here.
> +}
> +
> long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> {
> @@ -4918,7 +4933,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> * Shadow paging uses GVA for kvm page fault, so restrict to
> * two-dimensional paging.
> */
> - r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
> + r = kvm_tdp_prefault_page(vcpu, range->gpa, error_code, &level);
> if (r < 0)
> return r;
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..1a232562080d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3075,8 +3075,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> if (ret != 1)
> return -ENOMEM;
>
> - ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> - if (ret < 0)
> + do {
> + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> + while (ret == -EAGAIN);
I added the closing '}' here.
> +
> + if (ret)
> goto out;
>
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b24db92e98f3..21a3fa7476dd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> {
> - int idx;
> long r;
> u64 full_size;
>
> @@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> return -EINVAL;
>
> vcpu_load(vcpu);
> - idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>
> full_size = range->size;
> do {
> @@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> cond_resched();
> } while (range->size);
>
> - srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
> vcpu_put(vcpu);
>
> /* Return success if at least one page was mapped successfully. */
>
> base-commit: 12ca5c63556bbfcd77fe890fcdd1cd1adfb31fdd
> --
Just reporting on the testing side ...
I just made two small changes to the patch as indicated. With these changes, pre_fault_memory_test
(with changes from patch #2) hangs (with KVM still responding to signals). fwiw,
pre_fault_memory_test also hangs with [1].
I also tried out the TDX stress tests that you do not (yet) have access to. The stress tests
pass with [1] but hang with this change.
Reinette
[1] https://lore.kernel.org/lkml/aCsy-m_esVjy8Pey@google.com/
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 17:06 ` Sean Christopherson
2025-05-19 17:49 ` Reinette Chatre
@ 2025-05-19 20:14 ` Edgecombe, Rick P
2025-05-20 5:33 ` Yan Zhao
From: Edgecombe, Rick P @ 2025-05-19 20:14 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Chatre, Reinette,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Mon, 2025-05-19 at 10:06 -0700, Sean Christopherson wrote:
> On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > kicking vCPUs out of KVM?
> > >
> > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > KVM can simply drop and reacquire SRCU in the relevant paths.
> >
> > During the initial debugging and kicking around stage, this is the first
> > direction we looked. But kvm_gmem_populate() doesn't have srcu locked, so then
> > kvm_tdp_map_page() tries to unlock without it being held. (although that version
> > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > came up with the version in this series, which we held review on for the list:
>
> Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
>
> > > However, upon further consideration, I am reluctant to implement this fix for
>
> Which fix?
The one considered when debugging the issue. It was pretty much exactly like you
first suggested, but with the srcu lock taken in tdx_gmem_post_populate(). Since
it was essentially the same, I just shared Yan's comment on it.
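For reference, the shape of that first attempt in tdx_gmem_post_populate() was roughly the below (from memory, untested sketch, not the actual diff we kicked around):

	/* Rough sketch of the first attempt: wrap the map call in SRCU. */
	kvm_vcpu_srcu_read_lock(vcpu);
	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
	kvm_vcpu_srcu_read_unlock(vcpu);
	if (ret < 0)
		goto out;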
In case it looks like internal code review, here is some more history:
1. Reinette reports bug from internal test, wondering if it's valid userspace
behavior
2. I suggest srcu as the root cause
3. Yan provides specific diff (pretty much what you suggested) for Reinette to
test, who finds the post_populate case generates a warning
4. Yan looks at fixing up post_populate case, but decides she doesn't like it
(the quoted blurb) and develops the alternative in this series
> > > the following reasons:
> > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > - While retrying with srcu unlock and lock can workaround the
> > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > and tdx_handle_ept_violation() faulting with different memslot layouts.
>
> This behavior has existed since pretty much the beginning of KVM time.
>
The non-TDX issues today are related to the pre-fault memory stuff, which
doesn't enter the guest for a different reason.
> TDX is the
> oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
>
> Arguably, _TDX_ is buggy by not providing this behavior.
Long term, the zero-step issue needs to be resolved. So yes, we should avoid
building around it for anything non-critical (like this). But I don't see why
prefault is not in the same category of oddness.
>
> > I'm not sure why the second one is really a problem. For the first one I think
> > that path could just take the srcu lock in the proper order with
> > kvm->slots_lock?
>
> Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> problematic, as KVM synchronizes SRCU while holding slots_lock.
>
> That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> hack. What about something like this?
As Reinette noticed, it doesn't pass the included test. It looks like
synchronize_srcu_expedited() is not waiting any longer, but there is some other
RET_PF_RETRY loop not caused by KVM_MEMSLOT_INVALID. Must be some side effect.
Will debug further.
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-19 17:06 ` Sean Christopherson
2025-05-19 17:49 ` Reinette Chatre
2025-05-19 20:14 ` Edgecombe, Rick P
@ 2025-05-20 5:33 ` Yan Zhao
2025-05-20 16:13 ` Sean Christopherson
From: Yan Zhao @ 2025-05-20 5:33 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > kicking vCPUs out of KVM?
> > >
> > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > KVM can simply drop and reacquire SRCU in the relevant paths.
> >
> > During the initial debugging and kicking around stage, this is the first
> > direction we looked. But kvm_gmem_populate() doesn't have srcu locked, so then
> > kvm_tdp_map_page() tries to unlock without it being held. (although that version
> > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > came up with the version in this series, which we held review on for the list:
>
> Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
>
> > > However, upon further consideration, I am reluctant to implement this fix for
>
> Which fix?
>
> > > the following reasons:
> > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > - While retrying with srcu unlock and lock can workaround the
> > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > and tdx_handle_ept_violation() faulting with different memslot layouts.
>
> This behavior has existed since pretty much the beginning of KVM time. TDX is the
> oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
>
> Arguably, _TDX_ is buggy by not providing this behavior.
>
> > I'm not sure why the second one is really a problem. For the first one I think
> > that path could just take the srcu lock in the proper order with
> > kvm->slots_lock?
>
> Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> problematic, as KVM synchronizes SRCU while holding slots_lock.
>
> That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> hack. What about something like this?
So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
Generally I think it's good, except that it missed a kvm_mmu_reload() (please
refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in
tdx_handle_ept_violation() path (So, Reinette reported it failed the TDX stress
tests at [1]).
As this is not a bug hit by a real VMM, could I play with this fix for a while and
come back to you later?
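Roughly, the missing TDX-side piece I'm thinking of would sit in the local retry loop of tdx_handle_ept_violation(), something like the untested sketch below (just for illustration):

	while (1) {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
		if (ret != RET_PF_RETRY || !local_retry)
			break;

		/* Let a concurrent memslot update (synchronize_srcu) complete. */
		kvm_vcpu_srcu_read_unlock(vcpu);
		kvm_vcpu_srcu_read_lock(vcpu);

		cond_resched();
	}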
>
> ---
> arch/x86/kvm/mmu.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 49 +++++++++++++++++++++++++++---------------
> arch/x86/kvm/vmx/tdx.c | 7 ++++--
> virt/kvm/kvm_main.c | 5 ++---
> 4 files changed, 41 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..0fc68f0fe80e 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -259,6 +259,8 @@ extern bool tdp_mmu_enabled;
>
> bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> + u8 *level);
>
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..4f16fe95173c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4851,24 +4851,15 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> {
> int r;
>
> - /*
> - * Restrict to TDP page fault, since that's the only case where the MMU
> - * is indexed by GPA.
> - */
> - if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> - return -EOPNOTSUPP;
> + if (signal_pending(current))
> + return -EINTR;
>
> - do {
> - if (signal_pending(current))
> - return -EINTR;
> + if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> + return -EIO;
>
> - if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> - return -EIO;
> -
> - cond_resched();
> - r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> - } while (r == RET_PF_RETRY);
> + cond_resched();
>
> + r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> if (r < 0)
> return r;
>
> @@ -4878,10 +4869,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> case RET_PF_WRITE_PROTECTED:
> return 0;
>
> + case RET_PF_RETRY:
> + return -EAGAIN;
> +
> case RET_PF_EMULATE:
> return -ENOENT;
>
> - case RET_PF_RETRY:
> case RET_PF_CONTINUE:
> case RET_PF_INVALID:
> default:
> @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> }
> EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> +{
> + int r;
> +
> + /*
> + * Restrict to TDP page fault, since that's the only case where the MMU
> + * is indexed by GPA.
> + */
> + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> + return -EOPNOTSUPP;
> +
> + for (;;) {
> + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> + if (r != -EAGAIN)
> + break;
> +
> + /* Comment goes here. */
> + kvm_vcpu_srcu_read_unlock(vcpu);
> + kvm_vcpu_srcu_read_lock(vcpu);
For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
the memslot removal succeeds after releasing the SRCU, then the old root is
stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
from being always true.
[1] https://lore.kernel.org/all/dcdb3551-d95c-4724-84ee-d0a6611ca1bf@intel.com/
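I.e. something like this untested sketch on top of your kvm_tdp_prefault_page() loop:

	for (;;) {
		r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
		if (r != -EAGAIN)
			break;

		/* Let the pending memslot update (synchronize_srcu) complete. */
		kvm_vcpu_srcu_read_unlock(vcpu);
		kvm_vcpu_srcu_read_lock(vcpu);

		/* Pick up a fresh root if the update invalidated the old one. */
		r = kvm_mmu_reload(vcpu);
		if (r)
			break;
	}
	return r;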
> + }
> +}
> +
> long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> {
> @@ -4918,7 +4933,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> * Shadow paging uses GVA for kvm page fault, so restrict to
> * two-dimensional paging.
> */
> - r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
> + r = kvm_tdp_prefault_page(vcpu, range->gpa, error_code, &level);
> if (r < 0)
> return r;
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..1a232562080d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3075,8 +3075,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> if (ret != 1)
> return -ENOMEM;
>
> - ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> - if (ret < 0)
> + do {
> + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> + while (ret == -EAGAIN);
> +
> + if (ret)
> goto out;
>
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b24db92e98f3..21a3fa7476dd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> {
> - int idx;
> long r;
> u64 full_size;
>
> @@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> return -EINVAL;
>
> vcpu_load(vcpu);
> - idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>
> full_size = range->size;
> do {
> @@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> cond_resched();
> } while (range->size);
>
> - srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
> vcpu_put(vcpu);
>
> /* Return success if at least one page was mapped successfully. */
>
> base-commit: 12ca5c63556bbfcd77fe890fcdd1cd1adfb31fdd
> --
>
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-20 5:33 ` Yan Zhao
@ 2025-05-20 16:13 ` Sean Christopherson
2025-05-21 1:45 ` Yan Zhao
From: Sean Christopherson @ 2025-05-20 16:13 UTC (permalink / raw)
To: Yan Zhao
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Tue, May 20, 2025, Yan Zhao wrote:
> On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> > On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > > kicking vCPUs out of KVM?
> > > >
> > > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > > KVM can simply drop and reacquire SRCU in the relevant paths.
> > >
> > > During the initial debugging and kicking around stage, this is the first
> > > direction we looked. But kvm_gmem_populate() doesn't have srcu locked, so then
> > > kvm_tdp_map_page() tries to unlock without it being held. (although that version
> > > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > > came up with the version in this series, which we held review on for the list:
> >
> > Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
> >
> > > > However, upon further consideration, I am reluctant to implement this fix for
> >
> > Which fix?
> >
> > > > the following reasons:
> > > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > > - While retrying with srcu unlock and lock can workaround the
> > > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > > and tdx_handle_ept_violation() faulting with different memslot layouts.
> >
> > This behavior has existed since pretty much the beginning of KVM time. TDX is the
> > oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> > RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> > RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
> >
> > Arguably, _TDX_ is buggy by not providing this behavior.
> >
> > > I'm not sure why the second one is really a problem. For the first one I think
> > > that path could just take the srcu lock in the proper order with
> > > kvm->slots_lock?
> >
> > Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> > problematic, as KVM synchronizes SRCU while holding slots_lock.
> >
> > That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> > hack. What about something like this?
> So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
Yes, ideally. Acquiring SRCU wouldn't be the end of the world, but I don't love
the idea of taking a lock just so that the lock can be conditionally dropped in
a common flow. It's not a deal breaker (I wouldn't be surprised if there's at
least one path in KVM that acquires SRCU purely because of such behavior), but
separating kvm_tdp_prefault_page() from kvm_tdp_map_page() is cleaner.
> Generally I think it's good, except that it missed a kvm_mmu_reload() (please
> refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in
> tdx_handle_ept_violation() path (So, Reinette reported it failed the TDX stress
> tests at [1]).
> > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > }
> > EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> >
> > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > +{
> > + int r;
> > +
> > + /*
> > + * Restrict to TDP page fault, since that's the only case where the MMU
> > + * is indexed by GPA.
> > + */
> > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > + return -EOPNOTSUPP;
> > +
> > + for (;;) {
> > + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > + if (r != -EAGAIN)
> > + break;
> > +
> > + /* Comment goes here. */
> > + kvm_vcpu_srcu_read_unlock(vcpu);
> > + kvm_vcpu_srcu_read_lock(vcpu);
> For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> the memslot removal succeeds after releasing the SRCU, then the old root is
> stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> from being always true.
That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
otherwise kvm_mmu_reload() will do nothing.
Thinking about this scenario more, I don't mind punting this problem to userspace
for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
breaking userspace, should provide consistent behavior regardless of VM type, and
KVM needs the "complex" code irrespective of this particular scenario.
I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
because then KVM is effectively providing the same overall behavior as KVM_RUN,
just with slightly different roles and responsibilities between KVM and
userspace. And -ENOENT is also flat out wrong for the case where a memslot is
being moved, but the new base+size still contains the to-be-faulted GPA.
I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
encountered during prefault (as identified by fault->prefetch).
For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
i.e. punting to userspace is not a viable option. But that path also has options
that aren't available to prefaulting. E.g. it could (and probably should) break
early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
called out, the zero-step mess really needs to be solved in a more robust fashion.
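(Rough, untested sketch of that idea on the TDX side; kvm_request_pending() stands in for whatever check we'd actually settle on:)

	while (1) {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
		if (ret != RET_PF_RETRY || !local_retry)
			break;

		/* Bail on _any_ pending request, not just KVM_REQ_VM_DEAD. */
		if (kvm_request_pending(vcpu) || signal_pending(current))
			break;

		cond_resched();
	}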
So this?
---
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 20 May 2025 07:55:32 -0700
Subject: [PATCH] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves
memslot during prefault
Return -EAGAIN if userspace attempts to delete or move a memslot while also
prefaulting memory for that same memslot, i.e. force userspace to retry
instead of trying to handle the scenario entirely within KVM. Unlike
KVM_RUN, which needs to handle the scenario entirely within KVM because
userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
return -EAGAIN without breaking userspace as this scenario can't have ever
worked (and there's no sane use case for prefaulting to a memslot that's
being deleted/moved).
And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
forward progress. E.g. to handle such a scenario, KVM would need to drop
and reacquire SRCU to break the deadlock between the memslot update
(synchronizes SRCU) and the prefault (waits for the memslot update to
complete).
However, dropping SRCU creates more problems, as completing the memslot
update will bump the memslot generation, which in turn will invalidate the
MMU root. To handle that, prefaulting would need to handle pending
KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
mapping each individual page.
I.e. to fully handle this scenario, prefaulting would eventually need to
look a lot like vcpu_enter_guest(). Given that there's no reasonable use
case and practically zero risk of breaking userspace, punt the problem to
userspace and avoid adding unnecessary complexity to the prefault path.
Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
held for the entire duration of populate(), i.e. any memslot modifications
will be fully serialized against TDX's flavor of prefaulting.
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
Debugged-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a284dce227a0..7ae56a3c7607 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4595,10 +4595,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
/*
* Retry the page fault if the gfn hit a memslot that is being deleted
* or moved. This ensures any existing SPTEs for the old memslot will
- * be zapped before KVM inserts a new MMIO SPTE for the gfn.
+ * be zapped before KVM inserts a new MMIO SPTE for the gfn. Punt the
+ * error to userspace if this is a prefault, as KVM's prefaulting ABI
+ * doesn't need to provide the same forward progress guarantees as KVM_RUN.
*/
- if (slot->flags & KVM_MEMSLOT_INVALID)
+ if (slot->flags & KVM_MEMSLOT_INVALID) {
+ if (fault->prefetch)
+ return -EAGAIN;
+
return RET_PF_RETRY;
+ }
if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
/*
base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
--
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-20 16:13 ` Sean Christopherson
@ 2025-05-21 1:45 ` Yan Zhao
2025-05-21 15:45 ` Sean Christopherson
From: Yan Zhao @ 2025-05-21 1:45 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> On Tue, May 20, 2025, Yan Zhao wrote:
> > On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> > > On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > > > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > > > kicking vCPUs out of KVM?
> > > > >
> > > > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > > > KVM can simply drop and reacquire SRCU in the relevant paths.
> > > >
> > > > During the initial debugging and kicking around stage, this is the first
> > > > direction we looked. But kvm_gmem_populate() doesn't have srcu locked, so then
> > > > kvm_tdp_map_page() tries to unlock without it being held. (although that version
> > > > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > > > came up with the version in this series, which we held review on for the list:
> > >
> > > Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
> > >
> > > > > However, upon further consideration, I am reluctant to implement this fix for
> > >
> > > Which fix?
> > >
> > > > > the following reasons:
> > > > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > > > - While retrying with srcu unlock and lock can workaround the
> > > > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > > > and tdx_handle_ept_violation() faulting with different memslot layouts.
> > >
> > > This behavior has existed since pretty much the beginning of KVM time. TDX is the
> > > oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> > > RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> > > RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
> > >
> > > Arguably, _TDX_ is buggy by not providing this behavior.
> > >
> > > > I'm not sure why the second one is really a problem. For the first one I think
> > > > that path could just take the srcu lock in the proper order with
> > > > kvm->slots_lock?
> > >
> > > Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> > > problematic, as KVM synchronizes SRCU while holding slots_lock.
> > >
> > > That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> > > hack. What about something like this?
> > So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
>
> Yes, ideally. Acquiring SRCU wouldn't be the end of the world, but I don't love
> the idea of taking a lock just so that the lock can be conditionally dropped in
> a common flow. It's not a deal breaker (I wouldn't be surprised if there's at
> least one path in KVM that acquires SRCU purely because of such behavior), but
> separating kvm_tdp_prefault_page() from kvm_tdp_map_page() is cleaner.
>
> > Generally I think it's good, except that it missed a kvm_mmu_reload() (please
> > refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in
> > tdx_handle_ept_violation() path (So, Reinette reported it failed the TDX stress
> > tests at [1]).
>
> > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > >
> > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > +{
> > > + int r;
> > > +
> > > + /*
> > > + * Restrict to TDP page fault, since that's the only case where the MMU
> > > + * is indexed by GPA.
> > > + */
> > > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > + return -EOPNOTSUPP;
> > > +
> > > + for (;;) {
> > > + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > + if (r != -EAGAIN)
> > > + break;
> > > +
> > > + /* Comment goes here. */
> > > + kvm_vcpu_srcu_read_unlock(vcpu);
> > > + kvm_vcpu_srcu_read_lock(vcpu);
> > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > the memslot removal succeeds after releasing the SRCU, then the old root is
> > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > from being always true.
>
> That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> otherwise kvm_mmu_reload() will do nothing.
In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
kvm_mmu_reload().
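I.e. after that commit, kvm_mmu_reload() looks roughly like this (paraphrased from memory, trimmed):

static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
		kvm_mmu_free_obsolete_roots(vcpu);

	if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
		return 0;

	return kvm_mmu_load(vcpu);
}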
> Thinking about this scenario more, I don't mind punting this problem to userspace
> for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
> breaking userspace, should provide consistent behavior regardless of VM type, and
> KVM needs the "complex" code irrespective of this particular scenario.
>
> I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> because then KVM is effectively providing the same overall behavior as KVM_RUN,
> just without slightly different roles and responsibilities between KVM and
> userspace. And -ENOENT is also flat out wrong for the case where a memslot is
> being moved, but the new base+size still contains the to-be-faulted GPA.
>
> I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> encountered during prefault (as identified by fault->prefetch).
>
> For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> i.e. punting to userspace is not a viable option. But that path also has options
> that aren't available to prefaulting. E.g. it could (and probably should) break
> early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
Hmm, for TDX, there's no request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS for slot
removal. (see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control
memslot zap behavior").
> would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
> called out, the zero-step mess really needs to be solved in a more robust fashion.
>
> So this?
Looks good to me for non-TDX side.
For TDX, could we provide below fix based on your change?
For a private fault, -EFAULT will be returned to userspace after the retry anyway
once the slot is completely removed, which is unlike non-private faults that go
down the emulate path after the retry.
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4602,6 +4602,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
if (fault->prefetch)
return -EAGAIN;
+ if (fault->is_private) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
return RET_PF_RETRY;
}
And would you mind if I included your patch in my next version? I can update the
related selftests as well.
Thanks
Yan
> ---
> From: Sean Christopherson <seanjc@google.com>
> Date: Tue, 20 May 2025 07:55:32 -0700
> Subject: [PATCH] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves
> memslot during prefault
>
> Return -EAGAIN if userspace attempts to delete or move a memslot while also
> prefaulting memory for that same memslot, i.e. force userspace to retry
> instead of trying to handle the scenario entirely within KVM. Unlike
> KVM_RUN, which needs to handle the scenario entirely within KVM because
> userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
> return -EAGAIN without breaking userspace as this scenario can't have ever
> worked (and there's no sane use case for prefaulting to a memslot that's
> being deleted/moved).
>
> And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
> forward progress. E.g. to handle such a scenario, KVM would need to drop
> and reacquire SRCU to break the deadlock between the memslot update
> (synchronizes SRCU) and the prefault (waits for the memslot update to
> complete).
>
> However, dropping SRCU creates more problems, as completing the memslot
> update will bump the memslot generation, which in turn will invalidate the
> MMU root. To handle that, prefaulting would need to handle pending
> KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
> mapping each individual page.
>
> I.e. to fully handle this scenario, prefaulting would eventually need to
> look a lot like vcpu_enter_guest(). Given that there's no reasonable use
> case and practically zero risk of breaking userspace, punt the problem to
> userspace and avoid adding unnecessary complexity to the prefault path.
>
> Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
> held for the entire duration of populate(), i.e. any memslot modifications
> will be fully serialized against TDX's flavor of prefaulting.
>
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
> Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
> Debugged-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a284dce227a0..7ae56a3c7607 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4595,10 +4595,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
> /*
> * Retry the page fault if the gfn hit a memslot that is being deleted
> * or moved. This ensures any existing SPTEs for the old memslot will
> - * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> + * be zapped before KVM inserts a new MMIO SPTE for the gfn. Punt the
> + * error to userspace if this is a prefault, as KVM's prefaulting ABI
> + * doesn't need to provide the same forward progress guarantees as KVM_RUN.
> */
> - if (slot->flags & KVM_MEMSLOT_INVALID)
> + if (slot->flags & KVM_MEMSLOT_INVALID) {
> + if (fault->prefetch)
> + return -EAGAIN;
> +
> return RET_PF_RETRY;
> + }
>
> if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
> /*
>
> base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
> --
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-21 1:45 ` Yan Zhao
@ 2025-05-21 15:45 ` Sean Christopherson
2025-05-22 0:40 ` Yan Zhao
From: Sean Christopherson @ 2025-05-21 15:45 UTC (permalink / raw)
To: Yan Zhao
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Wed, May 21, 2025, Yan Zhao wrote:
> On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> > > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > > }
> > > > EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > > >
> > > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > > +{
> > > > + int r;
> > > > +
> > > > + /*
> > > > + * Restrict to TDP page fault, since that's the only case where the MMU
> > > > + * is indexed by GPA.
> > > > + */
> > > > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > > + return -EOPNOTSUPP;
> > > > +
> > > > + for (;;) {
> > > > + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > > + if (r != -EAGAIN)
> > > > + break;
> > > > +
> > > > + /* Comment goes here. */
> > > > + kvm_vcpu_srcu_read_unlock(vcpu);
> > > > + kvm_vcpu_srcu_read_lock(vcpu);
> > > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > > the memslot removal succeeds after releasing the SRCU, then the old root is
> > > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > > from being always true.
> >
> > That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> > otherwise kvm_mmu_reload() will do nothing.
> In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
> kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
> kvm_mmu_reload().
Oh, right! I completely forgot about that. Hmm, that reduces the complexity a
little bit, but I'm still leaning towards punting -EAGAIN to userspace.
> > Thinking about this scenario more, I don't mind punting this problem to userspace
> > for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> > because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> > to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
> > breaking userspace, should provide consistent behavior regardless of VM type, and
> > KVM needs the "complex" code irrespective of this particular scenario.
> >
> > I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> > because then KVM is effectively providing the same overall behavior as KVM_RUN,
> > just with slightly different roles and responsibilities between KVM and
> > userspace. And -ENOENT is also flat out wrong for the case where a memslot is
> > being moved, but the new base+size still contains the to-be-faulted GPA.
> >
> > I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> > into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> > encountered during prefault (as identified by fault->prefetch).
> >
> > For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> > i.e. punting to userspace is not a viable option. But that path also has options
> > that aren't available to prefaulting. E.g. it could (and probably should) break
> > early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
> Hmm, for TDX, there's no request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS for slot
> removal. (see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control
> memslot zap behavior").
>
> > would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
> > called out, the zero-step mess really needs to be solved in a more robust fashion.
> >
> > So this?
> Looks good to me for non-TDX side.
>
> For TDX, could we provide below fix based on your change?
Hmm, I'd prefer not to, mainly because I don't want to special case things even
more in the common MMU code, e.g. I don't want to bleed the "no memslot == exit"
logic into multiple locations. And very strictly speaking, a memory fault exit
isn't guaranteed, as userspace could set a new memory region before the vCPU
retries the fault.
Returning -EAGAIN isn't an option because that would break userspace (e.g. our
VMM doesn't handle EAGAIN and supports SNP), and documenting the behavior would
be weird. For KVM_PRE_FAULT_MEMORY, KVM's documentation can simply state that
EAGAIN is returned when KVM encounters temporary resource contention and that userspace
should simply try again. It's an ABI change, but for a nascent ioctl() and a
scenario that won't be hit in practice, so I'm confident we can make the change
without breaking userspace.
And again, this is an unfortunate side effect of zero-step; there's no such
restriction for SNP, and ideally the TDX zero-step pain will be solved and this
would also go away for TDX too, so I'm hesitant to bake this behavior into KVM's
ABI.
My best idea is to special case this in tdx_handle_ept_violation(). It's also
very gross, but at least the nastiness is limited to the zero-step mitigation
mess, and is co-located with the code that doesn't actually play nice with
RET_PF_RETRY. E.g.
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..ca47d08ae112 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1907,6 +1907,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
* handle retries locally in their EPT violation handlers.
*/
while (1) {
+ struct kvm_memory_slot *slot;
+
ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
if (ret != RET_PF_RETRY || !local_retry)
@@ -1920,6 +1922,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
break;
}
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
+ if (slot && slot->flags & KVM_MEMSLOT_INVALID)
+ break;
+
cond_resched();
}
return ret;
> For a private fault, -EFAULT will be returned to userspace after the retry anyway
> once the slot is completely removed, which is unlike non-private faults that go
> down the emulate path after the retry.
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4602,6 +4602,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
> if (fault->prefetch)
> return -EAGAIN;
>
> + if (fault->is_private) {
> + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> + return -EFAULT;
> + }
> +
> return RET_PF_RETRY;
> }
>
>
> And would you mind if I included your patch in my next version? I can update the
> related selftests as well.
Yes, please do!
* Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
2025-05-21 15:45 ` Sean Christopherson
@ 2025-05-22 0:40 ` Yan Zhao
From: Yan Zhao @ 2025-05-22 0:40 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
Reinette Chatre, linux-kernel@vger.kernel.org
On Wed, May 21, 2025 at 08:45:21AM -0700, Sean Christopherson wrote:
> On Wed, May 21, 2025, Yan Zhao wrote:
> > On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> > > > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > > > >
> > > > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > > > +{
> > > > > + int r;
> > > > > +
> > > > > + /*
> > > > > + * Restrict to TDP page fault, since that's the only case where the MMU
> > > > > + * is indexed by GPA.
> > > > > + */
> > > > > + if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > > > + return -EOPNOTSUPP;
> > > > > +
> > > > > + for (;;) {
> > > > > + r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > > > + if (r != -EAGAIN)
> > > > > + break;
> > > > > +
> > > > > + /* Comment goes here. */
> > > > > + kvm_vcpu_srcu_read_unlock(vcpu);
> > > > > + kvm_vcpu_srcu_read_lock(vcpu);
> > > > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > > > the memslot removal succeeds after releasing the SRCU, then the old root is
> > > > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > > > from being always true.
> > >
> > > That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> > > otherwise kvm_mmu_reload() will do nothing.
> > In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
> > kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
> > kvm_mmu_reload().
>
> Oh, right! I completely forgot about that. Hmm, that reduces the complexity a
> little bit, but I'm still leaning towards punting -EAGAIN to userspace.
>
> > > Thinking about this scenario more, I don't mind punting this problem to userspace
> > > for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> > > because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> > > to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
> > > breaking userspace, should provide consistent behavior regardless of VM type, and
> > > KVM needs the "complex" code irrespective of this particular scenario.
> > >
> > > I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> > > because then KVM is effectively providing the same overall behavior as KVM_RUN,
> > > just with slightly different roles and responsibilities between KVM and
> > > userspace. And -ENOENT is also flat out wrong for the case where a memslot is
> > > being moved, but the new base+size still contains the to-be-faulted GPA.
> > >
> > > I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> > > into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> > > encountered during prefault (as identified by fault->prefetch).
> > >
> > > For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> > > i.e. punting to userspace is not a viable option. But that path also has options
> > > that aren't available to prefaulting. E.g. it could (and probably should) break
> > > early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
> > Hmm, for TDX, there's no request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS for slot
> > removal. (see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control
> > memslot zap behavior").
> >
> > > would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
> > > called out, the zero-step mess really needs to be solved in a more robust fashion.
> > >
> > > So this?
> > Looks good to me for non-TDX side.
> >
> > For TDX, could we provide below fix based on your change?
>
> Hmm, I'd prefer not to, mainly because I don't want to special case things even
> more in the common MMU code, e.g. I don't want to bleed the "no memslot == exit"
> logic into multiple locations. And very strictly speaking, a memory fault exit
> isn't guaranteed, as userspace could set a new memory region before the vCPU
> retries the fault.
>
> Returning -EAGAIN isn't an option because that would break userspace (e.g. our
> VMM doesn't handle EAGAIN and supports SNP), and documenting the behavior would
> be weird. For KVM_PRE_FAULT_MEMORY, KVM's documentation can simply state that
> EAGAIN is returned when KVM encounters temporary resource contention and that userspace
> should simply try again. It's an ABI change, but for a nascent ioctl() and a
> scenario that won't be hit in practice, so I'm confident we can make the change
> without breaking userspace.
>
> And again, this is an unfortunate side effect of zero-step; there's no such
> restriction for SNP, and ideally the TDX zero-step pain will be solved and this
> would also go away for TDX too, so I'm hesitant to bake this behavior into KVM's
> ABI.
>
> My best idea is to special case this in tdx_handle_ept_violation(). It's also
> very gross, but at least the nastiness is limited to the zero-step mitigation
> mess, and is co-located with the code that doesn't actually play nice with
> RET_PF_RETRY. E.g.
Thank you, Sean.
We'll go down this path.
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..ca47d08ae112 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1907,6 +1907,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> * handle retries locally in their EPT violation handlers.
> */
> while (1) {
> + struct kvm_memory_slot *slot;
> +
> ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
>
> if (ret != RET_PF_RETRY || !local_retry)
> @@ -1920,6 +1922,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> break;
> }
>
> + slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> + if (slot && slot->flags & KVM_MEMSLOT_INVALID)
> + break;
> +
> cond_resched();
> }
> return ret;
>
...
> > And would you mind if I included your patch in my next version? I can update the
> > related selftests as well.
>
> Yes, please do!
Thanks.