* [PATCH 0/4] Small changes related to prefetch and spurious faults
From: Yan Zhao @ 2025-02-07 3:06 UTC (permalink / raw)
To: pbonzini, seanjc; +Cc: rick.p.edgecombe, linux-kernel, kvm, Yan Zhao
Hi,
This series contains some small changes related to prefetch/prefault and
spurious faults.
Patch 1 checks whether a shadow-present old SPTE is a leaf before treating
a prefetch fault as spurious.
Patch 2 merges the prefetch case into the is_access_allowed() check.
Patch 3 follows the previous discussion at [1].
Patch 4 always frees obsolete roots before reloading the MMU.
With the scenario below:
1. add a memslot with size 4K
2. prefault GPA A in the memslot
3. delete the memslot
4. re-add the memslot with size 2M
5. prefault GPA A again
Patch 1 is required if the "zap all" quirk is disabled in step 3.
Patch 4 is required if a full zap is performed in step 3 and if steps 2/5
are executed before any vcpu_run().
The series can be applied to both
f7bafceba76e9ab475b413578c1757ee18c3e44b and
eb723766b1030a23c38adf2348b7c3d1409d11f0.
Thanks
Yan
[1] https://lore.kernel.org/kvm/Z2WTZGHmPDXHSrTA@google.com/
Yan Zhao (4):
KVM: x86/mmu: Further check old SPTE is leaf for spurious prefetch
fault
KVM: x86/tdp_mmu: Merge the prefetch into the is_access_allowed()
check
KVM: x86/mmu: Make sure pfn is not changed for spurious fault
KVM: x86/mmu: Free obsolete roots when pre-faulting SPTEs
arch/x86/kvm/mmu/mmu.c | 8 +++++++-
arch/x86/kvm/mmu/tdp_mmu.c | 6 ++----
2 files changed, 9 insertions(+), 5 deletions(-)
--
2.43.2
* [PATCH 1/4] KVM: x86/mmu: Further check old SPTE is leaf for spurious prefetch fault
From: Yan Zhao @ 2025-02-07 3:07 UTC (permalink / raw)
To: pbonzini, seanjc; +Cc: rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Instead of simply treating a prefetch fault as spurious when there's a
shadow-present old SPTE, further check whether the old SPTE is a leaf to
determine whether the prefetch fault is spurious.

It's not reasonable to treat a prefetch fault as spurious when there's a
shadow-present non-leaf SPTE but no shadow-present leaf SPTE. E.g., with
the sequence below, a prefetch fault should not be regarded as spurious:

1. add a memslot with size 4K
2. prefault GPA A in the memslot
3. delete the memslot (zap all disabled)
4. re-add the memslot with size 2M
5. prefault GPA A again

In step 5, the prefetch fault attempts to install a 2M huge entry. Since
step 3 zaps the leaf SPTE for GPA A while keeping the non-leaf SPTE, the
leaf entry will remain empty after step 5 if the prefetch fault is
regarded as spurious due to a shadow-present non-leaf SPTE being found.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c     | 2 +-
 arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a45ae60e84ab..3d74e680006f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2846,7 +2846,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	if (is_shadow_present_pte(*sptep)) {
-		if (prefetch)
+		if (prefetch && is_last_spte(*sptep, level))
 			return RET_PF_SPURIOUS;
 
 		/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 046b6ba31197..ab65fd915ef2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1137,7 +1137,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
 		return RET_PF_RETRY;
 
-	if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
+	if (fault->prefetch && is_shadow_present_pte(iter->old_spte) &&
+	    is_last_spte(iter->old_spte, iter->level))
 		return RET_PF_SPURIOUS;
 
 	if (is_shadow_present_pte(iter->old_spte) &&
--
2.43.2
* [PATCH 2/4] KVM: x86/tdp_mmu: Merge the prefetch into the is_access_allowed() check
From: Yan Zhao @ 2025-02-07 3:08 UTC (permalink / raw)
To: pbonzini, seanjc; +Cc: rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Merge the prefetch check into the is_access_allowed() check to determine a
spurious fault.

In the TDP MMU, a spurious prefetch fault should also pass the
is_access_allowed() check. Combine these checks to avoid redundancy.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ab65fd915ef2..5f9e7374220e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1137,10 +1137,6 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
 		return RET_PF_RETRY;
 
-	if (fault->prefetch && is_shadow_present_pte(iter->old_spte) &&
-	    is_last_spte(iter->old_spte, iter->level))
-		return RET_PF_SPURIOUS;
-
 	if (is_shadow_present_pte(iter->old_spte) &&
 	    is_access_allowed(fault, iter->old_spte) &&
 	    is_last_spte(iter->old_spte, iter->level))
--
2.43.2
* Re: [PATCH 2/4] KVM: x86/tdp_mmu: Merge the prefetch into the is_access_allowed() check
From: Sean Christopherson @ 2025-02-07 15:03 UTC (permalink / raw)
To: Yan Zhao; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Fri, Feb 07, 2025, Yan Zhao wrote:
> Merge the prefetch check into the is_access_allowed() check to determine a
> spurious fault.
>
> In the TDP MMU, a spurious prefetch fault should also pass the
> is_access_allowed() check.

How so?

1. vCPU takes a write-fault on a swapped out page and queues an async #PF
2. A different task installs a writable SPTE
3. A third task write-protects the SPTE for dirty logging
4. Async #PF handler faults in the SPTE, encounters a read-only SPTE for its
   write fault.

KVM shouldn't mark the gfn as dirty in this case.
* Re: [PATCH 2/4] KVM: x86/tdp_mmu: Merge the prefetch into the is_access_allowed() check
From: Yan Zhao @ 2025-02-08 2:29 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Fri, Feb 07, 2025 at 07:03:46AM -0800, Sean Christopherson wrote:
> On Fri, Feb 07, 2025, Yan Zhao wrote:
> > Merge the prefetch check into the is_access_allowed() check to determine a
> > spurious fault.
> >
> > In the TDP MMU, a spurious prefetch fault should also pass the
> > is_access_allowed() check.
>
> How so?
>
> 1. vCPU takes a write-fault on a swapped out page and queues an async #PF
> 2. A different task installs a writable SPTE
> 3. A third task write-protects the SPTE for dirty logging
> 4. Async #PF handler faults in the SPTE, encounters a read-only SPTE for its
>    write fault.
>
> KVM shouldn't mark the gfn as dirty in this case.
Hmm, but when we prefetch an entry, if a gfn is not write-tracked, marking
the gfn as dirty is allowed, just as when there's no existing SPTE, where a
prefetch fault also marks a gfn as dirty.

If a gfn is write-tracked, make_spte() will not grant write permission to
make the gfn dirty.

However, I admit that making the new SPTE not-accessed again is not
desired. What about the below?

@@ -983,7 +983,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		return RET_PF_RETRY;
 
 	if (is_shadow_present_pte(iter->old_spte) &&
-	    is_access_allowed(fault, iter->old_spte) &&
+	    (fault->prefetch || is_access_allowed(fault, iter->old_spte)) &&
 	    is_last_spte(iter->old_spte, iter->level))
 		return RET_PF_SPURIOUS;
* Re: [PATCH 2/4] KVM: x86/tdp_mmu: Merge the prefetch into the is_access_allowed() check
From: Sean Christopherson @ 2025-02-10 22:17 UTC (permalink / raw)
To: Yan Zhao; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Sat, Feb 08, 2025, Yan Zhao wrote:
> On Fri, Feb 07, 2025 at 07:03:46AM -0800, Sean Christopherson wrote:
> > On Fri, Feb 07, 2025, Yan Zhao wrote:
> > > Merge the prefetch check into the is_access_allowed() check to determine a
> > > spurious fault.
> > >
> > > In the TDP MMU, a spurious prefetch fault should also pass the
> > > is_access_allowed() check.
> >
> > How so?
> >
> > 1. vCPU takes a write-fault on a swapped out page and queues an async #PF
> > 2. A different task installs a writable SPTE
> > 3. A third task write-protects the SPTE for dirty logging
> > 4. Async #PF handler faults in the SPTE, encounters a read-only SPTE for its
> >    write fault.
> >
> > KVM shouldn't mark the gfn as dirty in this case.
> Hmm, but when we prefetch an entry, if a gfn is not write-tracked, marking
> the gfn as dirty is allowed, just as when there's no existing SPTE, where a
> prefetch fault also marks a gfn as dirty.

Yeah, but there's a difference between installing a SPTE and overwriting a SPTE.

> If a gfn is write-tracked, make_spte() will not grant write permission to
> make the gfn dirty.
>
> However, I admit that making the new SPTE not-accessed again is not
> desired. What about the below?
>
> @@ -983,7 +983,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  		return RET_PF_RETRY;
> 
>  	if (is_shadow_present_pte(iter->old_spte) &&
> -	    is_access_allowed(fault, iter->old_spte) &&
> +	    (fault->prefetch || is_access_allowed(fault, iter->old_spte)) &&
>  	    is_last_spte(iter->old_spte, iter->level))
>  		return RET_PF_SPURIOUS;

Works for me.
* [PATCH 3/4] KVM: x86/mmu: Make sure pfn is not changed for spurious fault
From: Yan Zhao @ 2025-02-07 3:09 UTC (permalink / raw)
To: pbonzini, seanjc; +Cc: rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Make sure pfn is not changed for a spurious fault by warning in the TDP
MMU. For the shadow path, only treat a prefetch fault as spurious when the
pfn is not changed, since rmap removal and re-add are required when the
pfn changes.

Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c     | 3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d74e680006f..47fd3712afe6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2846,7 +2846,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	if (is_shadow_present_pte(*sptep)) {
-		if (prefetch && is_last_spte(*sptep, level))
+		if (prefetch && is_last_spte(*sptep, level) &&
+		    pfn == spte_to_pfn(*sptep))
 			return RET_PF_SPURIOUS;
 
 		/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5f9e7374220e..8b37e4f542f3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1139,7 +1139,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 
 	if (is_shadow_present_pte(iter->old_spte) &&
 	    is_access_allowed(fault, iter->old_spte) &&
-	    is_last_spte(iter->old_spte, iter->level))
+	    is_last_spte(iter->old_spte, iter->level) &&
+	    !WARN_ON_ONCE(fault->pfn != spte_to_pfn(iter->old_spte)))
 		return RET_PF_SPURIOUS;
 
 	if (unlikely(!fault->slot))
--
2.43.2
* Re: [PATCH 3/4] KVM: x86/mmu: Make sure pfn is not changed for spurious fault
From: Sean Christopherson @ 2025-02-07 15:07 UTC (permalink / raw)
To: Yan Zhao; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Fri, Feb 07, 2025, Yan Zhao wrote:
> Make sure pfn is not changed for a spurious fault by warning in the TDP
> MMU. For the shadow path, only treat a prefetch fault as spurious when the
> pfn is not changed, since rmap removal and re-add are required when the
> pfn changes.

I like sanity checks, but I don't like special casing "prefetch" faults like this.
KVM should _never_ change the PFN of a shadow-present SPTE. The TDP MMU already
BUG()s on this, and mmu_spte_update() WARNs on the transition.
* Re: [PATCH 3/4] KVM: x86/mmu: Make sure pfn is not changed for spurious fault
From: Yan Zhao @ 2025-02-08 2:37 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Fri, Feb 07, 2025 at 07:07:06AM -0800, Sean Christopherson wrote:
> On Fri, Feb 07, 2025, Yan Zhao wrote:
> > Make sure pfn is not changed for a spurious fault by warning in the TDP
> > MMU. For the shadow path, only treat a prefetch fault as spurious when the
> > pfn is not changed, since rmap removal and re-add are required when the
> > pfn changes.
>
> I like sanity checks, but I don't like special casing "prefetch" faults like this.
> KVM should _never_ change the PFN of a shadow-present SPTE. The TDP MMU already
> BUG()s on this, and mmu_spte_update() WARNs on the transition.
However, both the TDP MMU and mmu_set_spte() return RET_PF_SPURIOUS directly
before the BUG() in the TDP MMU or the WARN in mmu_spte_update() could be hit.
* Re: [PATCH 3/4] KVM: x86/mmu: Make sure pfn is not changed for spurious fault
From: Sean Christopherson @ 2025-02-10 22:23 UTC (permalink / raw)
To: Yan Zhao; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Sat, Feb 08, 2025, Yan Zhao wrote:
> On Fri, Feb 07, 2025 at 07:07:06AM -0800, Sean Christopherson wrote:
> > On Fri, Feb 07, 2025, Yan Zhao wrote:
> > > Make sure pfn is not changed for a spurious fault by warning in the TDP
> > > MMU. For the shadow path, only treat a prefetch fault as spurious when the
> > > pfn is not changed, since rmap removal and re-add are required when the
> > > pfn changes.
> >
> > I like sanity checks, but I don't like special casing "prefetch" faults like this.
> > KVM should _never_ change the PFN of a shadow-present SPTE. The TDP MMU already
> > BUG()s on this, and mmu_spte_update() WARNs on the transition.
> However, both the TDP MMU and mmu_set_spte() return RET_PF_SPURIOUS directly
> before the BUG() in the TDP MMU or the WARN in mmu_spte_update() could be hit.

Ah, that's very different than treating a prefetch fault as !spurious though. I
would be a-ok with this:

	if (is_shadow_present_pte(iter->old_spte) &&
	    (fault->prefetch || is_access_allowed(fault, iter->old_spte)) &&
	    is_last_spte(iter->old_spte, iter->level)) {
		WARN_ON_ONCE(fault->pfn != spte_to_pfn(iter->old_spte));
		return RET_PF_SPURIOUS;
	}
* Re: [PATCH 3/4] KVM: x86/mmu: Make sure pfn is not changed for spurious fault
From: Yan Zhao @ 2025-02-11 6:48 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Mon, Feb 10, 2025 at 02:23:38PM -0800, Sean Christopherson wrote:
> Ah, that's very different than treating a prefetch fault as !spurious though. I
> would be a-ok with this:
>
> 	if (is_shadow_present_pte(iter->old_spte) &&
> 	    (fault->prefetch || is_access_allowed(fault, iter->old_spte)) &&
> 	    is_last_spte(iter->old_spte, iter->level)) {
> 		WARN_ON_ONCE(fault->pfn != spte_to_pfn(iter->old_spte));
> 		return RET_PF_SPURIOUS;
> 	}
Thanks! Will also update the shadow MMU part as below.

	if (is_shadow_present_pte(*sptep)) {
		if (prefetch && is_last_spte(*sptep, level)) {
			WARN_ON_ONCE(pfn != spte_to_pfn(*sptep));
			return RET_PF_SPURIOUS;
		}
	}
* [PATCH 4/4] KVM: x86/mmu: Free obsolete roots when pre-faulting SPTEs
From: Yan Zhao @ 2025-02-07 3:09 UTC (permalink / raw)
To: pbonzini, seanjc; +Cc: rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Always free obsolete roots when pre-faulting SPTEs, in case the prefault is
called after a root is invalidated (e.g., by memslot removal) but before
any vcpu_enter_guest() processing of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS.

The lack of kvm_mmu_free_obsolete_roots() in this scenario can lead to
kvm_mmu_reload() failing to load a new root when the current root hpa is an
obsolete root (which is not INVALID_PAGE). Consequently,
kvm_arch_vcpu_pre_fault_memory() will retry infinitely due to the check in
is_page_fault_stale().

It's safe to call kvm_mmu_free_obsolete_roots() even if there are no
obsolete roots, or if it's called a second time when vcpu_enter_guest()
later processes KVM_REQ_MMU_FREE_OBSOLETE_ROOTS. This is because
kvm_mmu_free_obsolete_roots() sets an obsolete root to INVALID_PAGE and
does nothing to an INVALID_PAGE.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 47fd3712afe6..72f68458049a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4740,7 +4740,12 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	/*
 	 * reload is efficient when called repeatedly, so we can do it on
 	 * every iteration.
+	 * Before reload, free obsolete roots in case the prefault is called
+	 * after a root is invalidated (e.g., by memslot removal) but
+	 * before any vcpu_enter_guest() processing of
+	 * KVM_REQ_MMU_FREE_OBSOLETE_ROOTS.
 	 */
+	kvm_mmu_free_obsolete_roots(vcpu);
 	r = kvm_mmu_reload(vcpu);
 	if (r)
 		return r;
--
2.43.2
* Re: [PATCH 4/4] KVM: x86/mmu: Free obsolete roots when pre-faulting SPTEs
From: Sean Christopherson @ 2025-02-07 15:12 UTC (permalink / raw)
To: Yan Zhao; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Fri, Feb 07, 2025, Yan Zhao wrote:
> Always free obsolete roots when pre-faulting SPTEs, in case the prefault is
> called after a root is invalidated (e.g., by memslot removal) but before
> any vcpu_enter_guest() processing of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS.
>
> The lack of kvm_mmu_free_obsolete_roots() in this scenario can lead to
> kvm_mmu_reload() failing to load a new root when the current root hpa is an
> obsolete root (which is not INVALID_PAGE). Consequently,
> kvm_arch_vcpu_pre_fault_memory() will retry infinitely due to the check in
> is_page_fault_stale().
>
> It's safe to call kvm_mmu_free_obsolete_roots() even if there are no
> obsolete roots, or if it's called a second time when vcpu_enter_guest()
> later processes KVM_REQ_MMU_FREE_OBSOLETE_ROOTS. This is because
> kvm_mmu_free_obsolete_roots() sets an obsolete root to INVALID_PAGE and
> does nothing to an INVALID_PAGE.

Why is userspace changing memslots while prefaulting?

> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 47fd3712afe6..72f68458049a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4740,7 +4740,12 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  	/*
>  	 * reload is efficient when called repeatedly, so we can do it on
>  	 * every iteration.
> +	 * Before reload, free obsolete roots in case the prefault is called
> +	 * after a root is invalidated (e.g., by memslot removal) but
> +	 * before any vcpu_enter_guest() processing of
> +	 * KVM_REQ_MMU_FREE_OBSOLETE_ROOTS.
>  	 */
> +	kvm_mmu_free_obsolete_roots(vcpu);
>  	r = kvm_mmu_reload(vcpu);
>  	if (r)
>  		return r;

I would prefer to check for obsolete roots in kvm_mmu_reload() itself, but
keep the main kvm_check_request() so that the common case handles the resulting
TLB flush without having to loop back around in vcpu_enter_guest().

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 050a0e229a4d..f2b36d32ef40 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -104,6 +104,9 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 
 static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
 {
+	if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
+		kvm_mmu_free_obsolete_roots(vcpu);
+
 	/*
 	 * Checking root.hpa is sufficient even when KVM has mirror root.
 	 * We can have either:
* Re: [PATCH 4/4] KVM: x86/mmu: Free obsolete roots when pre-faulting SPTEs
From: Yan Zhao @ 2025-02-08 3:01 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Fri, Feb 07, 2025 at 07:12:04AM -0800, Sean Christopherson wrote:
> On Fri, Feb 07, 2025, Yan Zhao wrote:
> > Always free obsolete roots when pre-faulting SPTEs, in case the prefault is
> > called after a root is invalidated (e.g., by memslot removal) but before
> > any vcpu_enter_guest() processing of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS.
> >
> > The lack of kvm_mmu_free_obsolete_roots() in this scenario can lead to
> > kvm_mmu_reload() failing to load a new root when the current root hpa is an
> > obsolete root (which is not INVALID_PAGE). Consequently,
> > kvm_arch_vcpu_pre_fault_memory() will retry infinitely due to the check in
> > is_page_fault_stale().
>
> Why is userspace changing memslots while prefaulting?
It currently only exists in the kvm selftest (written by myself...)
Not sure if there's any real use case like this.

> I would prefer to check for obsolete roots in kvm_mmu_reload() itself, but
Yes, it's better!
I previously considered doing it this way, but I was afraid of introducing
overhead (the extra compare) to kvm_mmu_reload(), which is called quite
frequently.

But maybe we can remove the check in vcpu_enter_guest() to reduce the overhead?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2d9a16fd4d3..6a1f2780a094 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10731,8 +10731,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 				goto out;
 			}
 		}
-		if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
-			kvm_mmu_free_obsolete_roots(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
 		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))

> keep the main kvm_check_request() so that the common case handles the resulting
> TLB flush without having to loop back around in vcpu_enter_guest().
Hmm, I'm a little confused.
What is the resulting TLB flush?

> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 050a0e229a4d..f2b36d32ef40 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -104,6 +104,9 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> 
>  static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
>  {
> +	if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> +		kvm_mmu_free_obsolete_roots(vcpu);
> +
> 	/*
> 	 * Checking root.hpa is sufficient even when KVM has mirror root.
> 	 * We can have either:
* Re: [PATCH 4/4] KVM: x86/mmu: Free obsolete roots when pre-faulting SPTEs
From: Sean Christopherson @ 2025-02-10 22:41 UTC (permalink / raw)
To: Yan Zhao; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Sat, Feb 08, 2025, Yan Zhao wrote:
> On Fri, Feb 07, 2025 at 07:12:04AM -0800, Sean Christopherson wrote:
> > Why is userspace changing memslots while prefaulting?
> It currently only exists in the kvm selftest (written by myself...)
> Not sure if there's any real use case like this.

It's decidedly odd. I asked, because maybe there's a way we can disallow the
scenario. Doing that without making things more complex than simply handling
obsolete roots is probably a fool's errand though.

> > I would prefer to check for obsolete roots in kvm_mmu_reload() itself, but
> Yes, it's better!
> I previously considered doing it this way, but I was afraid of introducing
> overhead (the extra compare) to kvm_mmu_reload(), which is called quite
> frequently.
>
> But maybe we can remove the check in vcpu_enter_guest() to reduce the overhead?
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b2d9a16fd4d3..6a1f2780a094 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10731,8 +10731,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  				goto out;
>  			}
>  		}
> -		if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> -			kvm_mmu_free_obsolete_roots(vcpu);
>  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
>  			__kvm_migrate_timers(vcpu);
>  		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
>
> > keep the main kvm_check_request() so that the common case handles the resulting
> > TLB flush without having to loop back around in vcpu_enter_guest().
> Hmm, I'm a little confused.
> What is the resulting TLB flush?

For the common case where KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is pending before
vcpu_enter_guest(), kvm_mmu_free_obsolete_roots() may trigger KVM_REQ_TLB_FLUSH
via kvm_mmu_commit_zap_page(). Processing KVM_REQ_MMU_FREE_OBSOLETE_ROOTS
before KVM_REQ_TLB_FLUSH means vcpu_enter_guest() doesn't have to "abort" and
redo the whole loop (the newly pending request won't be detected until
kvm_vcpu_exit_request(), which isn't that late in the entry sequence, but there
is a decent amount of work that needs to be undone).

On the other hand, the cost of kvm_check_request(), especially a check that's
guarded by kvm_request_pending(), is negligible.

That said, obsolete roots shouldn't actually require a TLB flush. E.g. the TDP
MMU hasn't flushed invalid roots since commit fcdffe97f80e ("KVM: x86/mmu:
Don't do TLB flush when zappings SPTEs in invalid roots"). I'd have to think
more about whether or not that's safe/correct for the shadow MMU though.

For this case, I think it makes sense to just add the check in kvm_mmu_reload().
* Re: [PATCH 4/4] KVM: x86/mmu: Free obsolete roots when pre-faulting SPTEs
  2025-02-10 22:41 ` Sean Christopherson
@ 2025-02-11  5:38 ` Yan Zhao
  0 siblings, 0 replies; 16+ messages in thread
From: Yan Zhao @ 2025-02-11 5:38 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini, rick.p.edgecombe, linux-kernel, kvm

On Mon, Feb 10, 2025 at 02:41:40PM -0800, Sean Christopherson wrote:
> On Sat, Feb 08, 2025, Yan Zhao wrote:
> > On Fri, Feb 07, 2025 at 07:12:04AM -0800, Sean Christopherson wrote:
> > > On Fri, Feb 07, 2025, Yan Zhao wrote:
> > > > Always free obsolete roots when pre-faulting SPTEs in case it's called
> > > > after a root is invalidated (e.g., by memslot removal) but before any
> > > > vcpu_enter_guest() processing of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS.
> > > >
> > > > Lack of kvm_mmu_free_obsolete_roots() in this scenario can lead to
> > > > kvm_mmu_reload() failing to load a new root if the current root hpa is an
> > > > obsolete root (which is not INVALID_PAGE).  Consequently,
> > > > kvm_arch_vcpu_pre_fault_memory() will retry infinitely due to the checking
> > > > of is_page_fault_stale().
> > > >
> > > > It's safe to call kvm_mmu_free_obsolete_roots() even if there are no
> > > > obsolete roots or if it's called a second time when vcpu_enter_guest()
> > > > later processes KVM_REQ_MMU_FREE_OBSOLETE_ROOTS.  This is because
> > > > kvm_mmu_free_obsolete_roots() sets an obsolete root to INVALID_PAGE and
> > > > will do nothing to an INVALID_PAGE.
> > >
> > > Why is userspace changing memslots while prefaulting?
> > It currently only exists in the kvm selftest (written by myself...)
> > Not sure if there's any real use case like this.
>
> It's decidedly odd.  I asked, because maybe there's a way we can disallow the
> scenario.  Doing that without making things more complex than simply handling
> obsolete roots is probably a fool's errand though.
Hmm, maybe it's not wise to do it like this.
But it also seems undesirable to disallow slot changes between two prefault
ioctls.  Rather than always prefaulting after memslots are finalized, some
userspace implementation might follow a pattern like "add a memslot, perform a
prefault".  Then if a memslot is somehow removed between two "add and prefault"
sequences, the second prefault ioctl would get stuck.

> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 47fd3712afe6..72f68458049a 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -4740,7 +4740,12 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > > >  	/*
> > > >  	 * reload is efficient when called repeatedly, so we can do it on
> > > >  	 * every iteration.
> > > > +	 * Before reload, free obsolete roots in case the prefault is called
> > > > +	 * after a root is invalidated (e.g., by memslot removal) but
> > > > +	 * before any vcpu_enter_guest() processing of
> > > > +	 * KVM_REQ_MMU_FREE_OBSOLETE_ROOTS.
> > > >  	 */
> > > > +	kvm_mmu_free_obsolete_roots(vcpu);
> > > >  	r = kvm_mmu_reload(vcpu);
> > > >  	if (r)
> > > >  		return r;
> > >
> > > I would prefer to do the check for obsolete roots in kvm_mmu_reload() itself, but
> > Yes, it's better!
> > I previously considered doing it this way, but I was afraid of introducing
> > overhead (the extra compare) to kvm_mmu_reload(), which is called quite
> > frequently.
> >
> > But maybe we can remove the check in vcpu_enter_guest() to reduce the overhead?
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index b2d9a16fd4d3..6a1f2780a094 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10731,8 +10731,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  			goto out;
> >  		}
> >  	}
> > -	if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> > -		kvm_mmu_free_obsolete_roots(vcpu);
> >  	if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> >  		__kvm_migrate_timers(vcpu);
> >  	if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
> >
> > > keep the main kvm_check_request() so that the common case handles the resulting
> > > TLB flush without having to loop back around in vcpu_enter_guest().
> > Hmm, I'm a little confused.
> > What is the resulting TLB flush?
>
> For the common case where KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is pending before
> vcpu_enter_guest, kvm_mmu_free_obsolete_roots() may trigger KVM_REQ_TLB_FLUSH
> via kvm_mmu_commit_zap_page(). Processing KVM_REQ_MMU_FREE_OBSOLETE_ROOTS before
> KVM_REQ_TLB_FLUSH means vcpu_enter_guest() doesn't have to "abort" and redo the
> whole loop (the newly pending request won't be detected until kvm_vcpu_exit_request(),
> which isn't that late in the entry sequence, but there is a decent amount of work
> that needs to be undone).
Thanks a lot for such a detailed explanation!

> On the other hand, the cost of kvm_check_request(), especially a check that's
> guarded by kvm_request_pending(), is negligible.
>
> That said, obsolete roots shouldn't actually require a TLB flush.  E.g. the TDP
> MMU hasn't flushed invalid roots since commit fcdffe97f80e ("KVM: x86/mmu: Don't
> do TLB flush when zappings SPTEs in invalid roots").  I'd have to think more about
> whether or not that's safe/correct for the shadow MMU though.
For the shadow MMU, I think it's safe to skip the remote TLB flush in
kvm_mmu_commit_zap_page() for an obsolete root, because kvm_mmu_zap_all_fast()
is the only place that toggles mmu_valid_gen.
kvm_mmu_zap_all_fast() makes all active SPs obsolete after this toggle and
kicks all vCPUs after that.  So, kvm_x86_call(flush_tlb_current)() is able to
carry out the TLB flush on each vCPU if kvm_mmu_load() is invoked.

> For this case, I think it makes sense to just add the check in kvm_mmu_reload().
Got it!
Thread overview: 16+ messages (newest: 2025-02-11 6:50 UTC)

2025-02-07  3:06 [PATCH 0/4] Small changes related to prefetch and spurious faults Yan Zhao
2025-02-07  3:07 ` [PATCH 1/4] KVM: x86/mmu: Further check old SPTE is leaf for spurious prefetch fault Yan Zhao
2025-02-07  3:08 ` [PATCH 2/4] KVM: x86/tdp_mmu: Merge the prefetch into the is_access_allowed() check Yan Zhao
2025-02-07 15:03   ` Sean Christopherson
2025-02-08  2:29     ` Yan Zhao
2025-02-10 22:17       ` Sean Christopherson
2025-02-07  3:09 ` [PATCH 3/4] KVM: x86/mmu: Make sure pfn is not changed for spurious fault Yan Zhao
2025-02-07 15:07   ` Sean Christopherson
2025-02-08  2:37     ` Yan Zhao
2025-02-10 22:23       ` Sean Christopherson
2025-02-11  6:48         ` Yan Zhao
2025-02-07  3:09 ` [PATCH 4/4] KVM: x86/mmu: Free obsolete roots when pre-faulting SPTEs Yan Zhao
2025-02-07 15:12   ` Sean Christopherson
2025-02-08  3:01     ` Yan Zhao
2025-02-10 22:41       ` Sean Christopherson
2025-02-11  5:38         ` Yan Zhao