* [RFC PATCH 0/1] Tweak TLB flushing when VMX is running on Hyper-V @ 2025-06-20 15:39 Jeremi Piotrowski 2025-06-20 15:39 ` [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes Jeremi Piotrowski 0 siblings, 1 reply; 12+ messages in thread From: Jeremi Piotrowski @ 2025-06-20 15:39 UTC (permalink / raw) To: Sean Christopherson, Vitaly Kuznetsov, Paolo Bonzini, kvm Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Jeremi Piotrowski Hi Sean/Parolo/Vitaly, Wanted to get your opinion on this change. Let me first introduce the scenario: We have been testing kata containers (containers wrapped in VM) in Azure, and found some significant issues with TLB flushing. This is a popular workload and requires launching many nested VMs quickly. When testing on a 64 core Intel VM (D64s_v5 in case someone is wondering), spinning up some 150-ish nested VMs in parallel, performance starts getting worse the more nested VMs are already running, CPU usage spikes to 100% on all cores and doesn't settle even when all nested VMs boot up. On an idle system a single nested VMs boots within seconds, but once we have a couple dozen running or so (doing nothing inside), boot time gets longer and longer for each new nested VM, they start hitting startup timeout etc. In some cases we never reach the point where all nested VMs are up and running. Investigating the issue we found that this can't be reproduced on AMD and on Intel when EPT is disabled. In both these cases the scenario completes within 20s or so. TPD_MMU or not doesn't make a difference. With EPT=Y the case takes minutes.Out of curiousity I also ran the test case on an n4-standard-64 VM on GCP and found that EPT=Y runs in ~30s, while EPT=N runs in ~20s (which I found slightly interesting). So that's when we starting looking at the TLB flushing code and found that INVEPT.global is used on every CPU migration and that it's an expensive function on Hyper-V. It also has an impact on every running nested VM, so we end up with lots of INVEPT.global calls - we reach 2000 calls/s before we're essentially stuck in 100% guest ttime. That's why I'm looking at tweaking the TLB flushing behavior to avoid it. I came across past discussions on this topic ([^1]) and after some thinking see two options: 1. Do you see a way to optimize this generically to avoid KVM_REQ_TLB_FLUSH on migration in current KVM? In nested (as in: KVM running nested) I think we rarely see CPU pinning used the way we it is on baremetal so it's not a rare of an operation. Much has also changed since [^1] and with kvm_mmu_reset_context() still being called in many paths we might be over flushing. Perhaps a loop flushing individual roots with roles that do not have a post_set_xxx hook that does flushing? 2. We can approach this in a Hyper-V specific way, using the dedicated flush hypercall, which is what the following RFC patch does. This hypercall acts as a broadcast INVEPT.single. I believe that using the flush hypercall in flush_tlb_current() is sufficient to ensure the right semantics and correctness. The one thing I haven't made up my mind about yet is whether we could still use a flush of the current root on migration or not - I can imagine at most an INVEPT.single, I also haven't yet figured out how that could be plumbed in if it's really necessary (can't put it in KVM_REQ_TLB_FLUSH because that would break the assumption that it is stronger than KVM_REQ_TLB_FLUSH_CURRENT). With 2. the performance is comparable to EPT=N on Intel, roughly 20s for the test scenario. Let me know what you think about this and if you have any suggestions. Best wishes, Jeremi [^1]: https://lore.kernel.org/kvm/YQljNBBp%2FEousNBk@google.com/ Jeremi Piotrowski (1): KVM: VMX: Use Hyper-V EPT flush for local TLB flushes arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++--- arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++ arch/x86/kvm/x86.c | 3 +++ 4 files changed, 27 insertions(+), 3 deletions(-) -- 2.39.5 ^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-06-20 15:39 [RFC PATCH 0/1] Tweak TLB flushing when VMX is running on Hyper-V Jeremi Piotrowski @ 2025-06-20 15:39 ` Jeremi Piotrowski 2025-06-27 8:31 ` Vitaly Kuznetsov 0 siblings, 1 reply; 12+ messages in thread From: Jeremi Piotrowski @ 2025-06-20 15:39 UTC (permalink / raw) To: Sean Christopherson, Vitaly Kuznetsov, Paolo Bonzini, kvm Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Jeremi Piotrowski Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes. This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU migration. The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a shadow page table for the nested hypervisor (KVM) and has to invalidate all EPT roots when invept-global is issued. This has a performance impact on all nested VMs. KVM issues KVM_REQ_TLB_FLUSH on CPU migration, and under load the performance hit causes vCPUs to use up more of their slice of CPU time, leading to more CPU migrations. This has a snowball effect and causes CPU usage spikes. By issuing the hypercall we are now guaranteed that any root modification that requires a local TLB flush becomes visible to all CPUs. The same hypercall is already used in kvm_arch_flush_remote_tlbs and kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are flushed locally on alloc and we achieve consistency on migration by flushing all roots - the new behavior of achieving consistency on alloc on Hyper-V is a superset of the expected guarantees. This makes the KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V. Coincidentally - we now match the behavior of SVM on Hyper-V. Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++--- arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++ arch/x86/kvm/x86.c | 3 +++ 4 files changed, 27 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b4a391929cdb..d3acab19f425 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1077,6 +1077,7 @@ struct kvm_vcpu_arch { #if IS_ENABLED(CONFIG_HYPERV) hpa_t hv_root_tdp; + bool hv_vmx_use_flush_guest_mapping; #endif }; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 4953846cb30d..f537e0df56fc 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1485,8 +1485,12 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) /* * Flush all EPTP/VPID contexts, the new pCPU may have stale * TLB entries from its previous association with the vCPU. + * Unless we are running on Hyper-V where we promotes local TLB + * flushes to be visible across all CPUs so no need to do again + * on migration. */ - kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); + if (!vmx_hv_use_flush_guest_mapping(vcpu)) + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); /* * Linux uses per-cpu TSS and GDT, so set these when switching @@ -3243,11 +3247,21 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) if (!VALID_PAGE(root_hpa)) return; - if (enable_ept) + if (enable_ept) { + /* + * hyperv_flush_guest_mapping() has the semantics of + * invept-single across all pCPUs. This makes root + * modifications consistent across pCPUs, so an invept-global + * on migration is no longer required. + */ + if (vmx_hv_use_flush_guest_mapping(vcpu)) + return (void)WARN_ON_ONCE(hyperv_flush_guest_mapping(root_hpa)); + ept_sync_context(construct_eptp(vcpu, root_hpa, mmu->root_role.level)); - else + } else { vpid_sync_context(vmx_get_current_vpid(vcpu)); + } } void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr) diff --git a/arch/x86/kvm/vmx/vmx_onhyperv.h b/arch/x86/kvm/vmx/vmx_onhyperv.h index cdf8cbb69209..a5c64c90e49e 100644 --- a/arch/x86/kvm/vmx/vmx_onhyperv.h +++ b/arch/x86/kvm/vmx/vmx_onhyperv.h @@ -119,6 +119,11 @@ static inline void evmcs_load(u64 phys_addr) } void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf); + +static inline bool vmx_hv_use_flush_guest_mapping(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.hv_vmx_use_flush_guest_mapping; +} #else /* !IS_ENABLED(CONFIG_HYPERV) */ static __always_inline bool kvm_is_using_evmcs(void) { return false; } static __always_inline void evmcs_write64(unsigned long field, u64 value) {} @@ -128,6 +133,7 @@ static __always_inline u64 evmcs_read64(unsigned long field) { return 0; } static __always_inline u32 evmcs_read32(unsigned long field) { return 0; } static __always_inline u16 evmcs_read16(unsigned long field) { return 0; } static inline void evmcs_load(u64 phys_addr) {} +static inline bool vmx_hv_use_flush_guest_mapping(struct kvm_vcpu *vcpu) { return false; } #endif /* IS_ENABLED(CONFIG_HYPERV) */ #endif /* __ARCH_X86_KVM_VMX_ONHYPERV_H__ */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b58a74c1722d..cbde795096a6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -25,6 +25,7 @@ #include "tss.h" #include "kvm_cache_regs.h" #include "kvm_emulate.h" +#include "kvm_onhyperv.h" #include "mmu/page_track.h" #include "x86.h" #include "cpuid.h" @@ -12390,6 +12391,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) #if IS_ENABLED(CONFIG_HYPERV) vcpu->arch.hv_root_tdp = INVALID_PAGE; + vcpu->arch.hv_vmx_use_flush_guest_mapping = + (kvm_x86_ops.flush_remote_tlbs == hv_flush_remote_tlbs); #endif r = kvm_x86_call(vcpu_create)(vcpu); -- 2.39.5 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-06-20 15:39 ` [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes Jeremi Piotrowski @ 2025-06-27 8:31 ` Vitaly Kuznetsov 2025-07-02 16:11 ` Jeremi Piotrowski 0 siblings, 1 reply; 12+ messages in thread From: Vitaly Kuznetsov @ 2025-06-27 8:31 UTC (permalink / raw) To: Jeremi Piotrowski, Sean Christopherson, Paolo Bonzini, kvm Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Jeremi Piotrowski Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes: > Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes. > This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to > all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU > migration. > > The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a > shadow page table for the nested hypervisor (KVM) and has to invalidate all > EPT roots when invept-global is issued. This has a performance impact on > all nested VMs. KVM issues KVM_REQ_TLB_FLUSH on CPU migration, and under > load the performance hit causes vCPUs to use up more of their slice of CPU > time, leading to more CPU migrations. This has a snowball effect and causes > CPU usage spikes. > > By issuing the hypercall we are now guaranteed that any root modification > that requires a local TLB flush becomes visible to all CPUs. The same > hypercall is already used in kvm_arch_flush_remote_tlbs and > kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are > flushed locally on alloc and we achieve consistency on migration by > flushing all roots - the new behavior of achieving consistency on alloc on > Hyper-V is a superset of the expected guarantees. This makes the > KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V. Sounds reasonable overall, my only concern (not sure if valid or not) is that using the hypercall for local flushes is going to be more expensive than invept-context we do today and thus while the performance is improved for the scenario when vCPUs are migrating a lot, we will take a hit in other cases. > > Coincidentally - we now match the behavior of SVM on Hyper-V. > > Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++--- > arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++ > arch/x86/kvm/x86.c | 3 +++ > 4 files changed, 27 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index b4a391929cdb..d3acab19f425 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -1077,6 +1077,7 @@ struct kvm_vcpu_arch { > > #if IS_ENABLED(CONFIG_HYPERV) > hpa_t hv_root_tdp; > + bool hv_vmx_use_flush_guest_mapping; > #endif > }; > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 4953846cb30d..f537e0df56fc 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -1485,8 +1485,12 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) > /* > * Flush all EPTP/VPID contexts, the new pCPU may have stale > * TLB entries from its previous association with the vCPU. > + * Unless we are running on Hyper-V where we promotes local TLB s,promotes,promote, or, as Sean doesn't like pronouns, "... where local TLB flushes are promoted ..." > + * flushes to be visible across all CPUs so no need to do again > + * on migration. > */ > - kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); > + if (!vmx_hv_use_flush_guest_mapping(vcpu)) > + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); > > /* > * Linux uses per-cpu TSS and GDT, so set these when switching > @@ -3243,11 +3247,21 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) > if (!VALID_PAGE(root_hpa)) > return; > > - if (enable_ept) > + if (enable_ept) { > + /* > + * hyperv_flush_guest_mapping() has the semantics of > + * invept-single across all pCPUs. This makes root > + * modifications consistent across pCPUs, so an invept-global > + * on migration is no longer required. > + */ > + if (vmx_hv_use_flush_guest_mapping(vcpu)) > + return (void)WARN_ON_ONCE(hyperv_flush_guest_mapping(root_hpa)); > + HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it affects all processors. Is there any visible perfomance impact of this change when there are no migrations (e.g. with vCPU pinning)? Or do we believe that Hyper-V actually handles invept-context the exact same way? > ept_sync_context(construct_eptp(vcpu, root_hpa, > mmu->root_role.level)); > - else > + } else { > vpid_sync_context(vmx_get_current_vpid(vcpu)); > + } > } > > void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr) > diff --git a/arch/x86/kvm/vmx/vmx_onhyperv.h b/arch/x86/kvm/vmx/vmx_onhyperv.h > index cdf8cbb69209..a5c64c90e49e 100644 > --- a/arch/x86/kvm/vmx/vmx_onhyperv.h > +++ b/arch/x86/kvm/vmx/vmx_onhyperv.h > @@ -119,6 +119,11 @@ static inline void evmcs_load(u64 phys_addr) > } > > void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf); > + > +static inline bool vmx_hv_use_flush_guest_mapping(struct kvm_vcpu *vcpu) > +{ > + return vcpu->arch.hv_vmx_use_flush_guest_mapping; > +} > #else /* !IS_ENABLED(CONFIG_HYPERV) */ > static __always_inline bool kvm_is_using_evmcs(void) { return false; } > static __always_inline void evmcs_write64(unsigned long field, u64 value) {} > @@ -128,6 +133,7 @@ static __always_inline u64 evmcs_read64(unsigned long field) { return 0; } > static __always_inline u32 evmcs_read32(unsigned long field) { return 0; } > static __always_inline u16 evmcs_read16(unsigned long field) { return 0; } > static inline void evmcs_load(u64 phys_addr) {} > +static inline bool vmx_hv_use_flush_guest_mapping(struct kvm_vcpu *vcpu) { return false; } > #endif /* IS_ENABLED(CONFIG_HYPERV) */ > > #endif /* __ARCH_X86_KVM_VMX_ONHYPERV_H__ */ > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index b58a74c1722d..cbde795096a6 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -25,6 +25,7 @@ > #include "tss.h" > #include "kvm_cache_regs.h" > #include "kvm_emulate.h" > +#include "kvm_onhyperv.h" > #include "mmu/page_track.h" > #include "x86.h" > #include "cpuid.h" > @@ -12390,6 +12391,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > > #if IS_ENABLED(CONFIG_HYPERV) > vcpu->arch.hv_root_tdp = INVALID_PAGE; > + vcpu->arch.hv_vmx_use_flush_guest_mapping = > + (kvm_x86_ops.flush_remote_tlbs == hv_flush_remote_tlbs); > #endif > > r = kvm_x86_call(vcpu_create)(vcpu); -- Vitaly ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-06-27 8:31 ` Vitaly Kuznetsov @ 2025-07-02 16:11 ` Jeremi Piotrowski 2025-07-09 15:46 ` Vitaly Kuznetsov 0 siblings, 1 reply; 12+ messages in thread From: Jeremi Piotrowski @ 2025-07-02 16:11 UTC (permalink / raw) To: Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini, kvm Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv On 27/06/2025 10:31, Vitaly Kuznetsov wrote: > Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes: > >> Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes. >> This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to >> all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU >> migration. >> >> The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a >> shadow page table for the nested hypervisor (KVM) and has to invalidate all >> EPT roots when invept-global is issued. This has a performance impact on >> all nested VMs. KVM issues KVM_REQ_TLB_FLUSH on CPU migration, and under >> load the performance hit causes vCPUs to use up more of their slice of CPU >> time, leading to more CPU migrations. This has a snowball effect and causes >> CPU usage spikes. >> >> By issuing the hypercall we are now guaranteed that any root modification >> that requires a local TLB flush becomes visible to all CPUs. The same >> hypercall is already used in kvm_arch_flush_remote_tlbs and >> kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are >> flushed locally on alloc and we achieve consistency on migration by >> flushing all roots - the new behavior of achieving consistency on alloc on >> Hyper-V is a superset of the expected guarantees. This makes the >> KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V. > > Sounds reasonable overall, my only concern (not sure if valid or not) is > that using the hypercall for local flushes is going to be more expensive > than invept-context we do today and thus while the performance is > improved for the scenario when vCPUs are migrating a lot, we will take a > hit in other cases. > Discussion below, I think the impact should be limited and will try to quantify it. >> >> Coincidentally - we now match the behavior of SVM on Hyper-V. >> >> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> >> --- >> arch/x86/include/asm/kvm_host.h | 1 + >> arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++--- >> arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++ >> arch/x86/kvm/x86.c | 3 +++ >> 4 files changed, 27 insertions(+), 3 deletions(-) >> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h >> index b4a391929cdb..d3acab19f425 100644 >> --- a/arch/x86/include/asm/kvm_host.h >> +++ b/arch/x86/include/asm/kvm_host.h >> @@ -1077,6 +1077,7 @@ struct kvm_vcpu_arch { >> >> #if IS_ENABLED(CONFIG_HYPERV) >> hpa_t hv_root_tdp; >> + bool hv_vmx_use_flush_guest_mapping; >> #endif >> }; >> >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c >> index 4953846cb30d..f537e0df56fc 100644 >> --- a/arch/x86/kvm/vmx/vmx.c >> +++ b/arch/x86/kvm/vmx/vmx.c >> @@ -1485,8 +1485,12 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) >> /* >> * Flush all EPTP/VPID contexts, the new pCPU may have stale >> * TLB entries from its previous association with the vCPU. >> + * Unless we are running on Hyper-V where we promotes local TLB > > s,promotes,promote, or, as Sean doesn't like pronouns, > > "... where local TLB flushes are promoted ..." > Will do. >> + * flushes to be visible across all CPUs so no need to do again >> + * on migration. >> */ >> - kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); >> + if (!vmx_hv_use_flush_guest_mapping(vcpu)) >> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); >> >> /* >> * Linux uses per-cpu TSS and GDT, so set these when switching >> @@ -3243,11 +3247,21 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) >> if (!VALID_PAGE(root_hpa)) >> return; >> >> - if (enable_ept) >> + if (enable_ept) { >> + /* >> + * hyperv_flush_guest_mapping() has the semantics of >> + * invept-single across all pCPUs. This makes root >> + * modifications consistent across pCPUs, so an invept-global >> + * on migration is no longer required. >> + */ >> + if (vmx_hv_use_flush_guest_mapping(vcpu)) >> + return (void)WARN_ON_ONCE(hyperv_flush_guest_mapping(root_hpa)); >> + > > HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it > affects all processors. Is there any visible perfomance impact of this > change when there are no migrations (e.g. with vCPU pinning)? Or do we > believe that Hyper-V actually handles invept-context the exact same way? > I'm going to have to do some more investigation to answer that - do you have an idea of a workload that would be sensitive to tlb flushes that I could compare this on? In terms of cost, Hyper-V needs to invalidate the VMs shadow page table for a root and do the tlb flush. The first part is CPU intensive but is the same in both cases (hypercall and invept-single). The tlb flush part will require a bit more work for the hypercall as it needs to happen on all cores, and the tlb will now be empty for that root. My assumption is that these local tlb flushes are rather rare as they will only happen when: - new root is allocated - we need to switch to a special root So not very frequent post vm boot (with or without pinning). And the effect of the tlb being empty for that root on other CPUs should be a neutral, as users of the root would have performed the same local flush at a later point in time (when using it). All the other mmu updates use kvm_flush_remote_tlbs* which already go through the hypercall. Jeremi ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-07-02 16:11 ` Jeremi Piotrowski @ 2025-07-09 15:46 ` Vitaly Kuznetsov 2025-07-15 0:39 ` Sean Christopherson 0 siblings, 1 reply; 12+ messages in thread From: Vitaly Kuznetsov @ 2025-07-09 15:46 UTC (permalink / raw) To: Jeremi Piotrowski Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Sean Christopherson, Paolo Bonzini, kvm Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes: > On 27/06/2025 10:31, Vitaly Kuznetsov wrote: >> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes: >> >>> Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes. >>> This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to >>> all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU >>> migration. >>> >>> The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a >>> shadow page table for the nested hypervisor (KVM) and has to invalidate all >>> EPT roots when invept-global is issued. This has a performance impact on >>> all nested VMs. KVM issues KVM_REQ_TLB_FLUSH on CPU migration, and under >>> load the performance hit causes vCPUs to use up more of their slice of CPU >>> time, leading to more CPU migrations. This has a snowball effect and causes >>> CPU usage spikes. >>> >>> By issuing the hypercall we are now guaranteed that any root modification >>> that requires a local TLB flush becomes visible to all CPUs. The same >>> hypercall is already used in kvm_arch_flush_remote_tlbs and >>> kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are >>> flushed locally on alloc and we achieve consistency on migration by >>> flushing all roots - the new behavior of achieving consistency on alloc on >>> Hyper-V is a superset of the expected guarantees. This makes the >>> KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V. >> >> Sounds reasonable overall, my only concern (not sure if valid or not) is >> that using the hypercall for local flushes is going to be more expensive >> than invept-context we do today and thus while the performance is >> improved for the scenario when vCPUs are migrating a lot, we will take a >> hit in other cases. >> > Sorry for delayed reply! .... >>> return; >>> >>> - if (enable_ept) >>> + if (enable_ept) { >>> + /* >>> + * hyperv_flush_guest_mapping() has the semantics of >>> + * invept-single across all pCPUs. This makes root >>> + * modifications consistent across pCPUs, so an invept-global >>> + * on migration is no longer required. >>> + */ >>> + if (vmx_hv_use_flush_guest_mapping(vcpu)) >>> + return (void)WARN_ON_ONCE(hyperv_flush_guest_mapping(root_hpa)); >>> + >> >> HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it >> affects all processors. Is there any visible perfomance impact of this >> change when there are no migrations (e.g. with vCPU pinning)? Or do we >> believe that Hyper-V actually handles invept-context the exact same way? >> > I'm going to have to do some more investigation to answer that - do you have an > idea of a workload that would be sensitive to tlb flushes that I could compare > this on? > > In terms of cost, Hyper-V needs to invalidate the VMs shadow page table for a root > and do the tlb flush. The first part is CPU intensive but is the same in both cases > (hypercall and invept-single). The tlb flush part will require a bit more work for > the hypercall as it needs to happen on all cores, and the tlb will now be empty > for that root. > > My assumption is that these local tlb flushes are rather rare as they will > only happen when: > - new root is allocated > - we need to switch to a special root > KVM's MMU is an amazing maze so I'd appreciate if someone more knowledgeble corrects me;t my understanding is that we call *_flush_tlb_current() from two places: kvm_mmu_load() and this covers the two cases above. These should not be common under normal circumstances but can be frequent in some special cases, e.g. when running a nested setup. Given that we're already running on top of Hyper-V, this means 3+ level nesting which I don't believe anyone really cares about. kvm_vcpu_flush_tlb_current() from KVM_REQ_TLB_FLUSH_CURRENT. These are things like some CR4 writes, APIC mode changes, ... which also shouldn't be that common but VM boot time can be affected. So I'd suggest to test big VM startup time, i.e. take the biggest available instance type on Azure and measure how much time it takes to boot a VM which has the same vCPU count. Honestly, I don't expect to see a significant change but I guess it's still worth checking. > So not very frequent post vm boot (with or without pinning). And the effect of the > tlb being empty for that root on other CPUs should be a neutral, as users of the > root would have performed the same local flush at a later point in > time (when using it). > > All the other mmu updates use kvm_flush_remote_tlbs* which already go > through the hypercall. -- Vitaly ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-07-09 15:46 ` Vitaly Kuznetsov @ 2025-07-15 0:39 ` Sean Christopherson 2025-08-04 15:49 ` Vitaly Kuznetsov 0 siblings, 1 reply; 12+ messages in thread From: Sean Christopherson @ 2025-07-15 0:39 UTC (permalink / raw) To: Vitaly Kuznetsov Cc: Jeremi Piotrowski, Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini, kvm On Wed, Jul 09, 2025, Vitaly Kuznetsov wrote: > Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes: > > > On 27/06/2025 10:31, Vitaly Kuznetsov wrote: > >> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes: > >> > >>> Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes. > >>> This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to > >>> all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU > >>> migration. > >>> > >>> The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a > >>> shadow page table for the nested hypervisor (KVM) and has to invalidate all > >>> EPT roots when invept-global is issued. What all does "invalidate" mean here? E.g. is this "just" a hardware TLB flush, or is Hyper-V blasting away and rebuilding page tables? If it's the latter, that seems like a Hyper-V bug/problem. > >>> This has a performance impact on all nested VMs. KVM issues > >>> KVM_REQ_TLB_FLUSH on CPU migration, and under load the performance hit > >>> causes vCPUs to use up more of their slice of CPU time, leading to more > >>> CPU migrations. This has a snowball effect and causes CPU usage spikes. Is the performance hit due to flushing *hardware* TLBs, or due to Hyper-V needing to rebuilding shadow page tables? > >>> By issuing the hypercall we are now guaranteed that any root modification > >>> that requires a local TLB flush becomes visible to all CPUs. The same > >>> hypercall is already used in kvm_arch_flush_remote_tlbs and > >>> kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are > >>> flushed locally on alloc and we achieve consistency on migration by > >>> flushing all roots - the new behavior of achieving consistency on alloc on > >>> Hyper-V is a superset of the expected guarantees. This makes the > >>> KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V. > >> > >> Sounds reasonable overall, my only concern (not sure if valid or not) is > >> that using the hypercall for local flushes is going to be more expensive > >> than invept-context we do today and thus while the performance is > >> improved for the scenario when vCPUs are migrating a lot, we will take a > >> hit in other cases. > >> > > > > Sorry for delayed reply! > > .... > > >>> return; > >>> > >>> - if (enable_ept) > >>> + if (enable_ept) { > >>> + /* > >>> + * hyperv_flush_guest_mapping() has the semantics of > >>> + * invept-single across all pCPUs. This makes root > >>> + * modifications consistent across pCPUs, so an invept-global > >>> + * on migration is no longer required. Unfortunately, this isn't quite right. If vCPU0 and vCPU1 share an EPT root, APIC virtualization is enabled, and vCPU0 is running with x2APIC but vCPU1 is running with xAPIC, then KVM needs to flush TLBs if vCPU1 is loaded on a "new" pCPU, because vCPU0 could have inserted non-vAPIC TLB entries for that pCPU. Hrm, but KVM doesn't actually handle that properly. KVM only forces a TLB flush if the vCPU wasn't already loaded, so if vCPU0 and vCPU1 are running on the same pCPU, i.e. vCPU1 isn't being migrated to the pCPU that was previously running vCPU0, then I believe vCPU1 could consume stale TLB entries. Setting that aside for the moment, I would much prefer to elide this TLB flush whenver possible, irrespective of whether KVM is running on bare metal or in a VM, and irrespective of the host hypervisor. And then if/when SVM is converted to use per-vCPU ASIDs[*], give SVM the exact same treatment. More below. [*] https://lore.kernel.org/all/aFXrFKvZcJ3dN4k_@google.com > >> HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it > >> affects all processors. Is there any visible perfomance impact of this > >> change when there are no migrations (e.g. with vCPU pinning)? Or do we > >> believe that Hyper-V actually handles invept-context the exact same way? > >> > > I'm going to have to do some more investigation to answer that - do you have an > > idea of a workload that would be sensitive to tlb flushes that I could compare > > this on? > > > > In terms of cost, Hyper-V needs to invalidate the VMs shadow page table for a root > > and do the tlb flush. The first part is CPU intensive but is the same in both cases > > (hypercall and invept-single). The tlb flush part will require a bit more work for > > the hypercall as it needs to happen on all cores, and the tlb will now be empty > > for that root. > > > > My assumption is that these local tlb flushes are rather rare as they will > > only happen when: > > - new root is allocated > > - we need to switch to a special root > > > > KVM's MMU is an amazing maze so I'd appreciate if someone more > knowledgeble corrects me;t my understanding is that we call > *_flush_tlb_current() from two places: > > kvm_mmu_load() and this covers the two cases above. These should not be > common under normal circumstances but can be frequent in some special > cases, e.g. when running a nested setup. Given that we're already > running on top of Hyper-V, this means 3+ level nesting which I don't > believe anyone really cares about. Heh, don't be too sure about that. People just love running "containers" inside VMs, without thinking too hard about what they're doing :-) In general, I don't like effectively turning KVM_REQ_TLB_FLUSH_CURRENT into kvm_flush_remote_tlbs(), and I *really* don't like doing so for one specific setup. It's hard enough to capture the differences between KVM's various TLB flushes hooks/requests, and special casing KVM-on-Hyper-V is just asking for unique bugs. Conceptually, I _think_ this is pretty straightforward: when a root is allocated, flush the root on all *pCPUs*. KVM currently flushes the root when a vCPU first uses a root, which necessitates flushing on migration. Alternatively, KVM could track which pCPUs a vCPU has run on, but that would get expensive, and blasting a flush on alloc should be much simpler. The two wrinkles I can think of are the x2APIC vs. xAPIC problem above (which I think needs to be handled no matter what), and CPU hotplug (which is easy enough to handle, I just didn't type it up). It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with root_to_sp()), but for the core concept/skeleton, I think this is it? diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6e838cb6c9e1..298130445182 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu) } EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots); +struct kvm_tlb_flush_root { + struct kvm *kvm; + hpa_t root; +}; + +static void kvm_flush_tlb_root(void *__data) +{ + struct kvm_tlb_flush_root *data = __data; + + kvm_x86_call(flush_tlb_root)(data->kvm, data->root); +} + +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root) +{ + struct kvm_tlb_flush_root data = { + .kvm = kvm, + .root = __pa(root->spt), + }; + + /* + * Flush any TLB entries for the new root, the provenance of the root + * is unknown. Even if KVM ensures there are no stale TLB entries + * for a freed root, in theory another hypervisor could have left + * stale entries. Flushing on alloc also allows KVM to skip the TLB + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is + * migrated to a different pCPU. + */ + on_each_cpu(kvm_flush_tlb_root, &data, 1); +} + static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level) { @@ -3852,7 +3883,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, WARN_ON_ONCE(role.direct && role.has_4_byte_gpte); sp = kvm_mmu_get_shadow_page(vcpu, gfn, role); - ++sp->root_count; + if (!sp->root_count++) + kvm_mmu_flush_all_tlbs_root(vcpu->kvm, sp); return __pa(sp->spt); } @@ -5961,15 +5993,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) kvm_mmu_sync_roots(vcpu); kvm_mmu_load_pgd(vcpu); - - /* - * Flush any TLB entries for the new root, the provenance of the root - * is unknown. Even if KVM ensures there are no stale TLB entries - * for a freed root, in theory another hypervisor could have left - * stale entries. Flushing on alloc also allows KVM to skip the TLB - * flush when freeing a root (see kvm_tdp_mmu_put_root()). - */ - kvm_x86_call(flush_tlb_current)(vcpu); out: return r; } diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 65f3c89d7c5d..3cbf0d612f5e 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -167,6 +167,8 @@ static inline bool is_mirror_sp(const struct kvm_mmu_page *sp) return sp->role.is_mirror; } +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root); + static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) { /* diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7f3d7229b2c1..3ff36d09b4fa 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -302,6 +302,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror) */ refcount_set(&root->tdp_mmu_root_count, 2); list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots); + kvm_mmu_flush_all_tlbs_root(vcpu->kvm, root); out_spin_unlock: spin_unlock(&kvm->arch.tdp_mmu_pages_lock); ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-07-15 0:39 ` Sean Christopherson @ 2025-08-04 15:49 ` Vitaly Kuznetsov 2025-08-04 23:09 ` Sean Christopherson 0 siblings, 1 reply; 12+ messages in thread From: Vitaly Kuznetsov @ 2025-08-04 15:49 UTC (permalink / raw) To: Sean Christopherson Cc: Jeremi Piotrowski, Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini, kvm Sean Christopherson <seanjc@google.com> writes: ... > > It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the > level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with > root_to_sp()), but for the core concept/skeleton, I think this is it? > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6e838cb6c9e1..298130445182 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu) > } > EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots); > > +struct kvm_tlb_flush_root { > + struct kvm *kvm; > + hpa_t root; > +}; > + > +static void kvm_flush_tlb_root(void *__data) > +{ > + struct kvm_tlb_flush_root *data = __data; > + > + kvm_x86_call(flush_tlb_root)(data->kvm, data->root); > +} > + > +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root) > +{ > + struct kvm_tlb_flush_root data = { > + .kvm = kvm, > + .root = __pa(root->spt), > + }; > + > + /* > + * Flush any TLB entries for the new root, the provenance of the root > + * is unknown. Even if KVM ensures there are no stale TLB entries > + * for a freed root, in theory another hypervisor could have left > + * stale entries. Flushing on alloc also allows KVM to skip the TLB > + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing > + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is > + * migrated to a different pCPU. > + */ > + on_each_cpu(kvm_flush_tlb_root, &data, 1); Would it make sense to complement this with e.g. a CPU mask tracking all the pCPUs where the VM has ever been seen running (+ a flush when a new one is added to it)? I'm worried about the potential performance impact for a case when a huge host is running a lot of small VMs in 'partitioning' mode (i.e. when all vCPUs are pinned). Additionally, this may have a negative impact on RT use-cases where each unnecessary interruption can be seen problematic. > +} > + > static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, > u8 level) > { > @@ -3852,7 +3883,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, > WARN_ON_ONCE(role.direct && role.has_4_byte_gpte); > > sp = kvm_mmu_get_shadow_page(vcpu, gfn, role); > - ++sp->root_count; > + if (!sp->root_count++) > + kvm_mmu_flush_all_tlbs_root(vcpu->kvm, sp); > > return __pa(sp->spt); > } > @@ -5961,15 +5993,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) > kvm_mmu_sync_roots(vcpu); > > kvm_mmu_load_pgd(vcpu); > - > - /* > - * Flush any TLB entries for the new root, the provenance of the root > - * is unknown. Even if KVM ensures there are no stale TLB entries > - * for a freed root, in theory another hypervisor could have left > - * stale entries. Flushing on alloc also allows KVM to skip the TLB > - * flush when freeing a root (see kvm_tdp_mmu_put_root()). > - */ > - kvm_x86_call(flush_tlb_current)(vcpu); > out: > return r; > } > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > index 65f3c89d7c5d..3cbf0d612f5e 100644 > --- a/arch/x86/kvm/mmu/mmu_internal.h > +++ b/arch/x86/kvm/mmu/mmu_internal.h > @@ -167,6 +167,8 @@ static inline bool is_mirror_sp(const struct kvm_mmu_page *sp) > return sp->role.is_mirror; > } > > +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root); > + > static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) > { > /* > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 7f3d7229b2c1..3ff36d09b4fa 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -302,6 +302,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror) > */ > refcount_set(&root->tdp_mmu_root_count, 2); > list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots); > + kvm_mmu_flush_all_tlbs_root(vcpu->kvm, root); > > out_spin_unlock: > spin_unlock(&kvm->arch.tdp_mmu_pages_lock); > -- Vitaly ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-08-04 15:49 ` Vitaly Kuznetsov @ 2025-08-04 23:09 ` Sean Christopherson 2025-08-05 18:04 ` Jeremi Piotrowski 0 siblings, 1 reply; 12+ messages in thread From: Sean Christopherson @ 2025-08-04 23:09 UTC (permalink / raw) To: Vitaly Kuznetsov Cc: Jeremi Piotrowski, Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini, kvm On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote: > Sean Christopherson <seanjc@google.com> writes: > > It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the > > level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with > > root_to_sp()), but for the core concept/skeleton, I think this is it? > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index 6e838cb6c9e1..298130445182 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu) > > } > > EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots); > > > > +struct kvm_tlb_flush_root { > > + struct kvm *kvm; > > + hpa_t root; > > +}; > > + > > +static void kvm_flush_tlb_root(void *__data) > > +{ > > + struct kvm_tlb_flush_root *data = __data; > > + > > + kvm_x86_call(flush_tlb_root)(data->kvm, data->root); > > +} > > + > > +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root) > > +{ > > + struct kvm_tlb_flush_root data = { > > + .kvm = kvm, > > + .root = __pa(root->spt), > > + }; > > + > > + /* > > + * Flush any TLB entries for the new root, the provenance of the root > > + * is unknown. Even if KVM ensures there are no stale TLB entries > > + * for a freed root, in theory another hypervisor could have left > > + * stale entries. Flushing on alloc also allows KVM to skip the TLB > > + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing > > + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is > > + * migrated to a different pCPU. > > + */ > > + on_each_cpu(kvm_flush_tlb_root, &data, 1); > > Would it make sense to complement this with e.g. a CPU mask tracking all > the pCPUs where the VM has ever been seen running (+ a flush when a new > one is added to it)? > > I'm worried about the potential performance impact for a case when a > huge host is running a lot of small VMs in 'partitioning' mode > (i.e. when all vCPUs are pinned). Additionally, this may have a negative > impact on RT use-cases where each unnecessary interruption can be seen > problematic. Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few vCPUs using nested TDP could generate a lot of noist IRQs through a VM. Hrm. So I think the basic idea is so flawed/garbage that even enhancing it with per-VM pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask though, but instead of using a mask to scope IPIs, use it to elide TLB flushes. With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time: SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each TDP MMU root seems reasonable. Then on task migration, instead of doing a global INVEPT, only INVEPT the current and prev_roots (because getting a new root will trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU has already done a flush for the root. Or we could do the optimized tracking for all roots. x86 supports at most 8192 CPUs, which means 1KiB per root. That doesn't seem at all painful given that each shadow pages consumes 4KiB... ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-08-04 23:09 ` Sean Christopherson @ 2025-08-05 18:04 ` Jeremi Piotrowski 2025-08-05 23:42 ` Sean Christopherson 0 siblings, 1 reply; 12+ messages in thread From: Jeremi Piotrowski @ 2025-08-05 18:04 UTC (permalink / raw) To: Sean Christopherson, Vitaly Kuznetsov Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini, kvm On 05/08/2025 01:09, Sean Christopherson wrote: > On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote: >> Sean Christopherson <seanjc@google.com> writes: >>> It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the >>> level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with >>> root_to_sp()), but for the core concept/skeleton, I think this is it? >>> >>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c >>> index 6e838cb6c9e1..298130445182 100644 >>> --- a/arch/x86/kvm/mmu/mmu.c >>> +++ b/arch/x86/kvm/mmu/mmu.c >>> @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu) >>> } >>> EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots); >>> >>> +struct kvm_tlb_flush_root { >>> + struct kvm *kvm; >>> + hpa_t root; >>> +}; >>> + >>> +static void kvm_flush_tlb_root(void *__data) >>> +{ >>> + struct kvm_tlb_flush_root *data = __data; >>> + >>> + kvm_x86_call(flush_tlb_root)(data->kvm, data->root); >>> +} >>> + >>> +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root) >>> +{ >>> + struct kvm_tlb_flush_root data = { >>> + .kvm = kvm, >>> + .root = __pa(root->spt), >>> + }; >>> + >>> + /* >>> + * Flush any TLB entries for the new root, the provenance of the root >>> + * is unknown. Even if KVM ensures there are no stale TLB entries >>> + * for a freed root, in theory another hypervisor could have left >>> + * stale entries. Flushing on alloc also allows KVM to skip the TLB >>> + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing >>> + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is >>> + * migrated to a different pCPU. >>> + */ >>> + on_each_cpu(kvm_flush_tlb_root, &data, 1); >> >> Would it make sense to complement this with e.g. a CPU mask tracking all >> the pCPUs where the VM has ever been seen running (+ a flush when a new >> one is added to it)? >> >> I'm worried about the potential performance impact for a case when a >> huge host is running a lot of small VMs in 'partitioning' mode >> (i.e. when all vCPUs are pinned). Additionally, this may have a negative >> impact on RT use-cases where each unnecessary interruption can be seen >> problematic. > > Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few > vCPUs using nested TDP could generate a lot of noist IRQs through a VM. Hrm. > > So I think the basic idea is so flawed/garbage that even enhancing it with per-VM > pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask > though, but instead of using a mask to scope IPIs, use it to elide TLB flushes. Sorry for the delay in replying, I've been sidetracked a bit. I like this idea more, not special casing the TLB flushing approach per hypervisor is preferable. > > With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time: > SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each > TDP MMU root seems reasonable. Then on task migration, instead of doing a global > INVEPT, only INVEPT the current and prev_roots (because getting a new root will > trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU > has already done a flush for the root. Just to make sure I follow: current+prev_roots do you mean literally those (i.e. cached prev roots) or all roots on kvm->arch.tdp_mmu_roots? So this would mean: on pCPU migration, check if current mmu has is_tdp_mmu_active() and then perform the INVEPT-single over roots instead of INVEPT-global. Otherwise stick to the KVM_REQ_TLB_FLUSH. Would there need to be a check for is_guest_mode(), or that the switch is coming from the vmx/nested.c? I suppose not because nested doesn't seem to use TDP MMU. > > Or we could do the optimized tracking for all roots. x86 supports at most 8192 > CPUs, which means 1KiB per root. That doesn't seem at all painful given that > each shadow pages consumes 4KiB... Similar question here: which all roots would need to be tracked+flushed for shadow paging? pae_roots? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-08-05 18:04 ` Jeremi Piotrowski @ 2025-08-05 23:42 ` Sean Christopherson 2025-08-15 13:49 ` Jeremi Piotrowski 0 siblings, 1 reply; 12+ messages in thread From: Sean Christopherson @ 2025-08-05 23:42 UTC (permalink / raw) To: Jeremi Piotrowski Cc: Vitaly Kuznetsov, Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini, kvm [-- Attachment #1: Type: text/plain, Size: 5181 bytes --] On Tue, Aug 05, 2025, Jeremi Piotrowski wrote: > On 05/08/2025 01:09, Sean Christopherson wrote: > > On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote: > >> Sean Christopherson <seanjc@google.com> writes: > >>> +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root) > >>> +{ > >>> + struct kvm_tlb_flush_root data = { > >>> + .kvm = kvm, > >>> + .root = __pa(root->spt), > >>> + }; > >>> + > >>> + /* > >>> + * Flush any TLB entries for the new root, the provenance of the root > >>> + * is unknown. Even if KVM ensures there are no stale TLB entries > >>> + * for a freed root, in theory another hypervisor could have left > >>> + * stale entries. Flushing on alloc also allows KVM to skip the TLB > >>> + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing > >>> + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is > >>> + * migrated to a different pCPU. > >>> + */ > >>> + on_each_cpu(kvm_flush_tlb_root, &data, 1); > >> > >> Would it make sense to complement this with e.g. a CPU mask tracking all > >> the pCPUs where the VM has ever been seen running (+ a flush when a new > >> one is added to it)? > >> > >> I'm worried about the potential performance impact for a case when a > >> huge host is running a lot of small VMs in 'partitioning' mode > >> (i.e. when all vCPUs are pinned). Additionally, this may have a negative > >> impact on RT use-cases where each unnecessary interruption can be seen > >> problematic. > > > > Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few > > vCPUs using nested TDP could generate a lot of noist IRQs through a VM. Hrm. > > > > So I think the basic idea is so flawed/garbage that even enhancing it with per-VM > > pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask > > though, but instead of using a mask to scope IPIs, use it to elide TLB flushes. > > Sorry for the delay in replying, I've been sidetracked a bit. No worries, I guarantee my delays will make your delays pale in comparison :-D > I like this idea more, not special casing the TLB flushing approach per hypervisor is > preferable. > > > > > With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time: > > SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each > > TDP MMU root seems reasonable. Then on task migration, instead of doing a global > > INVEPT, only INVEPT the current and prev_roots (because getting a new root will > > trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU > > has already done a flush for the root. > > Just to make sure I follow: current+prev_roots do you mean literally those > (i.e. cached prev roots) or all roots on kvm->arch.tdp_mmu_roots? The former, i.e. "root" and all "prev_roots" entries in a kvm_mmu structure. > So this would mean: on pCPU migration, check if current mmu has is_tdp_mmu_active() > and then perform the INVEPT-single over roots instead of INVEPT-global. Otherwise stick > to the KVM_REQ_TLB_FLUSH. No, KVM would still need to ensure shadow roots are flushed as well, because KVM doesn't flush TLBs when switching to a previous root (see fast_pgd_switch()). More at the bottom. > Would there need to be a check for is_guest_mode(), or that the switch is > coming from the vmx/nested.c? I suppose not because nested doesn't seem to > use TDP MMU. Nested can use the TDP MMU, though there's practically no code in KVM that explicitly deals with this scenario. If L1 is using legacy shadow paging, i.e. is NOT using EPT/NPT, then KVM will use the TDP MMU to map L2 (with kvm_mmu_page_role.guest_mode=1 to differentiate from the L1 TDP MMU). > > Or we could do the optimized tracking for all roots. x86 supports at most 8192 > > CPUs, which means 1KiB per root. That doesn't seem at all painful given that > > each shadow pages consumes 4KiB... > > Similar question here: which all roots would need to be tracked+flushed for shadow > paging? pae_roots? Same general answer, "root" and all "prev_roots" entries. KVM uses up to two "struct kvm_mmu" instances to actually map memory into the guest: root_mmu and guest_mmu. The third instance, nested_mmu, is used to model gva->gpa translations for L2, i.e. is used only to walk L2 stage-1 page tables, and is never used to map memory into the guest, i.e. can't have entries in hardware TLBs. The basic gist is to add a cpumask in each root, and then elide TLB flushes on pCPU migration if KVM has flushed the root at least once. Patch 5/5 in the attached set of patches provides a *very* rough sketch. Hopefully its enough to convey the core idea. Patches 1-4 compile, but are otherwise untested. I'll post patches 1-3 as a small series once their tested, as those cleanups are worth doing irrespective of any optimizations we make to pCPU migration. P.S. everyone and their mother thinks guest_mmu and nested_mmu are terrible names, but no one has come up with names good enough to convince everyone to get out from behind the bikeshed :-) [-- Attachment #2: 0001-KVM-VMX-Hoist-construct_eptp-up-in-vmx.c.patch --] [-- Type: text/x-diff, Size: 1723 bytes --] From 8d0e63076371b04ca018577238b6d9b4e6cb1834 Mon Sep 17 00:00:00 2001 From: Sean Christopherson <seanjc@google.com> Date: Tue, 5 Aug 2025 14:29:19 -0700 Subject: [PATCH 1/5] KVM: VMX: Hoist construct_eptp() "up" in vmx.c Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/vmx/vmx.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index aa157fe5b7b3..9533eabc2182 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3188,6 +3188,20 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu) return to_vmx(vcpu)->vpid; } +u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) +{ + u64 eptp = VMX_EPTP_MT_WB; + + eptp |= (root_level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4; + + if (enable_ept_ad_bits && + (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu))) + eptp |= VMX_EPTP_AD_ENABLE_BIT; + eptp |= root_hpa; + + return eptp; +} + void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu = vcpu->arch.mmu; @@ -3365,20 +3379,6 @@ static int vmx_get_max_ept_level(void) return 4; } -u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) -{ - u64 eptp = VMX_EPTP_MT_WB; - - eptp |= (root_level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4; - - if (enable_ept_ad_bits && - (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu))) - eptp |= VMX_EPTP_AD_ENABLE_BIT; - eptp |= root_hpa; - - return eptp; -} - void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) { struct kvm *kvm = vcpu->kvm; base-commit: 196d9e72c4b0bd68b74a4ec7f52d248f37d0f030 -- 2.50.1.565.gc32cd1483b-goog [-- Attachment #3: 0002-KVM-nVMX-Hardcode-dummy-EPTP-used-for-early-nested-c.patch --] [-- Type: text/x-diff, Size: 2483 bytes --] From 2ca5f9bccff0458dab303d1929b9e13e869b7c85 Mon Sep 17 00:00:00 2001 From: Sean Christopherson <seanjc@google.com> Date: Tue, 5 Aug 2025 14:29:46 -0700 Subject: [PATCH 2/5] KVM: nVMX: Hardcode dummy EPTP used for early nested consistency checks Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/vmx/nested.c | 8 +++----- arch/x86/kvm/vmx/vmx.c | 2 +- arch/x86/kvm/vmx/vmx.h | 1 - 3 files changed, 4 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index b8ea1969113d..f3f5da3ee2cc 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2278,13 +2278,11 @@ static void prepare_vmcs02_constant_state(struct vcpu_vmx *vmx) vmx->nested.vmcs02_initialized = true; /* - * We don't care what the EPTP value is we just need to guarantee - * it's valid so we don't get a false positive when doing early - * consistency checks. + * If early consistency checks are enabled, stuff the EPT Pointer with + * dummy *legal* value to avoid false positives on bad control state. */ if (enable_ept && nested_early_check) - vmcs_write64(EPT_POINTER, - construct_eptp(&vmx->vcpu, 0, PT64_ROOT_4LEVEL)); + vmcs_write64(EPT_POINTER, VMX_EPTP_MT_WB | VMX_EPTP_PWL_4); if (vmx->ve_info) vmcs_write64(VE_INFORMATION_ADDRESS, __pa(vmx->ve_info)); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9533eabc2182..8fc114e6fa56 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3188,7 +3188,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu) return to_vmx(vcpu)->vpid; } -u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) +static u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) { u64 eptp = VMX_EPTP_MT_WB; diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index d3389baf3ab3..7c3f8b908c69 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -366,7 +366,6 @@ void set_cr4_guest_host_mask(struct vcpu_vmx *vmx); void ept_save_pdptrs(struct kvm_vcpu *vcpu); void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); -u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level); bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu); void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu); -- 2.50.1.565.gc32cd1483b-goog [-- Attachment #4: 0003-KVM-VMX-Use-kvm_mmu_page-role-to-construct-EPTP-not-.patch --] [-- Type: text/x-diff, Size: 2624 bytes --] From f79f76040166e741261e5f819ed23595922a8ba2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson <seanjc@google.com> Date: Tue, 5 Aug 2025 14:46:31 -0700 Subject: [PATCH 3/5] KVM: VMX: Use kvm_mmu_page role to construct EPTP, not current vCPU state Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/vmx/vmx.c | 37 ++++++++++++++++++++++++++----------- 1 file changed, 26 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 8fc114e6fa56..2408aae01837 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3188,20 +3188,36 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu) return to_vmx(vcpu)->vpid; } -static u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) +static u64 construct_eptp(hpa_t root_hpa) { - u64 eptp = VMX_EPTP_MT_WB; + struct kvm_mmu_page *root = root_to_sp(root_hpa); + u64 eptp = root_hpa | VMX_EPTP_MT_WB; - eptp |= (root_level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4; + /* + * EPT roots should always have an associated MMU page. Return a "bad" + * EPTP to induce VM-Fail instead of continuing on in a unknown state. + */ + if (WARN_ON_ONCE(!root)) + return INVALID_PAGE; - if (enable_ept_ad_bits && - (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu))) + eptp |= (root->role.level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4; + + if (enable_ept_ad_bits && !root->role.ad_disabled) eptp |= VMX_EPTP_AD_ENABLE_BIT; - eptp |= root_hpa; return eptp; } +static void vmx_flush_tlb_ept_root(hpa_t root_hpa) +{ + u64 eptp = construct_eptp(root_hpa); + + if (VALID_PAGE(eptp)) + ept_sync_context(eptp); + else + ept_sync_global(); +} + void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu = vcpu->arch.mmu; @@ -3212,8 +3228,7 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) return; if (enable_ept) - ept_sync_context(construct_eptp(vcpu, root_hpa, - mmu->root_role.level)); + vmx_flush_tlb_ept_root(root_hpa); else vpid_sync_context(vmx_get_current_vpid(vcpu)); } @@ -3384,11 +3399,11 @@ void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) struct kvm *kvm = vcpu->kvm; bool update_guest_cr3 = true; unsigned long guest_cr3; - u64 eptp; if (enable_ept) { - eptp = construct_eptp(vcpu, root_hpa, root_level); - vmcs_write64(EPT_POINTER, eptp); + KVM_MMU_WARN_ON(!root_to_sp(root_hpa) || + root_level != root_to_sp(root_hpa)->role.level); + vmcs_write64(EPT_POINTER, construct_eptp(root_hpa)); hv_track_root_tdp(vcpu, root_hpa); -- 2.50.1.565.gc32cd1483b-goog [-- Attachment #5: 0004-KVM-VMX-Flush-only-active-EPT-roots-on-pCPU-migratio.patch --] [-- Type: text/x-diff, Size: 1998 bytes --] From 501f4c799f207a07933279485f76205f91e4537f Mon Sep 17 00:00:00 2001 From: Sean Christopherson <seanjc@google.com> Date: Tue, 5 Aug 2025 15:13:27 -0700 Subject: [PATCH 4/5] KVM: VMX: Flush only active EPT roots on pCPU migration Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/vmx/vmx.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 2408aae01837..b42747e2293d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1395,6 +1395,8 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) } } +static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu); + void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -1431,7 +1433,12 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) * Flush all EPTP/VPID contexts, the new pCPU may have stale * TLB entries from its previous association with the vCPU. */ - kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); + if (enable_ept) { + vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu); + vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu); + } else { + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); + } /* * Linux uses per-cpu TSS and GDT, so set these when switching @@ -3254,6 +3261,24 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu) vpid_sync_context(vmx_get_current_vpid(vcpu)); } +static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa) +{ + if (!VALID_PAGE(root_hpa)) + return; + + vmx_flush_tlb_ept_root(root_hpa); +} + +static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu) +{ + int i; + + __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa); + + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) + __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa); +} + void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu = vcpu->arch.walk_mmu; -- 2.50.1.565.gc32cd1483b-goog [-- Attachment #6: 0005-KVM-VMX-Sketch-in-possible-framework-for-eliding-TLB.patch --] [-- Type: text/x-diff, Size: 3484 bytes --] From ca798b2e1de4d0975ee808108c7514fe738f0898 Mon Sep 17 00:00:00 2001 From: Sean Christopherson <seanjc@google.com> Date: Tue, 5 Aug 2025 15:58:13 -0700 Subject: [PATCH 5/5] KVM: VMX: Sketch in possible framework for eliding TLB flushes on pCPU migration Not-Signed-off-by: Sean Christopherson <seanjc@google.com> (anyone that makes this work deserves full credit) --- arch/x86/kvm/mmu/mmu.c | 3 +++ arch/x86/kvm/mmu/tdp_mmu.c | 2 ++ arch/x86/kvm/vmx/vmx.c | 21 ++++++++++++++------- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6e838cb6c9e1..925efbaae9b9 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3854,6 +3854,9 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, sp = kvm_mmu_get_shadow_page(vcpu, gfn, role); ++sp->root_count; + if (level >= PT64_ROOT_4LEVEL) + kvm_x86_call(alloc_root_cpu_mask)(root); + return __pa(sp->spt); } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7f3d7229b2c1..bf4b0b9a7816 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -293,6 +293,8 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror) root = tdp_mmu_alloc_sp(vcpu); tdp_mmu_init_sp(root, NULL, 0, role); + kvm_x86_call(alloc_root_cpu_mask)(root); + /* * TDP MMU roots are kept until they are explicitly invalidated, either * by a memslot update or by the destruction of the VM. Initialize the diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index b42747e2293d..e85830189cfc 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1395,7 +1395,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) } } -static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu); +static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu); void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) { @@ -1434,8 +1434,8 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) * TLB entries from its previous association with the vCPU. */ if (enable_ept) { - vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu); - vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu); + vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu, cpu); + vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu, cpu); } else { kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); } @@ -3261,22 +3261,29 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu) vpid_sync_context(vmx_get_current_vpid(vcpu)); } -static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa) +static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa, int cpu) { + struct kvm_mmu_page *root; + if (!VALID_PAGE(root_hpa)) return; + root = root_to_sp(root_hpa); + if (!WARN_ON_ONCE(!root) && + test_and_set_bit(cpu, root->cpu_flushed_mask)) + return; + vmx_flush_tlb_ept_root(root_hpa); } -static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu) +static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu) { int i; - __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa); + __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa, cpu); for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) - __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa); + __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa, cpu); } void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu) -- 2.50.1.565.gc32cd1483b-goog ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-08-05 23:42 ` Sean Christopherson @ 2025-08-15 13:49 ` Jeremi Piotrowski 2025-08-19 22:50 ` Sean Christopherson 0 siblings, 1 reply; 12+ messages in thread From: Jeremi Piotrowski @ 2025-08-15 13:49 UTC (permalink / raw) To: Sean Christopherson Cc: Vitaly Kuznetsov, Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini, kvm [-- Attachment #1: Type: text/plain, Size: 4470 bytes --] On Tue, Aug 05, 2025 at 04:42:46PM -0700, Sean Christopherson wrote: > On Tue, Aug 05, 2025, Jeremi Piotrowski wrote: > > On 05/08/2025 01:09, Sean Christopherson wrote: > > > On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote: > > >> Sean Christopherson <seanjc@google.com> writes: (snip) > > > > > > Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few > > > vCPUs using nested TDP could generate a lot of noist IRQs through a VM. Hrm. > > > > > > So I think the basic idea is so flawed/garbage that even enhancing it with per-VM > > > pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask > > > though, but instead of using a mask to scope IPIs, use it to elide TLB flushes. > > > > Sorry for the delay in replying, I've been sidetracked a bit. > > No worries, I guarantee my delays will make your delays pale in comparison :-D > > > I like this idea more, not special casing the TLB flushing approach per hypervisor is > > preferable. > > > > > > > > With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time: > > > SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each > > > TDP MMU root seems reasonable. Then on task migration, instead of doing a global > > > INVEPT, only INVEPT the current and prev_roots (because getting a new root will > > > trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU > > > has already done a flush for the root. > > > > Just to make sure I follow: current+prev_roots do you mean literally those > > (i.e. cached prev roots) or all roots on kvm->arch.tdp_mmu_roots? > > The former, i.e. "root" and all "prev_roots" entries in a kvm_mmu structure. > > > So this would mean: on pCPU migration, check if current mmu has is_tdp_mmu_active() > > and then perform the INVEPT-single over roots instead of INVEPT-global. Otherwise stick > > to the KVM_REQ_TLB_FLUSH. > > No, KVM would still need to ensure shadow roots are flushed as well, because KVM > doesn't flush TLBs when switching to a previous root (see fast_pgd_switch()). > More at the bottom. > > > Would there need to be a check for is_guest_mode(), or that the switch is > > coming from the vmx/nested.c? I suppose not because nested doesn't seem to > > use TDP MMU. > > Nested can use the TDP MMU, though there's practically no code in KVM that explicitly > deals with this scenario. If L1 is using legacy shadow paging, i.e. is NOT using > EPT/NPT, then KVM will use the TDP MMU to map L2 (with kvm_mmu_page_role.guest_mode=1 > to differentiate from the L1 TDP MMU). > > > > Or we could do the optimized tracking for all roots. x86 supports at most 8192 > > > CPUs, which means 1KiB per root. That doesn't seem at all painful given that > > > each shadow pages consumes 4KiB... > > > > Similar question here: which all roots would need to be tracked+flushed for shadow > > paging? pae_roots? > > Same general answer, "root" and all "prev_roots" entries. KVM uses up to two > "struct kvm_mmu" instances to actually map memory into the guest: root_mmu and > guest_mmu. The third instance, nested_mmu, is used to model gva->gpa translations > for L2, i.e. is used only to walk L2 stage-1 page tables, and is never used to > map memory into the guest, i.e. can't have entries in hardware TLBs. > > The basic gist is to add a cpumask in each root, and then elide TLB flushes on > pCPU migration if KVM has flushed the root at least once. Patch 5/5 in the attached > set of patches provides a *very* rough sketch. Hopefully its enough to convey the > core idea. > > Patches 1-4 compile, but are otherwise untested. I'll post patches 1-3 as a small > series once their tested, as those cleanups are worth doing irrespective of any > optimizations we make to pCPU migration. > Thanks for the detailed explanation and the patches Sean! I started working on extending patch 5, wanted to post it here to make sure I'm on the right track. It works in testing so far and shows promising performance - it gets rid of all the pathological cases I saw before. I haven't checked whether I broke SVM yet, and I need figure out a way to always keep the cpumask "offstack" so that we don't blow up every struct kvm_mmu_page instance with an inline cpumask - it needs to stay optional. I also came across kvm_mmu_is_dummy_root(), that check is included in root_to_sp(). Can you think of any other checks that we might need to handle? [-- Attachment #2: 0001-KVM-VMX-Sketch-in-possible-framework-for-eliding-TLB.patch --] [-- Type: text/x-diff, Size: 8063 bytes --] From 8fb6d18ad4cbdd1802df45be49358a6d6acf72a0 Mon Sep 17 00:00:00 2001 From: Sean Christopherson <seanjc@google.com> Date: Tue, 5 Aug 2025 15:58:13 -0700 Subject: [PATCH] KVM: VMX: Sketch in possible framework for eliding TLB flushes on pCPU migration Not-Signed-off-by: Sean Christopherson <seanjc@google.com> (anyone that makes this work deserves full credit) Not-yet-Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> --- arch/x86/include/asm/kvm-x86-ops.h | 1 + arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/kvm/mmu/mmu.c | 5 +++++ arch/x86/kvm/mmu/mmu_internal.h | 4 ++++ arch/x86/kvm/mmu/tdp_mmu.c | 4 ++++ arch/x86/kvm/vmx/main.c | 1 + arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++------- arch/x86/kvm/vmx/x86_ops.h | 1 + 8 files changed, 40 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index 8d50e3e0a19b..60351dd22f2f 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -99,6 +99,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt) KVM_X86_OP_OPTIONAL(set_external_spte) KVM_X86_OP_OPTIONAL(free_external_spt) KVM_X86_OP_OPTIONAL(remove_external_spte) +KVM_X86_OP_OPTIONAL(alloc_root_cpu_mask) KVM_X86_OP(has_wbinvd_exit) KVM_X86_OP(get_l2_tsc_offset) KVM_X86_OP(get_l2_tsc_multiplier) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b4a391929cdb..a3d415c3ea8b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1801,6 +1801,9 @@ struct kvm_x86_ops { void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level); + /* Allocate per-root pCPU flush mask. */ + void (*alloc_root_cpu_mask)(struct kvm_mmu_page *root); + /* Update external mapping with page table link. */ int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level, void *external_spt); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4e06e2e89a8f..721ee8ea76bd 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -20,6 +20,7 @@ #include "ioapic.h" #include "mmu.h" #include "mmu_internal.h" +#include <linux/cpumask.h> #include "tdp_mmu.h" #include "x86.h" #include "kvm_cache_regs.h" @@ -1820,6 +1821,7 @@ static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp) list_del(&sp->link); free_page((unsigned long)sp->spt); free_page((unsigned long)sp->shadowed_translation); + free_cpumask_var(sp->cpu_flushed_mask); kmem_cache_free(mmu_page_header_cache, sp); } @@ -3827,6 +3829,9 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, sp = kvm_mmu_get_shadow_page(vcpu, gfn, role); ++sp->root_count; + if (level >= PT64_ROOT_4LEVEL) + kvm_x86_call(alloc_root_cpu_mask)(sp); + return __pa(sp->spt); } diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index db8f33e4de62..5acb3dd34b36 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -7,6 +7,7 @@ #include <asm/kvm_host.h> #include "mmu.h" +#include <linux/cpumask.h> #ifdef CONFIG_KVM_PROVE_MMU #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x) @@ -145,6 +146,9 @@ struct kvm_mmu_page { /* Used for freeing the page asynchronously if it is a TDP MMU page. */ struct rcu_head rcu_head; #endif + + /* Mask tracking which host CPUs have flushed this EPT root */ + cpumask_var_t cpu_flushed_mask; }; extern struct kmem_cache *mmu_page_header_cache; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7f3d7229b2c1..40c7f46f553c 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -3,6 +3,7 @@ #include "mmu.h" #include "mmu_internal.h" +#include <linux/cpumask.h> #include "mmutrace.h" #include "tdp_iter.h" #include "tdp_mmu.h" @@ -57,6 +58,7 @@ static void tdp_mmu_free_sp(struct kvm_mmu_page *sp) { free_page((unsigned long)sp->external_spt); free_page((unsigned long)sp->spt); + free_cpumask_var(sp->cpu_flushed_mask); kmem_cache_free(mmu_page_header_cache, sp); } @@ -293,6 +295,8 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror) root = tdp_mmu_alloc_sp(vcpu); tdp_mmu_init_sp(root, NULL, 0, role); + kvm_x86_call(alloc_root_cpu_mask)(root); + /* * TDP MMU roots are kept until they are explicitly invalidated, either * by a memslot update or by the destruction of the VM. Initialize the diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c index d1e02e567b57..ec7f6899443d 100644 --- a/arch/x86/kvm/vmx/main.c +++ b/arch/x86/kvm/vmx/main.c @@ -1005,6 +1005,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = { .write_tsc_multiplier = vt_op(write_tsc_multiplier), .load_mmu_pgd = vt_op(load_mmu_pgd), + .alloc_root_cpu_mask = vmx_alloc_root_cpu_mask, .check_intercept = vmx_check_intercept, .handle_exit_irqoff = vmx_handle_exit_irqoff, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index eec2d866e7f1..a6d93624c2d4 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -28,6 +28,7 @@ #include <linux/slab.h> #include <linux/tboot.h> #include <linux/trace_events.h> +#include <linux/cpumask.h> #include <linux/entry-kvm.h> #include <asm/apic.h> @@ -62,6 +63,7 @@ #include "kvm_cache_regs.h" #include "lapic.h" #include "mmu.h" +#include "mmu/spte.h" #include "nested.h" #include "pmu.h" #include "sgx.h" @@ -1450,7 +1452,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) } } -static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu); +static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu); void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) { @@ -1489,8 +1491,8 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) * TLB entries from its previous association with the vCPU. */ if (enable_ept) { - vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu); - vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu); + vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu, cpu); + vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu, cpu); } else { kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); } @@ -3307,22 +3309,34 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu) vpid_sync_context(vmx_get_current_vpid(vcpu)); } -static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa) +void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root) { + WARN_ON_ONCE(!zalloc_cpumask_var(&root->cpu_flushed_mask, + GFP_KERNEL_ACCOUNT)); +} + +static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa, int cpu) +{ + struct kvm_mmu_page *root; + if (!VALID_PAGE(root_hpa)) return; + root = root_to_sp(root_hpa); + if (!root || cpumask_test_and_set_cpu(cpu, root->cpu_flushed_mask)) + return; + vmx_flush_tlb_ept_root(root_hpa); } -static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu) +static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu) { int i; - __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa); + __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa, cpu); for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) - __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa); + __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa, cpu); } void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h index b4596f651232..4406d53e6ebe 100644 --- a/arch/x86/kvm/vmx/x86_ops.h +++ b/arch/x86/kvm/vmx/x86_ops.h @@ -84,6 +84,7 @@ void vmx_flush_tlb_all(struct kvm_vcpu *vcpu); void vmx_flush_tlb_current(struct kvm_vcpu *vcpu); void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr); void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu); +void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root); void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask); u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu); void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall); -- 2.39.5 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes 2025-08-15 13:49 ` Jeremi Piotrowski @ 2025-08-19 22:50 ` Sean Christopherson 0 siblings, 0 replies; 12+ messages in thread From: Sean Christopherson @ 2025-08-19 22:50 UTC (permalink / raw) To: Jeremi Piotrowski Cc: Vitaly Kuznetsov, Dave Hansen, linux-kernel, alanjiang, chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini, kvm On Fri, Aug 15, 2025, Jeremi Piotrowski wrote: > On Tue, Aug 05, 2025 at 04:42:46PM -0700, Sean Christopherson wrote: > I started working on extending patch 5, wanted to post it here to make sure I'm > on the right track. > > It works in testing so far and shows promising performance - it gets rid of all > the pathological cases I saw before. Nice :-) > I haven't checked whether I broke SVM yet, and I need figure out a way to > always keep the cpumask "offstack" so that we don't blow up every struct > kvm_mmu_page instance with an inline cpumask - it needs to stay optional. Doh, I meant to include an idea or two for this in my earlier response. /The best I can come up with is > I also came across kvm_mmu_is_dummy_root(), that check is included in > root_to_sp(). Can you think of any other checks that we might need to handle? Don't think so? > @@ -3827,6 +3829,9 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, > sp = kvm_mmu_get_shadow_page(vcpu, gfn, role); > ++sp->root_count; > > + if (level >= PT64_ROOT_4LEVEL) Was this my code? If so, we should move this into the VMX code, because the fact that PAE roots can be ignored is really a detail of nested EPT, not the overall sceheme. > + kvm_x86_call(alloc_root_cpu_mask)(sp); Ah shoot. Allocating here won't work, because mmu_lock is held and allocating might sleep. I don't want to force an atomic allocation, because that can dip into pools that KVM really shouldn't use. The "standard" way KVM deals with this is to utilize a kvm_mmu_memory_cache. If we do that and add e.g kvm_vcpu_arch.mmu_roots_flushed_cache, then we trivially do the allocation in mmu_topup_memory_caches(). That would eliminate the error handling in vmx_alloc_root_cpu_mask(), and might make it slightly less awful to deal with the "offstack" cpumask. Hmm, and then instead of calling into VMX to do the allocation, maybe just have a flag to communicate that vendor code wants per-root flush tracking? I haven't thought hard about SVM, but I wouldn't be surprised if SVM ends up wanting the same functionality after we switch to per-vCPU ASIDs. > + > return __pa(sp->spt); > } ... > @@ -3307,22 +3309,34 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu) > vpid_sync_context(vmx_get_current_vpid(vcpu)); > } > > -static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa) > +void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root) > { This should be conditioned on enable_ept. > + WARN_ON_ONCE(!zalloc_cpumask_var(&root->cpu_flushed_mask, > + GFP_KERNEL_ACCOUNT)); > +} > + > +static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa, int cpu) > +{ > + struct kvm_mmu_page *root; > + > if (!VALID_PAGE(root_hpa)) > return; > > + root = root_to_sp(root_hpa); > + if (!root || cpumask_test_and_set_cpu(cpu, root->cpu_flushed_mask)) Hmm, this should flush if "root" is NULL, because the aforementioned "special" roots don't have a shadow page. But unless I'm missing an edge case (of an edge case), this particular code can WARN_ON_ONCE() since EPT should never need to use any of the special roots. We might need to filter out dummy roots somewhere to avoid false positives, but that should be easy enough. For the mask, it's probably worth splitting test_and_set into separate operations, as the common case will likely be that the root has been used on this pCPU. The test_and_set version will generate a LOCK BTS instruction, and so for the common case where the bit is already set, KVM will generate an atomic access, which can cause noise/bottlenecks E.g. if (WARN_ON_ONCE(!root)) goto flush; if (cpumask_test_cpu(cpu, root->cpu_flushed_mask)) return; cpumask_set_cpu(cpu, root->cpu_flushed_mask); flush: vmx_flush_tlb_ept_root(root_hpa); > + return; > + > vmx_flush_tlb_ept_root(root_hpa); > } > > -static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu) > +static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu) > { > int i; > > - __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa); > + __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa, cpu); > > for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) > - __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa); > + __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa, cpu); > } > > void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu) > diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h > index b4596f651232..4406d53e6ebe 100644 > --- a/arch/x86/kvm/vmx/x86_ops.h > +++ b/arch/x86/kvm/vmx/x86_ops.h > @@ -84,6 +84,7 @@ void vmx_flush_tlb_all(struct kvm_vcpu *vcpu); > void vmx_flush_tlb_current(struct kvm_vcpu *vcpu); > void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr); > void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu); > +void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root); > void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask); > u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu); > void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall); > -- > 2.39.5 > ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-08-19 22:51 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-06-20 15:39 [RFC PATCH 0/1] Tweak TLB flushing when VMX is running on Hyper-V Jeremi Piotrowski 2025-06-20 15:39 ` [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes Jeremi Piotrowski 2025-06-27 8:31 ` Vitaly Kuznetsov 2025-07-02 16:11 ` Jeremi Piotrowski 2025-07-09 15:46 ` Vitaly Kuznetsov 2025-07-15 0:39 ` Sean Christopherson 2025-08-04 15:49 ` Vitaly Kuznetsov 2025-08-04 23:09 ` Sean Christopherson 2025-08-05 18:04 ` Jeremi Piotrowski 2025-08-05 23:42 ` Sean Christopherson 2025-08-15 13:49 ` Jeremi Piotrowski 2025-08-19 22:50 ` Sean Christopherson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).