* [RFC PATCH 0/1] Tweak TLB flushing when VMX is running on Hyper-V
@ 2025-06-20 15:39 Jeremi Piotrowski
2025-06-20 15:39 ` [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes Jeremi Piotrowski
From: Jeremi Piotrowski @ 2025-06-20 15:39 UTC
To: Sean Christopherson, Vitaly Kuznetsov, Paolo Bonzini, kvm
Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma,
andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, linux-hyperv, Jeremi Piotrowski
Hi Sean/Paolo/Vitaly,
Wanted to get your opinion on this change. Let me first introduce the scenario:
we have been testing Kata Containers (containers wrapped in VMs) in Azure and
found some significant issues with TLB flushing. This is a popular workload
that requires launching many nested VMs quickly. When testing on a 64-core
Intel VM (D64s_v5, in case someone is wondering) and spinning up some 150-ish
nested VMs in parallel, performance gets worse the more nested VMs are already
running: CPU usage spikes to 100% on all cores and doesn't settle even once all
nested VMs have booted. On an idle system a single nested VM boots within
seconds, but once we have a couple dozen running (doing nothing inside), boot
time gets longer and longer for each new nested VM, they start hitting startup
timeouts, etc. In some cases we never reach the point where all nested VMs are
up and running.
Investigating the issue, we found that this can't be reproduced on AMD, nor on
Intel with EPT disabled. In both of those cases the scenario completes within
20s or so, and TDP MMU on or off makes no difference. With EPT=Y the same case
takes minutes. Out of curiosity I also ran the test case on an n4-standard-64
VM on GCP and found that EPT=Y runs in ~30s while EPT=N runs in ~20s (which I
found slightly interesting).
So that's when we started looking at the TLB flushing code and found that
INVEPT.global is used on every CPU migration, and that it's an expensive
operation on Hyper-V. It also has an impact on every running nested VM, so we
end up with lots of INVEPT.global calls - we reach 2000 calls/s before we're
essentially stuck at 100% guest time. That's why I'm looking at tweaking the
TLB flushing behavior to avoid it. I came across past discussions on this
topic ([^1]) and after some thinking see two options:
1. Do you see a way to optimize this generically, avoiding KVM_REQ_TLB_FLUSH
on migration in current KVM? When KVM itself runs nested I think we rarely see
CPU pinning used the way it is on bare metal, so migration is not that rare an
operation. Much has also changed since [^1], and with kvm_mmu_reset_context()
still being called in many paths we might be over-flushing. Perhaps a loop
flushing individual roots, for roles that do not have a post_set_xxx hook that
does the flushing? (A rough sketch of what I mean follows below the two
options.)
2. We can approach this in a Hyper-V specific way, using the dedicated flush
hypercall, which is what the following RFC patch does. This hypercall (exposed
to KVM as hyperv_flush_guest_mapping()) acts as a broadcast INVEPT.single. I
believe that using the flush hypercall in flush_tlb_current() is sufficient to
ensure the right semantics and correctness.
The one thing I haven't made up my mind about yet is whether we could still
use a flush of the current root on migration or not - I can imagine at most an
INVEPT.single. I also haven't yet figured out how that could be plumbed in if
it's really necessary (it can't go into KVM_REQ_TLB_FLUSH, because that would
break the assumption that KVM_REQ_TLB_FLUSH is stronger than
KVM_REQ_TLB_FLUSH_CURRENT).
With option 2 the performance is comparable to EPT=N on Intel: roughly 20s for
the test scenario.
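To illustrate option 1, here's roughly the loop I have in mind (a minimal,
untested sketch with approximate names - mmu->root, mmu->prev_roots and
KVM_MMU_NUM_PREV_ROOTS exist today, the per-root flush hook is hypothetical):

static void kvm_vcpu_flush_tlb_roots(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;
	int i;

	/* Flush the active root, i.e. INVEPT.single instead of .global. */
	if (VALID_PAGE(mmu->root.hpa))
		kvm_x86_call(flush_tlb_current)(vcpu);

	/* ...and each cached previous root individually. */
	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
		if (VALID_PAGE(mmu->prev_roots[i].hpa))
			flush_tlb_root(vcpu, mmu->prev_roots[i].hpa); /* hypothetical hook */
}

And for reference, the "stronger than" assumption in option 2 comes from the
request handling on the entry path, which (paraphrased from memory, not an
exact quote of x86.c) does:

	if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) {
		kvm_vcpu_flush_tlb_all(vcpu);
		/* Flushing all contexts also flushes the current one. */
		kvm_clear_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
	}
	kvm_service_local_tlb_flush_requests(vcpu);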
Let me know what you think about this and if you have any suggestions.
Best wishes,
Jeremi
[^1]: https://lore.kernel.org/kvm/YQljNBBp%2FEousNBk@google.com/
Jeremi Piotrowski (1):
KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++---
arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++
arch/x86/kvm/x86.c | 3 +++
4 files changed, 27 insertions(+), 3 deletions(-)
--
2.39.5
* [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-06-20 15:39 [RFC PATCH 0/1] Tweak TLB flushing when VMX is running on Hyper-V Jeremi Piotrowski
@ 2025-06-20 15:39 ` Jeremi Piotrowski
2025-06-27 8:31 ` Vitaly Kuznetsov
From: Jeremi Piotrowski @ 2025-06-20 15:39 UTC
To: Sean Christopherson, Vitaly Kuznetsov, Paolo Bonzini, kvm
Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma,
andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, linux-hyperv, Jeremi Piotrowski
Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes.
This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to
all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU
migration.
The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a
shadow page table for the nested hypervisor (KVM) and has to invalidate all
EPT roots when invept-global is issued. This has a performance impact on
all nested VMs. KVM issues KVM_REQ_TLB_FLUSH on CPU migration, and under
load the performance hit causes vCPUs to use up more of their slice of CPU
time, leading to more CPU migrations. This has a snowball effect and causes
CPU usage spikes.
By issuing the hypercall we are now guaranteed that any root modification
that requires a local TLB flush becomes visible to all CPUs. The same
hypercall is already used in kvm_arch_flush_remote_tlbs and
kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are
flushed locally on alloc and we achieve consistency on migration by
flushing all roots - the new behavior of achieving consistency on alloc on
Hyper-V is a superset of the expected guarantees. This makes the
KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V.
Coincidentally - we now match the behavior of SVM on Hyper-V.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++---
arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++
arch/x86/kvm/x86.c | 3 +++
4 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b4a391929cdb..d3acab19f425 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1077,6 +1077,7 @@ struct kvm_vcpu_arch {
#if IS_ENABLED(CONFIG_HYPERV)
hpa_t hv_root_tdp;
+ bool hv_vmx_use_flush_guest_mapping;
#endif
};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4953846cb30d..f537e0df56fc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1485,8 +1485,12 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
/*
* Flush all EPTP/VPID contexts, the new pCPU may have stale
* TLB entries from its previous association with the vCPU.
+ * Unless we are running on Hyper-V where we promotes local TLB
+ * flushes to be visible across all CPUs so no need to do again
+ * on migration.
*/
- kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+ if (!vmx_hv_use_flush_guest_mapping(vcpu))
+ kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
/*
* Linux uses per-cpu TSS and GDT, so set these when switching
@@ -3243,11 +3247,21 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
if (!VALID_PAGE(root_hpa))
return;
- if (enable_ept)
+ if (enable_ept) {
+ /*
+ * hyperv_flush_guest_mapping() has the semantics of
+ * invept-single across all pCPUs. This makes root
+ * modifications consistent across pCPUs, so an invept-global
+ * on migration is no longer required.
+ */
+ if (vmx_hv_use_flush_guest_mapping(vcpu))
+ return (void)WARN_ON_ONCE(hyperv_flush_guest_mapping(root_hpa));
+
ept_sync_context(construct_eptp(vcpu, root_hpa,
mmu->root_role.level));
- else
+ } else {
vpid_sync_context(vmx_get_current_vpid(vcpu));
+ }
}
void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
diff --git a/arch/x86/kvm/vmx/vmx_onhyperv.h b/arch/x86/kvm/vmx/vmx_onhyperv.h
index cdf8cbb69209..a5c64c90e49e 100644
--- a/arch/x86/kvm/vmx/vmx_onhyperv.h
+++ b/arch/x86/kvm/vmx/vmx_onhyperv.h
@@ -119,6 +119,11 @@ static inline void evmcs_load(u64 phys_addr)
}
void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf);
+
+static inline bool vmx_hv_use_flush_guest_mapping(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.hv_vmx_use_flush_guest_mapping;
+}
#else /* !IS_ENABLED(CONFIG_HYPERV) */
static __always_inline bool kvm_is_using_evmcs(void) { return false; }
static __always_inline void evmcs_write64(unsigned long field, u64 value) {}
@@ -128,6 +133,7 @@ static __always_inline u64 evmcs_read64(unsigned long field) { return 0; }
static __always_inline u32 evmcs_read32(unsigned long field) { return 0; }
static __always_inline u16 evmcs_read16(unsigned long field) { return 0; }
static inline void evmcs_load(u64 phys_addr) {}
+static inline bool vmx_hv_use_flush_guest_mapping(struct kvm_vcpu *vcpu) { return false; }
#endif /* IS_ENABLED(CONFIG_HYPERV) */
#endif /* __ARCH_X86_KVM_VMX_ONHYPERV_H__ */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b58a74c1722d..cbde795096a6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -25,6 +25,7 @@
#include "tss.h"
#include "kvm_cache_regs.h"
#include "kvm_emulate.h"
+#include "kvm_onhyperv.h"
#include "mmu/page_track.h"
#include "x86.h"
#include "cpuid.h"
@@ -12390,6 +12391,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
#if IS_ENABLED(CONFIG_HYPERV)
vcpu->arch.hv_root_tdp = INVALID_PAGE;
+ vcpu->arch.hv_vmx_use_flush_guest_mapping =
+ (kvm_x86_ops.flush_remote_tlbs == hv_flush_remote_tlbs);
#endif
r = kvm_x86_call(vcpu_create)(vcpu);
--
2.39.5
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-06-20 15:39 ` [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes Jeremi Piotrowski
@ 2025-06-27 8:31 ` Vitaly Kuznetsov
2025-07-02 16:11 ` Jeremi Piotrowski
From: Vitaly Kuznetsov @ 2025-06-27 8:31 UTC
To: Jeremi Piotrowski, Sean Christopherson, Paolo Bonzini, kvm
Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma,
andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, linux-hyperv, Jeremi Piotrowski
Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
> Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes.
> This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to
> all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU
> migration.
>
> The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a
> shadow page table for the nested hypervisor (KVM) and has to invalidate all
> EPT roots when invept-global is issued. This has a performance impact on
> all nested VMs. KVM issues KVM_REQ_TLB_FLUSH on CPU migration, and under
> load the performance hit causes vCPUs to use up more of their slice of CPU
> time, leading to more CPU migrations. This has a snowball effect and causes
> CPU usage spikes.
>
> By issuing the hypercall we are now guaranteed that any root modification
> that requires a local TLB flush becomes visible to all CPUs. The same
> hypercall is already used in kvm_arch_flush_remote_tlbs and
> kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are
> flushed locally on alloc and we achieve consistency on migration by
> flushing all roots - the new behavior of achieving consistency on alloc on
> Hyper-V is a superset of the expected guarantees. This makes the
> KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V.
Sounds reasonable overall, my only concern (not sure if valid or not) is
that using the hypercall for local flushes is going to be more expensive
than invept-context we do today and thus while the performance is
improved for the scenario when vCPUs are migrating a lot, we will take a
hit in other cases.
>
> Coincidentally - we now match the behavior of SVM on Hyper-V.
>
> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++---
> arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++
> arch/x86/kvm/x86.c | 3 +++
> 4 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b4a391929cdb..d3acab19f425 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1077,6 +1077,7 @@ struct kvm_vcpu_arch {
>
> #if IS_ENABLED(CONFIG_HYPERV)
> hpa_t hv_root_tdp;
> + bool hv_vmx_use_flush_guest_mapping;
> #endif
> };
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4953846cb30d..f537e0df56fc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1485,8 +1485,12 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
> /*
> * Flush all EPTP/VPID contexts, the new pCPU may have stale
> * TLB entries from its previous association with the vCPU.
> + * Unless we are running on Hyper-V where we promotes local TLB
s,promotes,promote, or, as Sean doesn't like pronouns,
"... where local TLB flushes are promoted ..."
> + * flushes to be visible across all CPUs so no need to do again
> + * on migration.
> */
> - kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> + if (!vmx_hv_use_flush_guest_mapping(vcpu))
> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>
> /*
> * Linux uses per-cpu TSS and GDT, so set these when switching
> @@ -3243,11 +3247,21 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
> if (!VALID_PAGE(root_hpa))
> return;
>
> - if (enable_ept)
> + if (enable_ept) {
> + /*
> + * hyperv_flush_guest_mapping() has the semantics of
> + * invept-single across all pCPUs. This makes root
> + * modifications consistent across pCPUs, so an invept-global
> + * on migration is no longer required.
> + */
> + if (vmx_hv_use_flush_guest_mapping(vcpu))
> + return (void)WARN_ON_ONCE(hyperv_flush_guest_mapping(root_hpa));
> +
HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it
affects all processors. Is there any visible performance impact of this
change when there are no migrations (e.g. with vCPU pinning)? Or do we
believe that Hyper-V actually handles invept-context the exact same way?
> ept_sync_context(construct_eptp(vcpu, root_hpa,
> mmu->root_role.level));
> - else
> + } else {
> vpid_sync_context(vmx_get_current_vpid(vcpu));
> + }
> }
>
> void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
> diff --git a/arch/x86/kvm/vmx/vmx_onhyperv.h b/arch/x86/kvm/vmx/vmx_onhyperv.h
> index cdf8cbb69209..a5c64c90e49e 100644
> --- a/arch/x86/kvm/vmx/vmx_onhyperv.h
> +++ b/arch/x86/kvm/vmx/vmx_onhyperv.h
> @@ -119,6 +119,11 @@ static inline void evmcs_load(u64 phys_addr)
> }
>
> void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf);
> +
> +static inline bool vmx_hv_use_flush_guest_mapping(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.hv_vmx_use_flush_guest_mapping;
> +}
> #else /* !IS_ENABLED(CONFIG_HYPERV) */
> static __always_inline bool kvm_is_using_evmcs(void) { return false; }
> static __always_inline void evmcs_write64(unsigned long field, u64 value) {}
> @@ -128,6 +133,7 @@ static __always_inline u64 evmcs_read64(unsigned long field) { return 0; }
> static __always_inline u32 evmcs_read32(unsigned long field) { return 0; }
> static __always_inline u16 evmcs_read16(unsigned long field) { return 0; }
> static inline void evmcs_load(u64 phys_addr) {}
> +static inline bool vmx_hv_use_flush_guest_mapping(struct kvm_vcpu *vcpu) { return false; }
> #endif /* IS_ENABLED(CONFIG_HYPERV) */
>
> #endif /* __ARCH_X86_KVM_VMX_ONHYPERV_H__ */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b58a74c1722d..cbde795096a6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -25,6 +25,7 @@
> #include "tss.h"
> #include "kvm_cache_regs.h"
> #include "kvm_emulate.h"
> +#include "kvm_onhyperv.h"
> #include "mmu/page_track.h"
> #include "x86.h"
> #include "cpuid.h"
> @@ -12390,6 +12391,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
> #if IS_ENABLED(CONFIG_HYPERV)
> vcpu->arch.hv_root_tdp = INVALID_PAGE;
> + vcpu->arch.hv_vmx_use_flush_guest_mapping =
> + (kvm_x86_ops.flush_remote_tlbs == hv_flush_remote_tlbs);
> #endif
>
> r = kvm_x86_call(vcpu_create)(vcpu);
--
Vitaly
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-06-27 8:31 ` Vitaly Kuznetsov
@ 2025-07-02 16:11 ` Jeremi Piotrowski
2025-07-09 15:46 ` Vitaly Kuznetsov
From: Jeremi Piotrowski @ 2025-07-02 16:11 UTC
To: Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini, kvm
Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma,
andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, linux-hyperv
On 27/06/2025 10:31, Vitaly Kuznetsov wrote:
> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
>
>> Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes.
>> This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to
>> all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU
>> migration.
>>
>> The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a
>> shadow page table for the nested hypervisor (KVM) and has to invalidate all
>> EPT roots when invept-global is issued. This has a performance impact on
>> all nested VMs. KVM issues KVM_REQ_TLB_FLUSH on CPU migration, and under
>> load the performance hit causes vCPUs to use up more of their slice of CPU
>> time, leading to more CPU migrations. This has a snowball effect and causes
>> CPU usage spikes.
>>
>> By issuing the hypercall we are now guaranteed that any root modification
>> that requires a local TLB flush becomes visible to all CPUs. The same
>> hypercall is already used in kvm_arch_flush_remote_tlbs and
>> kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are
>> flushed locally on alloc and we achieve consistency on migration by
>> flushing all roots - the new behavior of achieving consistency on alloc on
>> Hyper-V is a superset of the expected guarantees. This makes the
>> KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V.
>
> Sounds reasonable overall, my only concern (not sure if valid or not) is
> that using the hypercall for local flushes is going to be more expensive
> than invept-context we do today and thus while the performance is
> improved for the scenario when vCPUs are migrating a lot, we will take a
> hit in other cases.
>
Discussion below; I think the impact should be limited and will try to quantify it.
>>
>> Coincidentally - we now match the behavior of SVM on Hyper-V.
>>
>> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
>> ---
>> arch/x86/include/asm/kvm_host.h | 1 +
>> arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++---
>> arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++
>> arch/x86/kvm/x86.c | 3 +++
>> 4 files changed, 27 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index b4a391929cdb..d3acab19f425 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1077,6 +1077,7 @@ struct kvm_vcpu_arch {
>>
>> #if IS_ENABLED(CONFIG_HYPERV)
>> hpa_t hv_root_tdp;
>> + bool hv_vmx_use_flush_guest_mapping;
>> #endif
>> };
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 4953846cb30d..f537e0df56fc 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -1485,8 +1485,12 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
>> /*
>> * Flush all EPTP/VPID contexts, the new pCPU may have stale
>> * TLB entries from its previous association with the vCPU.
>> + * Unless we are running on Hyper-V where we promotes local TLB
>
> s,promotes,promote, or, as Sean doesn't like pronouns,
>
> "... where local TLB flushes are promoted ..."
>
Will do.
>> + * flushes to be visible across all CPUs so no need to do again
>> + * on migration.
>> */
>> - kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>> + if (!vmx_hv_use_flush_guest_mapping(vcpu))
>> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>>
>> /*
>> * Linux uses per-cpu TSS and GDT, so set these when switching
>> @@ -3243,11 +3247,21 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
>> if (!VALID_PAGE(root_hpa))
>> return;
>>
>> - if (enable_ept)
>> + if (enable_ept) {
>> + /*
>> + * hyperv_flush_guest_mapping() has the semantics of
>> + * invept-single across all pCPUs. This makes root
>> + * modifications consistent across pCPUs, so an invept-global
>> + * on migration is no longer required.
>> + */
>> + if (vmx_hv_use_flush_guest_mapping(vcpu))
>> + return (void)WARN_ON_ONCE(hyperv_flush_guest_mapping(root_hpa));
>> +
>
> HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it
> affects all processors. Is there any visible performance impact of this
> change when there are no migrations (e.g. with vCPU pinning)? Or do we
> believe that Hyper-V actually handles invept-context the exact same way?
>
I'm going to have to do some more investigation to answer that - do you have an
idea of a workload that would be sensitive to tlb flushes that I could compare
this on?
In terms of cost, Hyper-V needs to invalidate the VM's shadow page table for a root
and do the tlb flush. The first part is CPU intensive but is the same in both cases
(hypercall and invept-single). The tlb flush part will require a bit more work for
the hypercall as it needs to happen on all cores, and the tlb will now be empty
for that root.
My assumption is that these local tlb flushes are rather rare as they will
only happen when:
- new root is allocated
- we need to switch to a special root
So not very frequent post VM boot (with or without pinning). And the effect of the
tlb being empty for that root on other CPUs should be neutral, as users of the
root would have performed the same local flush at a later point in time (when using it).
All the other mmu updates use kvm_flush_remote_tlbs* which already go through the
hypercall.
Jeremi
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-07-02 16:11 ` Jeremi Piotrowski
@ 2025-07-09 15:46 ` Vitaly Kuznetsov
2025-07-15 0:39 ` Sean Christopherson
From: Vitaly Kuznetsov @ 2025-07-09 15:46 UTC
To: Jeremi Piotrowski
Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma,
andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, linux-hyperv, Sean Christopherson,
Paolo Bonzini, kvm
Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
> On 27/06/2025 10:31, Vitaly Kuznetsov wrote:
>> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
>>
>>> Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes.
>>> This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to
>>> all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU
>>> migration.
>>>
>>> The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a
>>> shadow page table for the nested hypervisor (KVM) and has to invalidate all
>>> EPT roots when invept-global is issued. This has a performance impact on
>>> all nested VMs. KVM issues KVM_REQ_TLB_FLUSH on CPU migration, and under
>>> load the performance hit causes vCPUs to use up more of their slice of CPU
>>> time, leading to more CPU migrations. This has a snowball effect and causes
>>> CPU usage spikes.
>>>
>>> By issuing the hypercall we are now guaranteed that any root modification
>>> that requires a local TLB flush becomes visible to all CPUs. The same
>>> hypercall is already used in kvm_arch_flush_remote_tlbs and
>>> kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are
>>> flushed locally on alloc and we achieve consistency on migration by
>>> flushing all roots - the new behavior of achieving consistency on alloc on
>>> Hyper-V is a superset of the expected guarantees. This makes the
>>> KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V.
>>
>> Sounds reasonable overall, my only concern (not sure if valid or not) is
>> that using the hypercall for local flushes is going to be more expensive
>> than invept-context we do today and thus while the performance is
>> improved for the scenario when vCPUs are migrating a lot, we will take a
>> hit in other cases.
>>
>
Sorry for the delayed reply!
....
>>> return;
>>>
>>> - if (enable_ept)
>>> + if (enable_ept) {
>>> + /*
>>> + * hyperv_flush_guest_mapping() has the semantics of
>>> + * invept-single across all pCPUs. This makes root
>>> + * modifications consistent across pCPUs, so an invept-global
>>> + * on migration is no longer required.
>>> + */
>>> + if (vmx_hv_use_flush_guest_mapping(vcpu))
>>> + return (void)WARN_ON_ONCE(hyperv_flush_guest_mapping(root_hpa));
>>> +
>>
>> HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it
>> affects all processors. Is there any visible performance impact of this
>> change when there are no migrations (e.g. with vCPU pinning)? Or do we
>> believe that Hyper-V actually handles invept-context the exact same way?
>>
> I'm going to have to do some more investigation to answer that - do you have an
> idea of a workload that would be sensitive to tlb flushes that I could compare
> this on?
>
> In terms of cost, Hyper-V needs to invalidate the VM's shadow page table for a root
> and do the tlb flush. The first part is CPU intensive but is the same in both cases
> (hypercall and invept-single). The tlb flush part will require a bit more work for
> the hypercall as it needs to happen on all cores, and the tlb will now be empty
> for that root.
>
> My assumption is that these local tlb flushes are rather rare as they will
> only happen when:
> - new root is allocated
> - we need to switch to a special root
>
KVM's MMU is an amazing maze so I'd appreciate if someone more
knowledgeable corrects me; my understanding is that we call
*_flush_tlb_current() from two places:
kvm_mmu_load() and this covers the two cases above. These should not be
common under normal circumstances but can be frequent in some special
cases, e.g. when running a nested setup. Given that we're already
running on top of Hyper-V, this means 3+ level nesting which I don't
believe anyone really cares about.
kvm_vcpu_flush_tlb_current() from KVM_REQ_TLB_FLUSH_CURRENT. These are
things like some CR4 writes, APIC mode changes, ... which also shouldn't
be that common but VM boot time can be affected. So I'd suggest to test
big VM startup time, i.e. take the biggest available instance type on
Azure and measure how much time it takes to boot a VM which has the same
vCPU count. Honestly, I don't expect to see a significant change but I
guess it's still worth checking.
> So not very frequent post VM boot (with or without pinning). And the effect of the
> tlb being empty for that root on other CPUs should be neutral, as users of the
> root would have performed the same local flush at a later point in
> time (when using it).
>
> All the other mmu updates use kvm_flush_remote_tlbs* which already go
> through the hypercall.
--
Vitaly
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-07-09 15:46 ` Vitaly Kuznetsov
@ 2025-07-15 0:39 ` Sean Christopherson
2025-08-04 15:49 ` Vitaly Kuznetsov
From: Sean Christopherson @ 2025-07-15 0:39 UTC
To: Vitaly Kuznetsov
Cc: Jeremi Piotrowski, Dave Hansen, linux-kernel, alanjiang,
chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini,
kvm
On Wed, Jul 09, 2025, Vitaly Kuznetsov wrote:
> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
>
> > On 27/06/2025 10:31, Vitaly Kuznetsov wrote:
> >> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
> >>
> >>> Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes.
> >>> This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to
> >>> all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU
> >>> migration.
> >>>
> >>> The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a
> >>> shadow page table for the nested hypervisor (KVM) and has to invalidate all
> >>> EPT roots when invept-global is issued.
What all does "invalidate" mean here? E.g. is this "just" a hardware TLB flush,
or is Hyper-V blasting away and rebuilding page tables? If it's the latter, that
seems like a Hyper-V bug/problem.
> >>> This has a performance impact on all nested VMs. KVM issues
> >>> KVM_REQ_TLB_FLUSH on CPU migration, and under load the performance hit
> >>> causes vCPUs to use up more of their slice of CPU time, leading to more
> >>> CPU migrations. This has a snowball effect and causes CPU usage spikes.
Is the performance hit due to flushing *hardware* TLBs, or due to Hyper-V needing
to rebuild shadow page tables?
> >>> By issuing the hypercall we are now guaranteed that any root modification
> >>> that requires a local TLB flush becomes visible to all CPUs. The same
> >>> hypercall is already used in kvm_arch_flush_remote_tlbs and
> >>> kvm_arch_flush_remote_tlbs_range. The KVM expectation is that roots are
> >>> flushed locally on alloc and we achieve consistency on migration by
> >>> flushing all roots - the new behavior of achieving consistency on alloc on
> >>> Hyper-V is a superset of the expected guarantees. This makes the
> >>> KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V.
> >>
> >> Sounds reasonable overall, my only concern (not sure if valid or not) is
> >> that using the hypercall for local flushes is going to be more expensive
> >> than invept-context we do today and thus while the performance is
> >> improved for the scenario when vCPUs are migrating a lot, we will take a
> >> hit in other cases.
> >>
> >
>
> Sorry for the delayed reply!
>
> ....
>
> >>> return;
> >>>
> >>> - if (enable_ept)
> >>> + if (enable_ept) {
> >>> + /*
> >>> + * hyperv_flush_guest_mapping() has the semantics of
> >>> + * invept-single across all pCPUs. This makes root
> >>> + * modifications consistent across pCPUs, so an invept-global
> >>> + * on migration is no longer required.
Unfortunately, this isn't quite right. If vCPU0 and vCPU1 share an EPT root,
APIC virtualization is enabled, and vCPU0 is running with x2APIC but vCPU1 is
running with xAPIC, then KVM needs to flush TLBs if vCPU1 is loaded on a "new"
pCPU, because vCPU0 could have inserted non-vAPIC TLB entries for that pCPU.
Hrm, but KVM doesn't actually handle that properly. KVM only forces a TLB flush
if the vCPU wasn't already loaded, so if vCPU0 and vCPU1 are running on the same
pCPU, i.e. vCPU1 isn't being migrated to the pCPU that was previously running
vCPU0, then I believe vCPU1 could consume stale TLB entries.
Setting that aside for the moment, I would much prefer to elide this TLB flush
whenever possible, irrespective of whether KVM is running on bare metal or in a
VM, and irrespective of the host hypervisor. And then if/when SVM is converted
to use per-vCPU ASIDs[*], give SVM the exact same treatment. More below.
[*] https://lore.kernel.org/all/aFXrFKvZcJ3dN4k_@google.com
> >> HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it
> >> affects all processors. Is there any visible perfomance impact of this
> >> change when there are no migrations (e.g. with vCPU pinning)? Or do we
> >> believe that Hyper-V actually handles invept-context the exact same way?
> >>
> > I'm going to have to do some more investigation to answer that - do you have an
> > idea of a workload that would be sensitive to tlb flushes that I could compare
> > this on?
> >
> > In terms of cost, Hyper-V needs to invalidate the VM's shadow page table for a root
> > and do the tlb flush. The first part is CPU intensive but is the same in both cases
> > (hypercall and invept-single). The tlb flush part will require a bit more work for
> > the hypercall as it needs to happen on all cores, and the tlb will now be empty
> > for that root.
> >
> > My assumption is that these local tlb flushes are rather rare as they will
> > only happen when:
> > - new root is allocated
> > - we need to switch to a special root
> >
>
> KVM's MMU is an amazing maze so I'd appreciate if someone more
> knowledgeable corrects me; my understanding is that we call
> *_flush_tlb_current() from two places:
>
> kvm_mmu_load() and this covers the two cases above. These should not be
> common under normal circumstances but can be frequent in some special
> cases, e.g. when running a nested setup. Given that we're already
> running on top of Hyper-V, this means 3+ level nesting which I don't
> believe anyone really cares about.
Heh, don't be too sure about that. People just love running "containers" inside
VMs, without thinking too hard about what they're doing :-)
In general, I don't like effectively turning KVM_REQ_TLB_FLUSH_CURRENT into
kvm_flush_remote_tlbs(), and I *really* don't like doing so for one specific
setup. It's hard enough to capture the differences between KVM's various TLB
flushes hooks/requests, and special casing KVM-on-Hyper-V is just asking for
unique bugs.
Conceptually, I _think_ this is pretty straightforward: when a root is allocated,
flush the root on all *pCPUs*. KVM currently flushes the root when a vCPU first
uses a root, which necessitates flushing on migration.
Alternatively, KVM could track which pCPUs a vCPU has run on, but that would get
expensive, and blasting a flush on alloc should be much simpler.
The two wrinkles I can think of are the x2APIC vs. xAPIC problem above (which I
think needs to be handled no matter what), and CPU hotplug (which is easy enough
to handle, I just didn't type it up).
It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the
level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with
root_to_sp()), but for the core concept/skeleton, I think this is it?
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e838cb6c9e1..298130445182 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
}
EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
+struct kvm_tlb_flush_root {
+ struct kvm *kvm;
+ hpa_t root;
+};
+
+static void kvm_flush_tlb_root(void *__data)
+{
+ struct kvm_tlb_flush_root *data = __data;
+
+ kvm_x86_call(flush_tlb_root)(data->kvm, data->root);
+}
+
+void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ struct kvm_tlb_flush_root data = {
+ .kvm = kvm,
+ .root = __pa(root->spt),
+ };
+
+ /*
+ * Flush any TLB entries for the new root, the provenance of the root
+ * is unknown. Even if KVM ensures there are no stale TLB entries
+ * for a freed root, in theory another hypervisor could have left
+ * stale entries. Flushing on alloc also allows KVM to skip the TLB
+ * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
+ * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
+ * migrated to a different pCPU.
+ */
+ on_each_cpu(kvm_flush_tlb_root, &data, 1);
+}
+
static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
u8 level)
{
@@ -3852,7 +3883,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
- ++sp->root_count;
+ if (!sp->root_count++)
+ kvm_mmu_flush_all_tlbs_root(vcpu->kvm, sp);
return __pa(sp->spt);
}
@@ -5961,15 +5993,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
kvm_mmu_sync_roots(vcpu);
kvm_mmu_load_pgd(vcpu);
-
- /*
- * Flush any TLB entries for the new root, the provenance of the root
- * is unknown. Even if KVM ensures there are no stale TLB entries
- * for a freed root, in theory another hypervisor could have left
- * stale entries. Flushing on alloc also allows KVM to skip the TLB
- * flush when freeing a root (see kvm_tdp_mmu_put_root()).
- */
- kvm_x86_call(flush_tlb_current)(vcpu);
out:
return r;
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 65f3c89d7c5d..3cbf0d612f5e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -167,6 +167,8 @@ static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
return sp->role.is_mirror;
}
+void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root);
+
static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..3ff36d09b4fa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -302,6 +302,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
*/
refcount_set(&root->tdp_mmu_root_count, 2);
list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
+ kvm_mmu_flush_all_tlbs_root(vcpu->kvm, root);
out_spin_unlock:
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-07-15 0:39 ` Sean Christopherson
@ 2025-08-04 15:49 ` Vitaly Kuznetsov
2025-08-04 23:09 ` Sean Christopherson
From: Vitaly Kuznetsov @ 2025-08-04 15:49 UTC
To: Sean Christopherson
Cc: Jeremi Piotrowski, Dave Hansen, linux-kernel, alanjiang,
chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini,
kvm
Sean Christopherson <seanjc@google.com> writes:
...
>
> It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the
> level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with
> root_to_sp()), but for the core concept/skeleton, I think this is it?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e838cb6c9e1..298130445182 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
>
> +struct kvm_tlb_flush_root {
> + struct kvm *kvm;
> + hpa_t root;
> +};
> +
> +static void kvm_flush_tlb_root(void *__data)
> +{
> + struct kvm_tlb_flush_root *data = __data;
> +
> + kvm_x86_call(flush_tlb_root)(data->kvm, data->root);
> +}
> +
> +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> + struct kvm_tlb_flush_root data = {
> + .kvm = kvm,
> + .root = __pa(root->spt),
> + };
> +
> + /*
> + * Flush any TLB entries for the new root, the provenance of the root
> + * is unknown. Even if KVM ensures there are no stale TLB entries
> + * for a freed root, in theory another hypervisor could have left
> + * stale entries. Flushing on alloc also allows KVM to skip the TLB
> + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
> + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
> + * migrated to a different pCPU.
> + */
> + on_each_cpu(kvm_flush_tlb_root, &data, 1);
Would it make sense to complement this with e.g. a CPU mask tracking all
the pCPUs where the VM has ever been seen running (+ a flush when a new
one is added to it)?
I'm worried about the potential performance impact for a case when a
huge host is running a lot of small VMs in 'partitioning' mode
(i.e. when all vCPUs are pinned). Additionally, this may have a negative
impact on RT use-cases where each unnecessary interruption can be seen as
problematic.
> +}
> +
> static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> u8 level)
> {
> @@ -3852,7 +3883,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
>
> sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> - ++sp->root_count;
> + if (!sp->root_count++)
> + kvm_mmu_flush_all_tlbs_root(vcpu->kvm, sp);
>
> return __pa(sp->spt);
> }
> @@ -5961,15 +5993,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> kvm_mmu_sync_roots(vcpu);
>
> kvm_mmu_load_pgd(vcpu);
> -
> - /*
> - * Flush any TLB entries for the new root, the provenance of the root
> - * is unknown. Even if KVM ensures there are no stale TLB entries
> - * for a freed root, in theory another hypervisor could have left
> - * stale entries. Flushing on alloc also allows KVM to skip the TLB
> - * flush when freeing a root (see kvm_tdp_mmu_put_root()).
> - */
> - kvm_x86_call(flush_tlb_current)(vcpu);
> out:
> return r;
> }
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 65f3c89d7c5d..3cbf0d612f5e 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -167,6 +167,8 @@ static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
> return sp->role.is_mirror;
> }
>
> +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root);
> +
> static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> {
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7f3d7229b2c1..3ff36d09b4fa 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -302,6 +302,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> */
> refcount_set(&root->tdp_mmu_root_count, 2);
> list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> + kvm_mmu_flush_all_tlbs_root(vcpu->kvm, root);
>
> out_spin_unlock:
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
--
Vitaly
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-08-04 15:49 ` Vitaly Kuznetsov
@ 2025-08-04 23:09 ` Sean Christopherson
2025-08-05 18:04 ` Jeremi Piotrowski
From: Sean Christopherson @ 2025-08-04 23:09 UTC
To: Vitaly Kuznetsov
Cc: Jeremi Piotrowski, Dave Hansen, linux-kernel, alanjiang,
chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini,
kvm
On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the
> > level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with
> > root_to_sp()), but for the core concept/skeleton, I think this is it?
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6e838cb6c9e1..298130445182 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
> > }
> > EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
> >
> > +struct kvm_tlb_flush_root {
> > + struct kvm *kvm;
> > + hpa_t root;
> > +};
> > +
> > +static void kvm_flush_tlb_root(void *__data)
> > +{
> > + struct kvm_tlb_flush_root *data = __data;
> > +
> > + kvm_x86_call(flush_tlb_root)(data->kvm, data->root);
> > +}
> > +
> > +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > + struct kvm_tlb_flush_root data = {
> > + .kvm = kvm,
> > + .root = __pa(root->spt),
> > + };
> > +
> > + /*
> > + * Flush any TLB entries for the new root, the provenance of the root
> > + * is unknown. Even if KVM ensures there are no stale TLB entries
> > + * for a freed root, in theory another hypervisor could have left
> > + * stale entries. Flushing on alloc also allows KVM to skip the TLB
> > + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
> > + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
> > + * migrated to a different pCPU.
> > + */
> > + on_each_cpu(kvm_flush_tlb_root, &data, 1);
>
> Would it make sense to complement this with e.g. a CPU mask tracking all
> the pCPUs where the VM has ever been seen running (+ a flush when a new
> one is added to it)?
>
> I'm worried about the potential performance impact for a case when a
> huge host is running a lot of small VMs in 'partitioning' mode
> (i.e. when all vCPUs are pinned). Additionally, this may have a negative
> impact on RT use-cases where each unnecessary interruption can be seen as
> problematic.
Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few
vCPUs using nested TDP could generate a lot of noisy IRQs through a VM. Hrm.
So I think the basic idea is so flawed/garbage that even enhancing it with per-VM
pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask
though, but instead of using a mask to scope IPIs, use it to elide TLB flushes.
With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time:
SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each
TDP MMU root seems reasonable. Then on task migration, instead of doing a global
INVEPT, only INVEPT the current and prev_roots (because getting a new root will
trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU
has already done a flush for the root.
Or we could do the optimized tracking for all roots. x86 supports at most 8192
CPUs, which means 1KiB per root. That doesn't seem at all painful given that
each shadow page consumes 4KiB...
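The core of that tracking could be as simple as the following (illustrative
sketch only, hypothetical field/helper names):

/* One mask per root: a bit is set once that pCPU has flushed the root. */
struct kvm_mmu_page {
	/* ... existing fields ... */
	cpumask_var_t flushed_cpus;
};

static bool kvm_root_needs_flush(struct kvm_mmu_page *root, int cpu)
{
	/* test_and_set, so each pCPU flushes a given root at most once. */
	return !cpumask_test_and_set_cpu(cpu, root->flushed_cpus);
}

On migration, KVM would then INVEPT.single only the roots for which
kvm_root_needs_flush() returns true, instead of an unconditional INVEPT.global.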
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-08-04 23:09 ` Sean Christopherson
@ 2025-08-05 18:04 ` Jeremi Piotrowski
2025-08-05 23:42 ` Sean Christopherson
From: Jeremi Piotrowski @ 2025-08-05 18:04 UTC
To: Sean Christopherson, Vitaly Kuznetsov
Cc: Dave Hansen, linux-kernel, alanjiang, chinang.ma,
andrea.pellegrini, Kevin Tian, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini, kvm
On 05/08/2025 01:09, Sean Christopherson wrote:
> On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>> It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the
>>> level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with
>>> root_to_sp()), but for the core concept/skeleton, I think this is it?
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 6e838cb6c9e1..298130445182 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
>>> }
>>> EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
>>>
>>> +struct kvm_tlb_flush_root {
>>> + struct kvm *kvm;
>>> + hpa_t root;
>>> +};
>>> +
>>> +static void kvm_flush_tlb_root(void *__data)
>>> +{
>>> + struct kvm_tlb_flush_root *data = __data;
>>> +
>>> + kvm_x86_call(flush_tlb_root)(data->kvm, data->root);
>>> +}
>>> +
>>> +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>> +{
>>> + struct kvm_tlb_flush_root data = {
>>> + .kvm = kvm,
>>> + .root = __pa(root->spt),
>>> + };
>>> +
>>> + /*
>>> + * Flush any TLB entries for the new root, the provenance of the root
>>> + * is unknown. Even if KVM ensures there are no stale TLB entries
>>> + * for a freed root, in theory another hypervisor could have left
>>> + * stale entries. Flushing on alloc also allows KVM to skip the TLB
>>> + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
>>> + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
>>> + * migrated to a different pCPU.
>>> + */
>>> + on_each_cpu(kvm_flush_tlb_root, &data, 1);
>>
>> Would it make sense to complement this with e.g. a CPU mask tracking all
>> the pCPUs where the VM has ever been seen running (+ a flush when a new
>> one is added to it)?
>>
>> I'm worried about the potential performance impact for a case when a
>> huge host is running a lot of small VMs in 'partitioning' mode
>> (i.e. when all vCPUs are pinned). Additionally, this may have a negative
>> impact on RT use-cases where each unnecessary interruption can be seen as
>> problematic.
>
> Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few
> vCPUs using nested TDP could generate a lot of noisy IRQs through a VM. Hrm.
>
> So I think the basic idea is so flawed/garbage that even enhancing it with per-VM
> pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask
> though, but instead of using a mask to scope IPIs, use it to elide TLB flushes.
Sorry for the delay in replying, I've been sidetracked a bit.
I like this idea more; not special-casing the TLB flushing approach per
hypervisor is preferable.
>
> With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time:
> SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each
> TDP MMU root seems reasonable. Then on task migration, instead of doing a global
> INVEPT, only INVEPT the current and prev_roots (because getting a new root will
> trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU
> has already done a flush for the root.
Just to make sure I follow: by current+prev_roots do you mean literally those (i.e. cached prev roots)
or all roots on kvm->arch.tdp_mmu_roots?
So this would mean: on pCPU migration, check if current mmu has is_tdp_mmu_active()
and then perform the INVEPT-single over roots instead of INVEPT-global. Otherwise stick
to the KVM_REQ_TLB_FLUSH.
Would there need to be a check for is_guest_mode(), or that the switch is coming from
the vmx/nested.c? I suppose not because nested doesn't seem to use TDP MMU.
>
> Or we could do the optimized tracking for all roots. x86 supports at most 8192
> CPUs, which means 1KiB per root. That doesn't seem at all painful given that
> each shadow page consumes 4KiB...
Similar question here: which roots would need to be tracked+flushed for shadow
paging? pae_roots?
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-08-05 18:04 ` Jeremi Piotrowski
@ 2025-08-05 23:42 ` Sean Christopherson
2025-08-15 13:49 ` Jeremi Piotrowski
From: Sean Christopherson @ 2025-08-05 23:42 UTC
To: Jeremi Piotrowski
Cc: Vitaly Kuznetsov, Dave Hansen, linux-kernel, alanjiang,
chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini,
kvm
[-- Attachment #1: Type: text/plain, Size: 5181 bytes --]
On Tue, Aug 05, 2025, Jeremi Piotrowski wrote:
> On 05/08/2025 01:09, Sean Christopherson wrote:
> > On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
> >> Sean Christopherson <seanjc@google.com> writes:
> >>> +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
> >>> +{
> >>> + struct kvm_tlb_flush_root data = {
> >>> + .kvm = kvm,
> >>> + .root = __pa(root->spt),
> >>> + };
> >>> +
> >>> + /*
> >>> + * Flush any TLB entries for the new root, the provenance of the root
> >>> + * is unknown. Even if KVM ensures there are no stale TLB entries
> >>> + * for a freed root, in theory another hypervisor could have left
> >>> + * stale entries. Flushing on alloc also allows KVM to skip the TLB
> >>> + * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
> >>> + * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
> >>> + * migrated to a different pCPU.
> >>> + */
> >>> + on_each_cpu(kvm_flush_tlb_root, &data, 1);
> >>
> >> Would it make sense to complement this with e.g. a CPU mask tracking all
> >> the pCPUs where the VM has ever been seen running (+ a flush when a new
> >> one is added to it)?
> >>
> >> I'm worried about the potential performance impact for a case when a
> >> huge host is running a lot of small VMs in 'partitioning' mode
> >> (i.e. when all vCPUs are pinned). Additionally, this may have a negative
> >> impact on RT use-cases where each unnecessary interruption can be seen as
> >> problematic.
> >
> > Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few
> > vCPUs using nested TDP could generate a lot of noisy IRQs through a VM. Hrm.
> >
> > So I think the basic idea is so flawed/garbage that even enhancing it with per-VM
> > pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask
> > though, but instead of using a mask to scope IPIs, use it to elide TLB flushes.
>
> Sorry for the delay in replying, I've been sidetracked a bit.
No worries, I guarantee my delays will make your delays pale in comparison :-D
> I like this idea more; not special-casing the TLB flushing approach per
> hypervisor is preferable.
>
> >
> > With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time:
> > SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each
> > TDP MMU root seems reasonable. Then on task migration, instead of doing a global
> > INVEPT, only INVEPT the current and prev_roots (because getting a new root will
> > trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU
> > has already done a flush for the root.
>
> Just to make sure I follow: by current+prev_roots do you mean literally those
> (i.e. cached prev roots) or all roots on kvm->arch.tdp_mmu_roots?
The former, i.e. "root" and all "prev_roots" entries in a kvm_mmu structure.
> So this would mean: on pCPU migration, check if current mmu has is_tdp_mmu_active()
> and then perform the INVEPT-single over roots instead of INVEPT-global. Otherwise stick
> to the KVM_REQ_TLB_FLUSH.
No, KVM would still need to ensure shadow roots are flushed as well, because KVM
doesn't flush TLBs when switching to a previous root (see fast_pgd_switch()).
More at the bottom.
> Would there need to be a check for is_guest_mode(), or that the switch is
> coming from the vmx/nested.c? I suppose not because nested doesn't seem to
> use TDP MMU.
Nested can use the TDP MMU, though there's practically no code in KVM that explicitly
deals with this scenario. If L1 is using legacy shadow paging, i.e. is NOT using
EPT/NPT, then KVM will use the TDP MMU to map L2 (with kvm_mmu_page_role.guest_mode=1
to differentiate from the L1 TDP MMU).
> > Or we could do the optimized tracking for all roots. x86 supports at most 8192
> > CPUs, which means 1KiB per root. That doesn't seem at all painful given that
> > each shadow pages consumes 4KiB...
>
> Similar question here: which roots would need to be tracked+flushed for shadow
> paging? pae_roots?
Same general answer, "root" and all "prev_roots" entries. KVM uses up to two
"struct kvm_mmu" instances to actually map memory into the guest: root_mmu and
guest_mmu. The third instance, nested_mmu, is used to model gva->gpa translations
for L2, i.e. is used only to walk L2 stage-1 page tables, and is never used to
map memory into the guest, i.e. can't have entries in hardware TLBs.
The basic gist is to add a cpumask in each root, and then elide TLB flushes on
pCPU migration if KVM has flushed the root at least once. Patch 5/5 in the attached
set of patches provides a *very* rough sketch. Hopefully it's enough to convey the
core idea.
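To spell out which roots such a walk would visit, something like this (an
illustration only, NOT the attached patch 5/5; flush_root_if_needed() is a
hypothetical helper that consults the per-root cpumask and does INVEPT.single):

static void kvm_flush_roots_on_migration(struct kvm_vcpu *vcpu, int cpu)
{
	/* nested_mmu never maps memory into the guest, so skip it. */
	struct kvm_mmu *mmus[] = { &vcpu->arch.root_mmu, &vcpu->arch.guest_mmu };
	int i, j;

	for (i = 0; i < ARRAY_SIZE(mmus); i++) {
		struct kvm_mmu *mmu = mmus[i];

		if (VALID_PAGE(mmu->root.hpa))
			flush_root_if_needed(vcpu, cpu, mmu->root.hpa);

		for (j = 0; j < KVM_MMU_NUM_PREV_ROOTS; j++)
			if (VALID_PAGE(mmu->prev_roots[j].hpa))
				flush_root_if_needed(vcpu, cpu, mmu->prev_roots[j].hpa);
	}
}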
Patches 1-4 compile, but are otherwise untested. I'll post patches 1-3 as a small
series once they're tested, as those cleanups are worth doing irrespective of any
optimizations we make to pCPU migration.
P.S. everyone and their mother thinks guest_mmu and nested_mmu are terrible names,
but no one has come up with names good enough to convince everyone to get out from
behind the bikeshed :-)
[-- Attachment #2: 0001-KVM-VMX-Hoist-construct_eptp-up-in-vmx.c.patch --]
[-- Type: text/x-diff, Size: 1723 bytes --]
From 8d0e63076371b04ca018577238b6d9b4e6cb1834 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 5 Aug 2025 14:29:19 -0700
Subject: [PATCH 1/5] KVM: VMX: Hoist construct_eptp() "up" in vmx.c
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/vmx.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index aa157fe5b7b3..9533eabc2182 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3188,6 +3188,20 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
return to_vmx(vcpu)->vpid;
}
+u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
+{
+ u64 eptp = VMX_EPTP_MT_WB;
+
+ eptp |= (root_level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4;
+
+ if (enable_ept_ad_bits &&
+ (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu)))
+ eptp |= VMX_EPTP_AD_ENABLE_BIT;
+ eptp |= root_hpa;
+
+ return eptp;
+}
+
void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
@@ -3365,20 +3379,6 @@ static int vmx_get_max_ept_level(void)
return 4;
}
-u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
-{
- u64 eptp = VMX_EPTP_MT_WB;
-
- eptp |= (root_level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4;
-
- if (enable_ept_ad_bits &&
- (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu)))
- eptp |= VMX_EPTP_AD_ENABLE_BIT;
- eptp |= root_hpa;
-
- return eptp;
-}
-
void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
{
struct kvm *kvm = vcpu->kvm;
base-commit: 196d9e72c4b0bd68b74a4ec7f52d248f37d0f030
--
2.50.1.565.gc32cd1483b-goog
[-- Attachment #3: 0002-KVM-nVMX-Hardcode-dummy-EPTP-used-for-early-nested-c.patch --]
[-- Type: text/x-diff, Size: 2483 bytes --]
From 2ca5f9bccff0458dab303d1929b9e13e869b7c85 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 5 Aug 2025 14:29:46 -0700
Subject: [PATCH 2/5] KVM: nVMX: Hardcode dummy EPTP used for early nested
consistency checks
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/nested.c | 8 +++-----
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/vmx.h | 1 -
3 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b8ea1969113d..f3f5da3ee2cc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2278,13 +2278,11 @@ static void prepare_vmcs02_constant_state(struct vcpu_vmx *vmx)
vmx->nested.vmcs02_initialized = true;
/*
- * We don't care what the EPTP value is we just need to guarantee
- * it's valid so we don't get a false positive when doing early
- * consistency checks.
+ * If early consistency checks are enabled, stuff the EPT Pointer with
+ * a dummy *legal* value to avoid false positives on bad control state.
*/
if (enable_ept && nested_early_check)
- vmcs_write64(EPT_POINTER,
- construct_eptp(&vmx->vcpu, 0, PT64_ROOT_4LEVEL));
+ vmcs_write64(EPT_POINTER, VMX_EPTP_MT_WB | VMX_EPTP_PWL_4);
if (vmx->ve_info)
vmcs_write64(VE_INFORMATION_ADDRESS, __pa(vmx->ve_info));
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9533eabc2182..8fc114e6fa56 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3188,7 +3188,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
return to_vmx(vcpu)->vpid;
}
-u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
+static u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
{
u64 eptp = VMX_EPTP_MT_WB;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d3389baf3ab3..7c3f8b908c69 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -366,7 +366,6 @@ void set_cr4_guest_host_mask(struct vcpu_vmx *vmx);
void ept_save_pdptrs(struct kvm_vcpu *vcpu);
void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
-u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu);
void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
--
2.50.1.565.gc32cd1483b-goog
[-- Attachment #4: 0003-KVM-VMX-Use-kvm_mmu_page-role-to-construct-EPTP-not-.patch --]
[-- Type: text/x-diff, Size: 2624 bytes --]
From f79f76040166e741261e5f819ed23595922a8ba2 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 5 Aug 2025 14:46:31 -0700
Subject: [PATCH 3/5] KVM: VMX: Use kvm_mmu_page role to construct EPTP, not
current vCPU state
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/vmx.c | 37 ++++++++++++++++++++++++++-----------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8fc114e6fa56..2408aae01837 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3188,20 +3188,36 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
return to_vmx(vcpu)->vpid;
}
-static u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
+static u64 construct_eptp(hpa_t root_hpa)
{
- u64 eptp = VMX_EPTP_MT_WB;
+ struct kvm_mmu_page *root = root_to_sp(root_hpa);
+ u64 eptp = root_hpa | VMX_EPTP_MT_WB;
- eptp |= (root_level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4;
+ /*
+ * EPT roots should always have an associated MMU page. Return a "bad"
+ * EPTP to induce VM-Fail instead of continuing on in an unknown state.
+ */
+ if (WARN_ON_ONCE(!root))
+ return INVALID_PAGE;
- if (enable_ept_ad_bits &&
- (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu)))
+ eptp |= (root->role.level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4;
+
+ if (enable_ept_ad_bits && !root->role.ad_disabled)
eptp |= VMX_EPTP_AD_ENABLE_BIT;
- eptp |= root_hpa;
return eptp;
}
+static void vmx_flush_tlb_ept_root(hpa_t root_hpa)
+{
+ u64 eptp = construct_eptp(root_hpa);
+
+ if (VALID_PAGE(eptp))
+ ept_sync_context(eptp);
+ else
+ ept_sync_global();
+}
+
void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
@@ -3212,8 +3228,7 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
return;
if (enable_ept)
- ept_sync_context(construct_eptp(vcpu, root_hpa,
- mmu->root_role.level));
+ vmx_flush_tlb_ept_root(root_hpa);
else
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
@@ -3384,11 +3399,11 @@ void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
struct kvm *kvm = vcpu->kvm;
bool update_guest_cr3 = true;
unsigned long guest_cr3;
- u64 eptp;
if (enable_ept) {
- eptp = construct_eptp(vcpu, root_hpa, root_level);
- vmcs_write64(EPT_POINTER, eptp);
+ KVM_MMU_WARN_ON(!root_to_sp(root_hpa) ||
+ root_level != root_to_sp(root_hpa)->role.level);
+ vmcs_write64(EPT_POINTER, construct_eptp(root_hpa));
hv_track_root_tdp(vcpu, root_hpa);
--
2.50.1.565.gc32cd1483b-goog
[-- Attachment #5: 0004-KVM-VMX-Flush-only-active-EPT-roots-on-pCPU-migratio.patch --]
[-- Type: text/x-diff, Size: 1998 bytes --]
From 501f4c799f207a07933279485f76205f91e4537f Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 5 Aug 2025 15:13:27 -0700
Subject: [PATCH 4/5] KVM: VMX: Flush only active EPT roots on pCPU migration
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/vmx.c | 27 ++++++++++++++++++++++++++-
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2408aae01837..b42747e2293d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1395,6 +1395,8 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
}
}
+static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu);
+
void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1431,7 +1433,12 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
* Flush all EPTP/VPID contexts, the new pCPU may have stale
* TLB entries from its previous association with the vCPU.
*/
- kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+ if (enable_ept) {
+ vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu);
+ vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu);
+ } else {
+ kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+ }
/*
* Linux uses per-cpu TSS and GDT, so set these when switching
@@ -3254,6 +3261,24 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
+static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa)
+{
+ if (!VALID_PAGE(root_hpa))
+ return;
+
+ vmx_flush_tlb_ept_root(root_hpa);
+}
+
+static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu)
+{
+ int i;
+
+ __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa);
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa);
+}
+
void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
--
2.50.1.565.gc32cd1483b-goog
[-- Attachment #6: 0005-KVM-VMX-Sketch-in-possible-framework-for-eliding-TLB.patch --]
[-- Type: text/x-diff, Size: 3484 bytes --]
From ca798b2e1de4d0975ee808108c7514fe738f0898 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 5 Aug 2025 15:58:13 -0700
Subject: [PATCH 5/5] KVM: VMX: Sketch in possible framework for eliding TLB
flushes on pCPU migration
Not-Signed-off-by: Sean Christopherson <seanjc@google.com>
(anyone that makes this work deserves full credit)
---
arch/x86/kvm/mmu/mmu.c | 3 +++
arch/x86/kvm/mmu/tdp_mmu.c | 2 ++
arch/x86/kvm/vmx/vmx.c | 21 ++++++++++++++-------
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e838cb6c9e1..925efbaae9b9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3854,6 +3854,9 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
++sp->root_count;
+ if (level >= PT64_ROOT_4LEVEL)
+ kvm_x86_call(alloc_root_cpu_mask)(sp);
+
return __pa(sp->spt);
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..bf4b0b9a7816 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -293,6 +293,8 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
root = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_sp(root, NULL, 0, role);
+ kvm_x86_call(alloc_root_cpu_mask)(root);
+
/*
* TDP MMU roots are kept until they are explicitly invalidated, either
* by a memslot update or by the destruction of the VM. Initialize the
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b42747e2293d..e85830189cfc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1395,7 +1395,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
}
}
-static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu);
+static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu);
void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
{
@@ -1434,8 +1434,8 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
* TLB entries from its previous association with the vCPU.
*/
if (enable_ept) {
- vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu);
- vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu);
+ vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu, cpu);
+ vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu, cpu);
} else {
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
}
@@ -3261,22 +3261,29 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
-static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa)
+static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa, int cpu)
{
+ struct kvm_mmu_page *root;
+
if (!VALID_PAGE(root_hpa))
return;
+ root = root_to_sp(root_hpa);
+ if (!WARN_ON_ONCE(!root) &&
+ cpumask_test_and_set_cpu(cpu, root->cpu_flushed_mask))
+ return;
+
vmx_flush_tlb_ept_root(root_hpa);
}
-static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu)
+static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu)
{
int i;
- __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa);
+ __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa, cpu);
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
- __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa);
+ __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa, cpu);
}
void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
--
2.50.1.565.gc32cd1483b-goog
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-08-05 23:42 ` Sean Christopherson
@ 2025-08-15 13:49 ` Jeremi Piotrowski
2025-08-19 22:50 ` Sean Christopherson
0 siblings, 1 reply; 12+ messages in thread
From: Jeremi Piotrowski @ 2025-08-15 13:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Vitaly Kuznetsov, Dave Hansen, linux-kernel, alanjiang,
chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini,
kvm
[-- Attachment #1: Type: text/plain, Size: 4470 bytes --]
On Tue, Aug 05, 2025 at 04:42:46PM -0700, Sean Christopherson wrote:
> On Tue, Aug 05, 2025, Jeremi Piotrowski wrote:
> > On 05/08/2025 01:09, Sean Christopherson wrote:
> > > On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
> > >> Sean Christopherson <seanjc@google.com> writes:
(snip)
> > >
> > > Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few
> > > vCPUs using nested TDP could generate a lot of noisy IRQs through a VM. Hrm.
> > >
> > > So I think the basic idea is so flawed/garbage that even enhancing it with per-VM
> > > pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask
> > > though: instead of using a mask to scope IPIs, use it to elide TLB flushes.
> >
> > Sorry for the delay in replying, I've been sidetracked a bit.
>
> No worries, I guarantee my delays will make your delays pale in comparison :-D
>
> > I like this idea more; not special-casing the TLB flushing approach per hypervisor is
> > preferable.
> >
> > >
> > > With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time:
> > > SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each
> > > TDP MMU root seems reasonable. Then on task migration, instead of doing a global
> > > INVEPT, only INVEPT the current and prev_roots (because getting a new root will
> > > trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU
> > > has already done a flush for the root.
> >
> > Just to make sure I follow: by current+prev_roots do you mean literally those
> > (i.e. cached prev roots) or all roots on kvm->arch.tdp_mmu_roots?
>
> The former, i.e. "root" and all "prev_roots" entries in a kvm_mmu structure.
>
> > So this would mean: on pCPU migration, check is_tdp_mmu_active() on the current
> > mmu, and then perform an INVEPT-single over the roots instead of INVEPT-global. Otherwise stick
> > to the KVM_REQ_TLB_FLUSH.
>
> No, KVM would still need to ensure shadow roots are flushed as well, because KVM
> doesn't flush TLBs when switching to a previous root (see fast_pgd_switch()).
> More at the bottom.
>
> > Would there need to be a check for is_guest_mode(), or that the switch is
> > coming from vmx/nested.c? I suppose not because nested doesn't seem to
> > use TDP MMU.
>
> Nested can use the TDP MMU, though there's practically no code in KVM that explicitly
> deals with this scenario. If L1 is using legacy shadow paging, i.e. is NOT using
> EPT/NPT, then KVM will use the TDP MMU to map L2 (with kvm_mmu_page_role.guest_mode=1
> to differentiate from the L1 TDP MMU).
>
> > > Or we could do the optimized tracking for all roots. x86 supports at most 8192
> > > CPUs, which means 1KiB per root. That doesn't seem at all painful given that
> > > each shadow page consumes 4KiB...
> >
> > Similar question here: which roots would need to be tracked+flushed for shadow
> > paging? pae_roots?
>
> Same general answer, "root" and all "prev_roots" entries. KVM uses up to two
> "struct kvm_mmu" instances to actually map memory into the guest: root_mmu and
> guest_mmu. The third instance, nested_mmu, is used to model gva->gpa translations
> for L2, i.e. is used only to walk L2 stage-1 page tables, and is never used to
> map memory into the guest, i.e. can't have entries in hardware TLBs.
>
> The basic gist is to add a cpumask in each root, and then elide TLB flushes on
> pCPU migration if KVM has flushed the root at least once. Patch 5/5 in the attached
> set of patches provides a *very* rough sketch. Hopefully it's enough to convey the
> core idea.
>
> Patches 1-4 compile, but are otherwise untested. I'll post patches 1-3 as a small
> series once they're tested, as those cleanups are worth doing irrespective of any
> optimizations we make to pCPU migration.
>
Thanks for the detailed explanation and the patches, Sean!
I started working on extending patch 5 and wanted to post it here to make sure I'm
on the right track.
It works in testing so far and shows promising performance - it gets rid of all
the pathological cases I saw before.
I haven't checked whether I broke SVM yet, and I need to figure out a way to
always keep the cpumask "offstack" so that we don't blow up every struct
kvm_mmu_page instance with an inline cpumask - it needs to stay optional.
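For context, cpumask_var_t is only heap-backed when CONFIG_CPUMASK_OFFSTACK=y,
roughly:

        /* Simplified from <linux/cpumask.h>: */
        #ifdef CONFIG_CPUMASK_OFFSTACK
        typedef struct cpumask *cpumask_var_t;      /* pointer, allocated separately */
        #else
        typedef struct cpumask cpumask_var_t[1];    /* NR_CPUS bits embedded inline */
        #endif

so on !CPUMASK_OFFSTACK configs the mask gets embedded in every struct
kvm_mmu_page, which is exactly the bloat I want to avoid.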
I also came across kvm_mmu_is_dummy_root(); that check is included in
root_to_sp(). Can you think of any other checks that we might need to handle?
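For reference, the check in question, simplified from mmu_internal.h:

        static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
        {
                if (kvm_mmu_is_dummy_root(root))
                        return NULL;

                /*
                 * "Special" roots, e.g. PAE entries, have no backing shadow
                 * page and also resolve to NULL here.
                 */
                return spte_to_child_sp(root);
        }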
[-- Attachment #2: 0001-KVM-VMX-Sketch-in-possible-framework-for-eliding-TLB.patch --]
[-- Type: text/x-diff, Size: 8063 bytes --]
From 8fb6d18ad4cbdd1802df45be49358a6d6acf72a0 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 5 Aug 2025 15:58:13 -0700
Subject: [PATCH] KVM: VMX: Sketch in possible framework for eliding TLB
flushes on pCPU migration
Not-Signed-off-by: Sean Christopherson <seanjc@google.com>
(anyone that makes this work deserves full credit)
Not-yet-Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 5 +++++
arch/x86/kvm/mmu/mmu_internal.h | 4 ++++
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++++
arch/x86/kvm/vmx/main.c | 1 +
arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++-------
arch/x86/kvm/vmx/x86_ops.h | 1 +
8 files changed, 40 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8d50e3e0a19b..60351dd22f2f 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -99,6 +99,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
KVM_X86_OP_OPTIONAL(set_external_spte)
KVM_X86_OP_OPTIONAL(free_external_spt)
KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(alloc_root_cpu_mask)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b4a391929cdb..a3d415c3ea8b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1801,6 +1801,9 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
+ /* Allocate per-root pCPU flush mask. */
+ void (*alloc_root_cpu_mask)(struct kvm_mmu_page *root);
+
/* Update external mapping with page table link. */
int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
void *external_spt);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4e06e2e89a8f..721ee8ea76bd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -20,6 +20,7 @@
#include "ioapic.h"
#include "mmu.h"
#include "mmu_internal.h"
+#include <linux/cpumask.h>
#include "tdp_mmu.h"
#include "x86.h"
#include "kvm_cache_regs.h"
@@ -1820,6 +1821,7 @@ static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
list_del(&sp->link);
free_page((unsigned long)sp->spt);
free_page((unsigned long)sp->shadowed_translation);
+ free_cpumask_var(sp->cpu_flushed_mask);
kmem_cache_free(mmu_page_header_cache, sp);
}
@@ -3827,6 +3829,9 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
++sp->root_count;
+ if (level >= PT64_ROOT_4LEVEL)
+ kvm_x86_call(alloc_root_cpu_mask)(sp);
+
return __pa(sp->spt);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..5acb3dd34b36 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -7,6 +7,7 @@
#include <asm/kvm_host.h>
#include "mmu.h"
+#include <linux/cpumask.h>
#ifdef CONFIG_KVM_PROVE_MMU
#define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
@@ -145,6 +146,9 @@ struct kvm_mmu_page {
/* Used for freeing the page asynchronously if it is a TDP MMU page. */
struct rcu_head rcu_head;
#endif
+
+ /* Mask tracking which host CPUs have flushed this EPT root */
+ cpumask_var_t cpu_flushed_mask;
};
extern struct kmem_cache *mmu_page_header_cache;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..40c7f46f553c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -3,6 +3,7 @@
#include "mmu.h"
#include "mmu_internal.h"
+#include <linux/cpumask.h>
#include "mmutrace.h"
#include "tdp_iter.h"
#include "tdp_mmu.h"
@@ -57,6 +58,7 @@ static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
free_page((unsigned long)sp->external_spt);
free_page((unsigned long)sp->spt);
+ free_cpumask_var(sp->cpu_flushed_mask);
kmem_cache_free(mmu_page_header_cache, sp);
}
@@ -293,6 +295,8 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
root = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_sp(root, NULL, 0, role);
+ kvm_x86_call(alloc_root_cpu_mask)(root);
+
/*
* TDP MMU roots are kept until they are explicitly invalidated, either
* by a memslot update or by the destruction of the VM. Initialize the
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..ec7f6899443d 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -1005,6 +1005,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.write_tsc_multiplier = vt_op(write_tsc_multiplier),
.load_mmu_pgd = vt_op(load_mmu_pgd),
+ .alloc_root_cpu_mask = vmx_alloc_root_cpu_mask,
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index eec2d866e7f1..a6d93624c2d4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -28,6 +28,7 @@
#include <linux/slab.h>
#include <linux/tboot.h>
#include <linux/trace_events.h>
+#include <linux/cpumask.h>
#include <linux/entry-kvm.h>
#include <asm/apic.h>
@@ -62,6 +63,7 @@
#include "kvm_cache_regs.h"
#include "lapic.h"
#include "mmu.h"
+#include "mmu/spte.h"
#include "nested.h"
#include "pmu.h"
#include "sgx.h"
@@ -1450,7 +1452,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
}
}
-static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu);
+static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu);
void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
{
@@ -1489,8 +1491,8 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
* TLB entries from its previous association with the vCPU.
*/
if (enable_ept) {
- vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu);
- vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu);
+ vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu, cpu);
+ vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu, cpu);
} else {
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
}
@@ -3307,22 +3309,34 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
-static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa)
+void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root)
{
+ WARN_ON_ONCE(!zalloc_cpumask_var(&root->cpu_flushed_mask,
+ GFP_KERNEL_ACCOUNT));
+}
+
+static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa, int cpu)
+{
+ struct kvm_mmu_page *root;
+
if (!VALID_PAGE(root_hpa))
return;
+ root = root_to_sp(root_hpa);
+ if (!root || cpumask_test_and_set_cpu(cpu, root->cpu_flushed_mask))
+ return;
+
vmx_flush_tlb_ept_root(root_hpa);
}
-static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu)
+static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu)
{
int i;
- __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa);
+ __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa, cpu);
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
- __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa);
+ __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa, cpu);
}
void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b4596f651232..4406d53e6ebe 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -84,6 +84,7 @@ void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root);
void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
--
2.39.5
* Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
2025-08-15 13:49 ` Jeremi Piotrowski
@ 2025-08-19 22:50 ` Sean Christopherson
0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-08-19 22:50 UTC (permalink / raw)
To: Jeremi Piotrowski
Cc: Vitaly Kuznetsov, Dave Hansen, linux-kernel, alanjiang,
chinang.ma, andrea.pellegrini, Kevin Tian, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, linux-hyperv, Paolo Bonzini,
kvm
On Fri, Aug 15, 2025, Jeremi Piotrowski wrote:
> On Tue, Aug 05, 2025 at 04:42:46PM -0700, Sean Christopherson wrote:
> I started working on extending patch 5 and wanted to post it here to make sure I'm
> on the right track.
>
> It works in testing so far and shows promising performance - it gets rid of all
> the pathological cases I saw before.
Nice :-)
> I haven't checked whether I broke SVM yet, and I need to figure out a way to
> always keep the cpumask "offstack" so that we don't blow up every struct
> kvm_mmu_page instance with an inline cpumask - it needs to stay optional.
Doh, I meant to include an idea or two for this in my earlier response. The
best I can come up with is below.
> I also came across kvm_mmu_is_dummy_root(); that check is included in
> root_to_sp(). Can you think of any other checks that we might need to handle?
Don't think so?
> @@ -3827,6 +3829,9 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> ++sp->root_count;
>
> + if (level >= PT64_ROOT_4LEVEL)
Was this my code? If so, we should move this into the VMX code, because the fact
that PAE roots can be ignored is really a detail of nested EPT, not the overall
scheme.
> + kvm_x86_call(alloc_root_cpu_mask)(sp);
Ah shoot. Allocating here won't work, because mmu_lock is held and allocating
might sleep. I don't want to force an atomic allocation, because that can dip
into pools that KVM really shouldn't use.
The "standard" way KVM deals with this is to utilize a kvm_mmu_memory_cache. If
we do that and add e.g. kvm_vcpu_arch.mmu_roots_flushed_cache, then we can trivially
do the allocation in mmu_topup_memory_caches(). That would eliminate the error
handling in vmx_alloc_root_cpu_mask(), and might make it slightly less awful to
deal with the "offstack" cpumask.
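Something like this (rough sketch with made-up names, assuming the mask becomes
a plain "struct cpumask *" per your "offstack" plan; the cache would also need
a kmem_cache sized to cpumask_size(), as a cache without one hands out whole
pages):

        /* In mmu_topup_memory_caches(), outside of mmu_lock, so it may sleep: */
        r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_roots_flushed_cache, 1);
        if (r)
                return r;

        /* At root allocation, under mmu_lock; consuming the cache won't fail: */
        sp->cpu_flushed_mask =
                kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_roots_flushed_cache);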
Hmm, and then instead of calling into VMX to do the allocation, maybe just have
a flag to communicate that vendor code wants per-root flush tracking? I haven't
thought hard about SVM, but I wouldn't be surprised if SVM ends up wanting the
same functionality after we switch to per-vCPU ASIDs.
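I.e. something like (hypothetical name):

        /* In struct kvm_x86_ops, replacing the alloc_root_cpu_mask hook: */
        bool track_flushed_roots;

and then common MMU code gates both the cache topup and the per-root allocation
on kvm_x86_ops.track_flushed_roots instead of calling into vendor code.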
> +
> return __pa(sp->spt);
> }
...
> @@ -3307,22 +3309,34 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
> vpid_sync_context(vmx_get_current_vpid(vcpu));
> }
>
> -static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa)
> +void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root)
> {
This should be conditioned on enable_ept.
> + WARN_ON_ONCE(!zalloc_cpumask_var(&root->cpu_flushed_mask,
> + GFP_KERNEL_ACCOUNT));
> +}
> +
> +static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa, int cpu)
> +{
> + struct kvm_mmu_page *root;
> +
> if (!VALID_PAGE(root_hpa))
> return;
>
> + root = root_to_sp(root_hpa);
> + if (!root || cpumask_test_and_set_cpu(cpu, root->cpu_flushed_mask))
Hmm, this should flush if "root" is NULL, because the aforementioned "special"
roots don't have a shadow page.
But unless I'm missing an edge case (of an edge case), this particular code can
WARN_ON_ONCE() since EPT should never need to use any of the special roots. We
might need to filter out dummy roots somewhere to avoid false positives, but that
should be easy enough.
For the mask, it's probably worth splitting test_and_set into separate operations,
as the common case will likely be that the root has been used on this pCPU. The
test_and_set version will generate a LOCK BTS instruction, and so for the common
case where the bit is already set, KVM will generate an atomic access, which can
cause noise/bottlenecks.
E.g.

        if (WARN_ON_ONCE(!root))
                goto flush;

        if (cpumask_test_cpu(cpu, root->cpu_flushed_mask))
                return;

        cpumask_set_cpu(cpu, root->cpu_flushed_mask);
flush:
        vmx_flush_tlb_ept_root(root_hpa);
> + return;
> +
> vmx_flush_tlb_ept_root(root_hpa);
> }
>
> -static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu)
> +static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu)
> {
> int i;
>
> - __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa);
> + __vmx_flush_ept_on_pcpu_migration(mmu->root.hpa, cpu);
>
> for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> - __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa);
> + __vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa, cpu);
> }
>
> void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index b4596f651232..4406d53e6ebe 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -84,6 +84,7 @@ void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
> void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
> void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
> void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
> +void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root);
> void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
> u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
> void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
> --
> 2.39.5
>