* [RFC PATCH 00/13] Optimize nSVM TLB flushes
@ 2025-02-05 18:23 Yosry Ahmed
  2025-02-05 18:23 ` [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB Yosry Ahmed
                   ` (12 more replies)
  0 siblings, 13 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Currently KVM does a TLB flush and an MMU sync on every nested
transition (L1 <-> L2), because it uses the same ASID to run both L1 and
L2.

This series addresses that by giving a separate ASID to L2, adding the
necessary TLB management for it, and properly virtualizing TLB flushes
for L1.
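
Concretely, the series replaces the unconditional requests made on every
transition with targeted ones. A condensed sketch, paraphrased from the
patches below (not literal, contiguous code):

  /* Before: on every L1 <-> L2 transition, in both directions */
  kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
  kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);

  /* After: flush only when architecturally required, e.g. on entry */
  if (svm->nested.ctl.tlb_ctl != TLB_CONTROL_DO_NOTHING)
          kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
  if (svm->nested.ctl.asid != svm->nested.last_asid) {
          svm->nested.last_asid = svm->nested.ctl.asid;
          kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
  }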

Patch 1 introduces a separate ASID for L2, although it is not properly
maintained yet, so the unconditional flushes are kept.

Patches 2 to 6 are some refactoring and groundwork.

Patches 7 to 12 add the actual TLB management for nSVM; several of them
address items on the TODO list in nested_svm_transition_tlb_flush().

Patch 13 finally stops the unconditional flushes on every nested
transition.

I tested this by booting an L2 and running some basic workloads,
including a CPUID microbenchmark to measure the performance improvement
(numbers in the last patch). I am sending this RFC to get feedback on
the general approach, and meanwhile I will try to run more tests that
exercise TLB flushing.

Yosry Ahmed (13):
  KVM: nSVM: Track the ASID per-VMCB
  KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB
  KVM: nSVM: Split nested_svm_transition_tlb_flush() into entry/exit fns
  KVM: SVM: Introduce helpers for updating TLB_CONTROL
  KVM: x86/mmu: rename __kvm_mmu_invalidate_addr()
  KVM: x86/mmu: Allow skipping the gva flush in
    kvm_mmu_invalidate_addr()
  KVM: nSVM: Handle INVLPGA interception correctly
  KVM: nSVM: Flush both L1 and L2 ASIDs on KVM_REQ_TLB_FLUSH
  KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL
  KVM: nSVM: Flush the TLB if L1 changes L2's ASID
  KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry
  KVM: nSVM: Service local TLB flushes before nested transitions
  KVM: nSVM: Stop bombing the TLB on nested transitions

 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/include/asm/svm.h      |  6 ---
 arch/x86/kvm/mmu/mmu.c          | 22 +++++---
 arch/x86/kvm/svm/nested.c       | 64 +++++++++++++++-------
 arch/x86/kvm/svm/sev.c          |  4 +-
 arch/x86/kvm/svm/svm.c          | 95 ++++++++++++++++++++++++++-------
 arch/x86/kvm/svm/svm.h          | 33 +++++++++++-
 7 files changed, 170 insertions(+), 56 deletions(-)

-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-03-01  0:03   ` Sean Christopherson
  2025-03-01  1:23   ` Maxim Levitsky
  2025-02-05 18:23 ` [RFC PATCH 02/13] KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB Yosry Ahmed
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

The ASID is currently tracked per-vCPU, because the same ASID is used by
L1 and L2. That ASID is flushed on every transition between L1 and L2.

Track the ASID separately for each VMCB (similar to the
asid_generation), giving L2 a separate ASID. This is in preparation for
doing fine-grained TLB flushes on nested transitions instead of
unconditional full flushes.

The ASIDs are still not fully maintained (e.g. a remote flush will only
flush the current ASID), so keep the TLB flush on every transition until
this is sorted out.
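
To illustrate the gap, a simplified sketch (not part of the patch):
svm_flush_tlb_all(), which services remote flushes, still boils down to
flushing only the ASID of whichever VMCB is current:

  static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
  {
          /*
           * Flushes only current_vmcb's ASID; the other VMCB's ASID may
           * remain stale. Fixed in patch 8.
           */
          svm_flush_tlb_asid(vcpu);
  }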

L1's ASID will be flushed on KVM_REQ_TLB_FLUSH_GUEST if it is the
active context, so remove the TODO in nested_svm_transition_tlb_flush()
about it.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c |  1 -
 arch/x86/kvm/svm/sev.c    |  2 +-
 arch/x86/kvm/svm/svm.c    | 12 +++++++-----
 arch/x86/kvm/svm/svm.h    |  2 +-
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 04c375bf1ac2a..bbe4f3ac9f250 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -495,7 +495,6 @@ static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
 	 *  - Honor L1's request to flush an ASID on nested VMRUN
 	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
 	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
-	 *  - Flush L1's ASID on KVM_REQ_TLB_FLUSH_GUEST
 	 *
 	 * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
 	 *     NPT guest-physical mappings on VMRUN.
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 799f8494b599c..b0adfd0537d00 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3468,7 +3468,7 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu)
 	unsigned int asid = sev_get_asid(svm->vcpu.kvm);
 
 	/* Assign the asid allocated with this SEV guest */
-	svm->asid = asid;
+	svm->current_vmcb->asid = asid;
 
 	/*
 	 * Flush guest TLB:
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7640a84e554a6..08340ae57777b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1335,8 +1335,10 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 		save->g_pat = vcpu->arch.pat;
 		save->cr3 = 0;
 	}
-	svm->current_vmcb->asid_generation = 0;
-	svm->asid = 0;
+	svm->vmcb01.asid_generation = 0;
+	svm->vmcb01.asid = 0;
+	svm->nested.vmcb02.asid_generation = 0;
+	svm->nested.vmcb02.asid = 0;
 
 	svm->nested.vmcb12_gpa = INVALID_GPA;
 	svm->nested.last_vmcb12_gpa = INVALID_GPA;
@@ -1988,7 +1990,7 @@ static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
 	}
 
 	svm->current_vmcb->asid_generation = sd->asid_generation;
-	svm->asid = sd->next_asid++;
+	svm->current_vmcb->asid = sd->next_asid++;
 }
 
 static void svm_set_dr6(struct vcpu_svm *svm, unsigned long value)
@@ -4235,8 +4237,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 
 	sync_lapic_to_cr8(vcpu);
 
-	if (unlikely(svm->asid != svm->vmcb->control.asid)) {
-		svm->vmcb->control.asid = svm->asid;
+	if (unlikely(svm->current_vmcb->asid != svm->vmcb->control.asid)) {
+		svm->vmcb->control.asid = svm->current_vmcb->asid;
 		vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
 	}
 	svm->vmcb->save.cr2 = vcpu->arch.cr2;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9d7cdb8fbf872..ebbb0b1a64676 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -133,6 +133,7 @@ struct kvm_vmcb_info {
 	unsigned long pa;
 	int cpu;
 	uint64_t asid_generation;
+	u32 asid;
 };
 
 struct vmcb_save_area_cached {
@@ -247,7 +248,6 @@ struct vcpu_svm {
 	struct vmcb *vmcb;
 	struct kvm_vmcb_info vmcb01;
 	struct kvm_vmcb_info *current_vmcb;
-	u32 asid;
 	u32 sysenter_esp_hi;
 	u32 sysenter_eip_hi;
 	uint64_t tsc_aux;
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 02/13] KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
  2025-02-05 18:23 ` [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-03-01  1:29   ` Maxim Levitsky
  2025-02-05 18:23 ` [RFC PATCH 03/13] KVM: nSVM: Split nested_svm_transition_tlb_flush() into entry/exit fns Yosry Ahmed
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

svm_flush_tlb_asid() currently operates on the current VMCB. In
preparation for properly tracking TLB flushes for L1 and L2 ASIDs,
refactor it to work on a given VMCB. All existing callers pass the
current VMCB.

Create a svm_flush_tlb_guest() wrapper to use as the flush_tlb_guest()
callback.

To maintain current behavior, kvm_hv_vcpu_purge_flush_tlb() is only
called when the current VMCB is passed.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/svm.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 08340ae57777b..2108b48ba4959 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3954,7 +3954,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
 	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 }
 
-static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
+static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu, struct kvm_vmcb_info *vmcb)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -3963,7 +3963,8 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
 	 * A TLB flush for the current ASID flushes both "host" and "guest" TLB
 	 * entries, and thus is a superset of Hyper-V's fine grained flushing.
 	 */
-	kvm_hv_vcpu_purge_flush_tlb(vcpu);
+	if (vmcb == svm->current_vmcb)
+		kvm_hv_vcpu_purge_flush_tlb(vcpu);
 
 	/*
 	 * Flush only the current ASID even if the TLB flush was invoked via
@@ -3973,14 +3974,15 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
 	 * VM-Exit (via kvm_mmu_reset_context()).
 	 */
 	if (static_cpu_has(X86_FEATURE_FLUSHBYASID))
-		svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
+		vmcb->ptr->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
 	else
-		svm->current_vmcb->asid_generation--;
+		vmcb->asid_generation--;
 }
 
 static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
 {
 	hpa_t root_tdp = vcpu->arch.mmu->root.hpa;
+	struct vcpu_svm *svm = to_svm(vcpu);
 
 	/*
 	 * When running on Hyper-V with EnlightenedNptTlb enabled, explicitly
@@ -3991,11 +3993,13 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
 	if (svm_hv_is_enlightened_tlb_enabled(vcpu) && VALID_PAGE(root_tdp))
 		hyperv_flush_guest_mapping(root_tdp);
 
-	svm_flush_tlb_asid(vcpu);
+	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
 }
 
 static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_svm *svm = to_svm(vcpu);
+
 	/*
 	 * When running on Hyper-V with EnlightenedNptTlb enabled, remote TLB
 	 * flushes should be routed to hv_flush_remote_tlbs() without requesting
@@ -4006,7 +4010,7 @@ static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
 	if (WARN_ON_ONCE(svm_hv_is_enlightened_tlb_enabled(vcpu)))
 		hv_flush_remote_tlbs(vcpu->kvm);
 
-	svm_flush_tlb_asid(vcpu);
+	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
 }
 
 static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
@@ -4016,6 +4020,13 @@ static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
 	invlpga(gva, svm->vmcb->control.asid);
 }
 
+static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
+}
+
 static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -5055,7 +5066,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.flush_tlb_all = svm_flush_tlb_all,
 	.flush_tlb_current = svm_flush_tlb_current,
 	.flush_tlb_gva = svm_flush_tlb_gva,
-	.flush_tlb_guest = svm_flush_tlb_asid,
+	.flush_tlb_guest = svm_flush_tlb_guest,
 
 	.vcpu_pre_run = svm_vcpu_pre_run,
 	.vcpu_run = svm_vcpu_run,
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 03/13] KVM: nSVM: Split nested_svm_transition_tlb_flush() into entry/exit fns
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
  2025-02-05 18:23 ` [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB Yosry Ahmed
  2025-02-05 18:23 ` [RFC PATCH 02/13] KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-03-01  1:34   ` Maxim Levitsky
  2025-02-05 18:23 ` [RFC PATCH 04/13] KVM: SVM: Introduce helpers for updating TLB_CONTROL Yosry Ahmed
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

The handling for the entry and exit TLB flushes will diverge
significantly in the following changes. Instead of adding an 'is_vmenter'
argument like nested_vmx_transition_tlb_flush(), just split the function
into two variants for 'entry' and 'exit'.
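
For reference, the nVMX counterpart keeps a single function and threads
a boolean through instead (signature paraphrased from
arch/x86/kvm/vmx/nested.c):

  static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
                                              struct vmcs12 *vmcs12,
                                              bool is_vmenter);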

No functional change intended.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bbe4f3ac9f250..2eba36af44f22 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -482,7 +482,7 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
 	vmcb12->control.exit_int_info = exit_int_info;
 }
 
-static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
+static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
 {
 	/* Handle pending Hyper-V TLB flush requests */
 	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
@@ -503,6 +503,16 @@ static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 }
 
+static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
+{
+	/* Handle pending Hyper-V TLB flush requests */
+	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
+
+	/* See nested_svm_entry_tlb_flush() */
+	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
+	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+}
+
 /*
  * Load guest's/host's cr3 on nested vmentry or vmexit. @nested_npt is true
  * if we are emulating VM-Entry into a guest with NPT enabled.
@@ -645,7 +655,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 	u32 pause_count12;
 	u32 pause_thresh12;
 
-	nested_svm_transition_tlb_flush(vcpu);
+	nested_svm_entry_tlb_flush(vcpu);
 
 	/* Enter Guest-Mode */
 	enter_guest_mode(vcpu);
@@ -1131,7 +1141,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
 	kvm_vcpu_unmap(vcpu, &map);
 
-	nested_svm_transition_tlb_flush(vcpu);
+	nested_svm_exit_tlb_flush(vcpu);
 
 	nested_svm_uninit_mmu_context(vcpu);
 
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 04/13] KVM: SVM: Introduce helpers for updating TLB_CONTROL
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (2 preceding siblings ...)
  2025-02-05 18:23 ` [RFC PATCH 03/13] KVM: nSVM: Split nested_svm_transition_tlb_flush() into entry/exit fns Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-03-01  1:37   ` Maxim Levitsky
  2025-02-05 18:23 ` [RFC PATCH 05/13] KVM: x86/mmu: rename __kvm_mmu_invalidate_addr() Yosry Ahmed
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Introduce helpers for updating TLB_CONTROL in the VMCB instead of
directly setting it. Two helpers are introduced:

- svm_add_tlb_ctl_flush(): Combines a new TLB_CONTROL value with the
  existing one.

- svm_clear_tlb_ctl_flush(): Clears the TLB_CONTROL field.

The goal is to prevent overwriting a TLB_CONTROL value with one that
flushes fewer TLB entries. This does not currently happen, as KVM only
sets TLB_CONTROL_FLUSH_ASID when servicing a flush request, and
TLB_CONTROL_FLUSH_ALL_ASID when allocating a new ASID. The latter always
happens after the former, so no unsafe overwrite occurs.

However, future changes may result in subtle bugs where the TLB_CONTROL
field is incorrectly overwritten. The new helpers prevent that.
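
For example, with the helpers introduced below (illustration only, the
numeric values are the architectural TLB_CONTROL encodings):

  svm_add_tlb_ctl_flush(vmcb, TLB_CONTROL_FLUSH_ASID);     /* 0 -> 3 */
  svm_add_tlb_ctl_flush(vmcb, TLB_CONTROL_FLUSH_ALL_ASID); /* min(3, 1) == 1 */
  svm_add_tlb_ctl_flush(vmcb, TLB_CONTROL_FLUSH_ASID);     /* min(1, 3) == 1 */

The last call is a no-op: a more targeted flush request cannot downgrade
a pending TLB_CONTROL_FLUSH_ALL_ASID.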

A separate helper is used for clearing the TLB flush because it is
semantically different: in that case, KVM knowingly ignores the existing
value of TLB_CONTROL. Also, although svm_add_tlb_ctl_flush() would just
work for TLB_CONTROL_DO_NOTHING, the logic would become inconsistent
(use the biggest hammer unless no hammer at all is requested).

Opportunistically move the TLB_CONTROL_* definitions to
arch/x86/kvm/svm/svm.h as they are not used outside of
arch/x86/kvm/svm/.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/include/asm/svm.h |  6 ------
 arch/x86/kvm/svm/nested.c  |  2 +-
 arch/x86/kvm/svm/sev.c     |  2 +-
 arch/x86/kvm/svm/svm.c     |  6 +++---
 arch/x86/kvm/svm/svm.h     | 29 +++++++++++++++++++++++++++++
 5 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 2b59b9951c90e..e6bccf8f90982 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -169,12 +169,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	};
 };
 
-
-#define TLB_CONTROL_DO_NOTHING 0
-#define TLB_CONTROL_FLUSH_ALL_ASID 1
-#define TLB_CONTROL_FLUSH_ASID 3
-#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
-
 #define V_TPR_MASK 0x0f
 
 #define V_IRQ_SHIFT 8
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 2eba36af44f22..0e9b0592c1f83 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -690,7 +690,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 	/* Done at vmrun: asid.  */
 
 	/* Also overwritten later if necessary.  */
-	vmcb02->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+	svm_clear_tlb_ctl_flush(vmcb02);
 
 	/* nested_cr3.  */
 	if (nested_npt_enabled(svm))
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b0adfd0537d00..3af296d6c04f6 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3481,7 +3481,7 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu)
 		return;
 
 	sd->sev_vmcbs[asid] = svm->vmcb;
-	svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
+	svm_add_tlb_ctl_flush(svm->vmcb, TLB_CONTROL_FLUSH_ASID);
 	vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
 }
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2108b48ba4959..a2d601cd4c283 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1985,7 +1985,7 @@ static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
 	if (sd->next_asid > sd->max_asid) {
 		++sd->asid_generation;
 		sd->next_asid = sd->min_asid;
-		svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;
+		svm_add_tlb_ctl_flush(svm->vmcb, TLB_CONTROL_FLUSH_ALL_ASID);
 		vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
 	}
 
@@ -3974,7 +3974,7 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu, struct kvm_vmcb_info *vmcb
 	 * VM-Exit (via kvm_mmu_reset_context()).
 	 */
 	if (static_cpu_has(X86_FEATURE_FLUSHBYASID))
-		vmcb->ptr->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
+		svm_add_tlb_ctl_flush(vmcb->ptr, TLB_CONTROL_FLUSH_ASID);
 	else
 		vmcb->asid_generation--;
 }
@@ -4317,7 +4317,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 		svm->nested.nested_run_pending = 0;
 	}
 
-	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+	svm_clear_tlb_ctl_flush(svm->vmcb);
 	vmcb_mark_all_clean(svm->vmcb);
 
 	/* if exit due to PF check for async PF */
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ebbb0b1a64676..6a73d6ed1e428 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -611,6 +611,35 @@ void svm_set_x2apic_msr_interception(struct vcpu_svm *svm, bool disable);
 void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
 				     int trig_mode, int vec);
 
+#define TLB_CONTROL_DO_NOTHING 0
+#define TLB_CONTROL_FLUSH_ALL_ASID 1
+#define TLB_CONTROL_FLUSH_ASID 3
+#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
+
+/*
+ * Clearing TLB flushes is done separately because combining
+ * TLB_CONTROL_DO_NOTHING with others is counter-intuitive.
+ */
+static inline void svm_add_tlb_ctl_flush(struct vmcb *vmcb, u8 tlb_ctl)
+{
+	if (WARN_ON_ONCE(tlb_ctl == TLB_CONTROL_DO_NOTHING))
+		return;
+
+	/*
+	 * Apply the least targeted (most inclusive) TLB flush. Apart from
+	 * TLB_CONTROL_DO_NOTHING, lower values of tlb_ctl are less targeted.
+	 */
+	if (vmcb->control.tlb_ctl == TLB_CONTROL_DO_NOTHING)
+		vmcb->control.tlb_ctl = tlb_ctl;
+	else
+		vmcb->control.tlb_ctl = min(vmcb->control.tlb_ctl, tlb_ctl);
+}
+
+static inline void svm_clear_tlb_ctl_flush(struct vmcb *vmcb)
+{
+	vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+}
+
 /* nested.c */
 
 #define NESTED_EXIT_HOST	0	/* Exit handled on host level */
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 05/13] KVM: x86/mmu: rename __kvm_mmu_invalidate_addr()
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (3 preceding siblings ...)
  2025-02-05 18:23 ` [RFC PATCH 04/13] KVM: SVM: Introduce helpers for updating TLB_CONTROL Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-02-05 18:23 ` [RFC PATCH 06/13] KVM: x86/mmu: Allow skipping the gva flush in kvm_mmu_invalidate_addr() Yosry Ahmed
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

In preparation for creating another helper for
kvm_mmu_invalidate_addr(), rename __kvm_mmu_invalidate_addr() to
kvm_mmu_invalidate_addr_in_root().

No functional change intended.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/mmu/mmu.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3ff55a18347da..b93f560a2c0e8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6120,8 +6120,9 @@ void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_print_sptes);
 
-static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-				      u64 addr, hpa_t root_hpa)
+static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
+					    struct kvm_mmu *mmu,
+					    u64 addr, hpa_t root_hpa)
 {
 	struct kvm_shadow_walk_iterator iterator;
 
@@ -6177,11 +6178,11 @@ void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		return;
 
 	if (roots & KVM_MMU_ROOT_CURRENT)
-		__kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->root.hpa);
+		kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->root.hpa);
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
 		if (roots & KVM_MMU_ROOT_PREVIOUS(i))
-			__kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
+			kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
 	}
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_invalidate_addr);
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 06/13] KVM: x86/mmu: Allow skipping the gva flush in kvm_mmu_invalidate_addr()
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (4 preceding siblings ...)
  2025-02-05 18:23 ` [RFC PATCH 05/13] KVM: x86/mmu: rename __kvm_mmu_invalidate_addr() Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-02-05 18:23 ` [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly Yosry Ahmed
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Refactor a helper out of kvm_mmu_invalidate_addr() that allows skipping
the gva flush. This will be used when an invalidation is needed but the
GVA TLB translations that require invalidation are not of the current
context (e.g. when emulating INVLPGA for L1 to flush L2's translations).
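
For instance, the caller added in patch 7 passes gva_flush=false to
invalidate L2's mappings in all roots without flushing the current
context's TLB for the address:

  __kvm_mmu_invalidate_addr(vcpu, &vcpu->arch.root_mmu, gva,
                            KVM_MMU_ROOTS_ALL, false);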

No functional change intended.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/mmu/mmu.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b93f560a2c0e8..ac133abc9c173 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6158,15 +6158,15 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
 	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
-void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-			     u64 addr, unsigned long roots)
+static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+				      u64 addr, unsigned long roots, bool gva_flush)
 {
 	int i;
 
 	WARN_ON_ONCE(roots & ~KVM_MMU_ROOTS_ALL);
 
 	/* It's actually a GPA for vcpu->arch.guest_mmu.  */
-	if (mmu != &vcpu->arch.guest_mmu) {
+	if (gva_flush && mmu != &vcpu->arch.guest_mmu) {
 		/* INVLPG on a non-canonical address is a NOP according to the SDM.  */
 		if (is_noncanonical_invlpg_address(addr, vcpu))
 			return;
@@ -6185,6 +6185,12 @@ void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
 	}
 }
+
+void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+			     u64 addr, unsigned long roots)
+{
+	__kvm_mmu_invalidate_addr(vcpu, mmu, addr, roots, true);
+}
 EXPORT_SYMBOL_GPL(kvm_mmu_invalidate_addr);
 
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (5 preceding siblings ...)
  2025-02-05 18:23 ` [RFC PATCH 06/13] KVM: x86/mmu: Allow skipping the gva flush in kvm_mmu_invalidate_addr() Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-03-01  1:55   ` Maxim Levitsky
  2025-02-05 18:23 ` [RFC PATCH 08/13] KVM: nSVM: Flush both L1 and L2 ASIDs on KVM_REQ_TLB_FLUSH Yosry Ahmed
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Currently, INVLPGA interception is handled like INVLPG, which flushes
L1's TLB translations for the given address. It was implemented this way
because L1 and L2 shared an ASID. Now that L1 and L2 have separate
ASIDs, flushing L1's translations is still harmless, but it is only
correct because all translations are flushed on nested transitions
anyway.

In preparation for stopping the unconditional flushes on nested
transitions, handle INVLPGA interception properly. If L1 specified zero
as the ASID, this is equivalent to INVLPG, so handle it as such.
Otherwise, use INVLPGA to flush the translations of the appropriate ASID
tracked by KVM, if any. Sync the shadow MMU as well, as L1 invalidated
L2's mappings.

Opportunistically update svm_flush_tlb_gva() to use
svm->current_vmcb->asid instead of svm->vmcb->control.asid for
consistency. The two should always be in sync except when KVM allocates
a new ASID in pre_svm_run(), and they are shortly brought back in sync
in svm_vcpu_run(). However, if future changes add more code paths where
KVM allocates a new ASID, flushing the potentially old ASID in
svm->vmcb->control.asid would be unnecessary overhead (although probably
not much different from flushing the newly allocated ASID).

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu/mmu.c          |  5 +++--
 arch/x86/kvm/svm/svm.c          | 40 ++++++++++++++++++++++++++++++---
 3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5193c3dfbce15..1e147bb2e560f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2213,6 +2213,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 		       void *insn, int insn_len);
 void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
+void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+			       u64 addr, unsigned long roots, bool gva_flush);
 void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			     u64 addr, unsigned long roots);
 void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ac133abc9c173..f5e0d2c8f4bbe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6158,8 +6158,8 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
 	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
-static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-				      u64 addr, unsigned long roots, bool gva_flush)
+void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+			       u64 addr, unsigned long roots, bool gva_flush)
 {
 	int i;
 
@@ -6185,6 +6185,7 @@ static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
 			kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
 	}
 }
+EXPORT_SYMBOL_GPL(__kvm_mmu_invalidate_addr);
 
 void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			     u64 addr, unsigned long roots)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a2d601cd4c283..9e29f87d3bd93 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2483,6 +2483,7 @@ static int clgi_interception(struct kvm_vcpu *vcpu)
 
 static int invlpga_interception(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_svm *svm = to_svm(vcpu);
 	gva_t gva = kvm_rax_read(vcpu);
 	u32 asid = kvm_rcx_read(vcpu);
 
@@ -2492,8 +2493,41 @@ static int invlpga_interception(struct kvm_vcpu *vcpu)
 
 	trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
 
-	/* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
-	kvm_mmu_invlpg(vcpu, gva);
+	/*
+	 * APM is silent about using INVLPGA to flush the host ASID (i.e. 0).
+	 * Do the logical thing and handle it like INVLPG.
+	 */
+	if (asid == 0) {
+		kvm_mmu_invlpg(vcpu, gva);
+		return kvm_skip_emulated_instruction(vcpu);
+	}
+
+	/*
+	 * Check if L1 specified the L2 ASID we are currently tracking. If it
+	 * isn't, do nothing as we have to handle the TLB flush when switching
+	 * to the new ASID anyway. APM mentions that INVLPGA is typically only
+	 * meaningful with shadow paging, so also do nothing if L1 is using
+	 * nested NPT.
+	 */
+	if (!nested_npt_enabled(svm) && asid == svm->nested.last_asid)
+		invlpga(gva, svm->nested.vmcb02.asid);
+
+	/*
+	 * If NPT is disabled, sync the shadow page tables as L1 is invalidating
+	 * mappings for L2. Sync all roots as ASIDs are not tracked in the MMU
+	 * role.
+	 *
+	 * As we are not flushing the current context, skip the gva flush from
+	 * __kvm_mmu_invalidate_addr(), it would flush the wrong ASID anyway.
+	 * The correct TLB flush was done above (if needed).
+	 *
+	 * This always operates on root_mmu because L1 and L2 share an MMU when
+	 * NPT is disabled. This can be optimized by invalidating guest roots
+	 * only.
+	 */
+	if (!npt_enabled)
+		__kvm_mmu_invalidate_addr(vcpu, &vcpu->arch.root_mmu, gva,
+					  KVM_MMU_ROOTS_ALL, false);
 
 	return kvm_skip_emulated_instruction(vcpu);
 }
@@ -4017,7 +4051,7 @@ static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	invlpga(gva, svm->vmcb->control.asid);
+	invlpga(gva, svm->current_vmcb->asid);
 }
 
 static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 08/13] KVM: nSVM: Flush both L1 and L2 ASIDs on KVM_REQ_TLB_FLUSH
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (6 preceding siblings ...)
  2025-02-05 18:23 ` [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-03-01  1:58   ` Maxim Levitsky
  2025-02-05 18:23 ` [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL Yosry Ahmed
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

KVM_REQ_TLB_FLUSH is used to flush all TLB entries for all contexts
(e.g. in kvm_flush_remote_tlbs()). Flush both L1 and L2 ASIDs in
svm_flush_tlb_all() to handle it appropriately.

This is currently not required as nested transitions do unconditional
TLB flushes, but this is a step toward eliminating that.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 1 -
 arch/x86/kvm/svm/svm.c    | 4 +++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 0e9b0592c1f83..0735177b95a1d 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -491,7 +491,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
 	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
 	 * things to fix before this can be conditional:
 	 *
-	 *  - Flush TLBs for both L1 and L2 remote TLB flush
 	 *  - Honor L1's request to flush an ASID on nested VMRUN
 	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
 	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9e29f87d3bd93..8342c7eadbba8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4044,7 +4044,9 @@ static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
 	if (WARN_ON_ONCE(svm_hv_is_enlightened_tlb_enabled(vcpu)))
 		hv_flush_remote_tlbs(vcpu->kvm);
 
-	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
+	svm_flush_tlb_asid(vcpu, &svm->vmcb01);
+	if (svm->nested.initialized)
+		svm_flush_tlb_asid(vcpu, &svm->nested.vmcb02);
 }
 
 static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (7 preceding siblings ...)
  2025-02-05 18:23 ` [RFC PATCH 08/13] KVM: nSVM: Flush both L1 and L2 ASIDs on KVM_REQ_TLB_FLUSH Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-02-05 21:45   ` Yosry Ahmed
  2025-03-01  2:06   ` Maxim Levitsky
  2025-02-05 18:23 ` [RFC PATCH 10/13] KVM: nSVM: Flush the TLB if L1 changes L2's ASID Yosry Ahmed
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Handle L1's requests to flush L2's TLB through the TLB_CONTROL field of
VMCB12. This is currently redundant because a full flush is executed on
every nested transition, but is a step towards removing that.

TLB_CONTROL_FLUSH_ALL_ASID flushes all ASIDs from L1's perspective,
including its own, so do a guest TLB flush on both transitions. Never
propagate TLB_CONTROL_FLUSH_ALL_ASID from the guest to the actual VMCB,
as doing so would give the guest the power to flush the entire physical
TLB (including translations for the host and other VMs).

For other cases, the TLB flush is only done when entering L2. The
nested NPT MMU is also synced, because TLB_CONTROL flushes NPT
guest-physical mappings as well.

All TLB_CONTROL values can be handled by KVM regardless of FLUSHBYASID
support on the underlying CPU, so keep advertising FLUSHBYASID to the
guest unconditionally.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 42 ++++++++++++++++++++++++++++++++-------
 arch/x86/kvm/svm/svm.c    |  6 +++---
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 0735177b95a1d..e2c59eb2907e8 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -484,19 +484,36 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
 
 static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_svm *svm = to_svm(vcpu);
+
 	/* Handle pending Hyper-V TLB flush requests */
 	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
 
+	/*
+	 * If L1 requested a TLB flush for L2, flush L2's TLB on nested entry
+	 * and sync the nested NPT MMU, as TLB_CONTROL also flushes NPT
+	 * guest-physical mappings. We technically only need to flush guest_mode
+	 * page tables.
+	 *
+	 * KVM_REQ_TLB_FLUSH_GUEST will flush L2's ASID even if the underlying
+	 * CPU does not support FLUSHBYASID (by assigning a new ASID), so we
+	 * can handle all TLB_CONTROL values from L1 regardless.
+	 *
+	 * Note that TLB_CONTROL_FLUSH_ASID_LOCAL is handled exactly like
+	 * TLB_CONTROL_FLUSH_ASID. We can technically flush less TLB entries,
+	 * but this would require significantly more complexity.
+	 */
+	if (svm->nested.ctl.tlb_ctl != TLB_CONTROL_DO_NOTHING) {
+		if (nested_npt_enabled(svm))
+			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+	}
+
 	/*
 	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
 	 * things to fix before this can be conditional:
 	 *
-	 *  - Honor L1's request to flush an ASID on nested VMRUN
-	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
 	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
-	 *
-	 * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
-	 *     NPT guest-physical mappings on VMRUN.
 	 */
 	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
 	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
@@ -504,9 +521,18 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
 
 static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_svm *svm = to_svm(vcpu);
+
 	/* Handle pending Hyper-V TLB flush requests */
 	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
 
+	/*
+	 * If L1 had requested a full TLB flush when entering L2, also flush its
+	 * TLB entries when exiting back to L1.
+	 */
+	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+
 	/* See nested_svm_entry_tlb_flush() */
 	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
 	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
@@ -825,7 +851,8 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
 	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
 
 	svm_switch_vmcb(svm, &svm->nested.vmcb02);
-	nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
+	nested_vmcb02_prepare_control(svm, vmcb12->save.rip,
+				      vmcb12->save.cs.base);
 	nested_vmcb02_prepare_save(svm, vmcb12);
 
 	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
@@ -1764,7 +1791,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	nested_copy_vmcb_control_to_cache(svm, ctl);
 
 	svm_switch_vmcb(svm, &svm->nested.vmcb02);
-	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base);
+	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip,
+				      svm->vmcb->save.cs.base);
 
 	/*
 	 * While the nested guest CR3 is already checked and set by
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8342c7eadbba8..5e7b1c9bfa605 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5242,9 +5242,9 @@ static __init void svm_set_cpu_caps(void)
 		kvm_cpu_cap_set(X86_FEATURE_VMCBCLEAN);
 
 		/*
-		 * KVM currently flushes TLBs on *every* nested SVM transition,
-		 * and so for all intents and purposes KVM supports flushing by
-		 * ASID, i.e. KVM is guaranteed to honor every L1 ASID flush.
+		 * KVM currently handles all TLB_CONTROL values set by L1, even
+		 * if the underlying CPU does not. See
+		 * nested_svm_transition_tlb_flush().
 		 */
 		kvm_cpu_cap_set(X86_FEATURE_FLUSHBYASID);
 
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 10/13] KVM: nSVM: Flush the TLB if L1 changes L2's ASID
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (8 preceding siblings ...)
  2025-02-05 18:23 ` [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL Yosry Ahmed
@ 2025-02-05 18:23 ` Yosry Ahmed
  2025-03-01  2:13   ` Maxim Levitsky
  2025-02-05 18:24 ` [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry Yosry Ahmed
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

KVM tracks a single ASID for L2 guests. L1 could change the ASID it has
assigned to L2, either because it switched to a different L2 guest or
simply to avoid flushing L2's existing ASID. Flush L2's TLB when this
happens to avoid reusing TLB entries from the old ASID (from L1's
perspective).

Remove the comment in __nested_copy_vmcb_control_to_cache() about the
cached ASID usage, as this change makes it stale by adding another
usage.

This is heavily inspired by nVMX's handling of last_vpid.
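
For comparison, the nVMX logic being mirrored is roughly the following
(paraphrased from nested_vmx_transition_tlb_flush()):

  if (is_vmenter &&
      vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
          vmx->nested.last_vpid = vmcs12->virtual_processor_id;
          kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
  }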

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 5 ++++-
 arch/x86/kvm/svm/svm.h    | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index e2c59eb2907e8..12bb391884299 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -368,7 +368,6 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
 	to->pause_filter_count  = from->pause_filter_count;
 	to->pause_filter_thresh = from->pause_filter_thresh;
 
-	/* Copy asid here because nested_vmcb_check_controls will check it.  */
 	to->asid           = from->asid;
 	to->msrpm_base_pa &= ~0x0fffULL;
 	to->iopm_base_pa  &= ~0x0fffULL;
@@ -509,6 +508,10 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 	}
 
+	if (svm->nested.ctl.asid != svm->nested.last_asid) {
+		svm->nested.last_asid = svm->nested.ctl.asid;
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+	}
 	/*
 	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
 	 * things to fix before this can be conditional:
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 6a73d6ed1e428..f2352135b99d3 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -211,6 +211,8 @@ struct svm_nested_state {
 	 * on its side.
 	 */
 	bool force_msr_bitmap_recalc;
+
+	u32 last_asid;
 };
 
 struct vcpu_sev_es_state {
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (9 preceding siblings ...)
  2025-02-05 18:23 ` [RFC PATCH 10/13] KVM: nSVM: Flush the TLB if L1 changes L2's ASID Yosry Ahmed
@ 2025-02-05 18:24 ` Yosry Ahmed
  2025-03-01  2:17   ` Maxim Levitsky
  2025-02-05 18:24 ` [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions Yosry Ahmed
  2025-02-05 18:24 ` [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on " Yosry Ahmed
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:24 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

TLB_CONTROL is reset to TLB_CONTROL_DO_NOTHING on nested transitions to
L2. This is unnecessary because it should always be
TLB_CONTROL_DO_NOTHING at this point.

The flow for setting TLB_CONTROL is as follows:
1. In vcpu_enter_guest(), servicing a TLB flush request may set it to
   TLB_CONTROL_FLUSH_ASID in svm_flush_tlb_asid().
2. In svm_vcpu_run() -> pre_svm_run(), it may get upgraded to
   TLB_CONTROL_FLUSH_ALL_ASID when assigning a new ASID.
3. In svm_vcpu_run(), it gets reset to TLB_CONTROL_DO_NOTHING after the
   guest is run.

Hence, TLB_CONTROL is reset after each run and there is no need to do it
again on every nested transition to L2.
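
Condensed into a sketch using the helpers from patch 4 (not contiguous
code):

  /* (1) vcpu_enter_guest() -> svm_flush_tlb_asid() */
  svm_add_tlb_ctl_flush(vmcb->ptr, TLB_CONTROL_FLUSH_ASID);
  /* (2) pre_svm_run() -> new_asid(), when the ASID space is exhausted */
  svm_add_tlb_ctl_flush(svm->vmcb, TLB_CONTROL_FLUSH_ALL_ASID);
  /* (3) svm_vcpu_run(), after the guest has run */
  svm_clear_tlb_ctl_flush(svm->vmcb);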

There is a TODO in nested_svm_transition_tlb_flush() about this reset
crushing pending TLB flushes. Remove it, as the reset is not really
crushing anything as explained above.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 12bb391884299..8e40ff21f7353 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -512,12 +512,7 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
 		svm->nested.last_asid = svm->nested.ctl.asid;
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 	}
-	/*
-	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
-	 * things to fix before this can be conditional:
-	 *
-	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
-	 */
+	/* TODO: optimize unconditional TLB flush/MMU sync */
 	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
 	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 }
@@ -536,7 +531,7 @@ static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
 	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 
-	/* See nested_svm_entry_tlb_flush() */
+	/* TODO: optimize unconditional TLB flush/MMU sync */
 	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
 	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 }
@@ -717,9 +712,6 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 
 	/* Done at vmrun: asid.  */
 
-	/* Also overwritten later if necessary.  */
-	svm_clear_tlb_ctl_flush(vmcb02);
-
 	/* nested_cr3.  */
 	if (nested_npt_enabled(svm))
 		nested_svm_init_mmu_context(vcpu);
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (10 preceding siblings ...)
  2025-02-05 18:24 ` [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry Yosry Ahmed
@ 2025-02-05 18:24 ` Yosry Ahmed
  2025-03-01  2:20   ` Maxim Levitsky
  2025-02-05 18:24 ` [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on " Yosry Ahmed
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:24 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

KVM does not track TLB flush requests for L1 vs. L2. Hence, service
local flushes that target the current context before switching to a new
one. Since ASIDs are tracked per-VMCB, service the flushes before every
VMCB switch.

This is conceptually similar to how nVMX calls
kvm_service_local_tlb_flush_requests() in
nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(), with the
following differences:

1. nVMX tracks the current VPID based on is_guest_mode(), so local TLB
   flushes are serviced before enter_guest_mode() and
   leave_guest_mode(). On the other hand, nSVM tracks the current ASID
   based on the current VMCB, so the TLB flushes are serviced before a
   VMCB switch.

2. nVMX only enters and leaves guest mode in
   nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(). Other paths
   like vmx_set_nested_state() and vmx_leave_nested() call into these
   two functions. On the other hand, nSVM open codes the switch in
   functions like svm_set_nested_state() and svm_leave_nested(), so
   servicing the flush in svm_switch_vmcb() is probably the most
   reliable option.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/svm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5e7b1c9bfa605..6daa7efa9262b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1421,6 +1421,12 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 
 void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
 {
+	/*
+	 * ASIDs are tracked per-VMCB. Perform any pending TLB flushes for the
+	 * current VMCB before switching to a new one.
+	 */
+	kvm_service_local_tlb_flush_requests(&svm->vcpu);
+
 	svm->current_vmcb = target_vmcb;
 	svm->vmcb = target_vmcb->ptr;
 }
-- 
2.48.1.362.g079036d154-goog



* [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on nested transitions
  2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
                   ` (11 preceding siblings ...)
  2025-02-05 18:24 ` [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions Yosry Ahmed
@ 2025-02-05 18:24 ` Yosry Ahmed
  2025-03-01  2:21   ` Maxim Levitsky
  12 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 18:24 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Now that nested TLB flushes are properly tracked with a well-maintained
separate ASID for L2 and proper handling of L1's TLB flush requests,
drop the unconditional flushes and syncs on nested transitions.

On a Milan machine, an L1 and L2 guests were booted, both with a single
vCPU, and pinned to a single physical CPU to maximize TLB collisions. In
this setup, the cpuid_rate microbenchmark [1] showed the following
changes with this patch:

+--------+--------+-------------------+----------------------+
| L0     | L1     | cpuid_rate (base) | cpuid_rate (patched) |
+========+========+===================+======================+
| NPT    | NPT    | 256621            | 301113 (+17.3%)      |
| NPT    | Shadow | 180017            | 203347 (+12.96%)     |
| Shadow | Shadow | 177006            | 189150 (+6.86%)      |
+--------+--------+-------------------+----------------------+

[1] https://lore.kernel.org/kvm/20231109180646.2963718-1-khorenko@virtuozzo.com/

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 8e40ff21f7353..45a187d4c23d1 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -512,9 +512,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
 		svm->nested.last_asid = svm->nested.ctl.asid;
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 	}
-	/* TODO: optimize unconditional TLB flush/MMU sync */
-	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
-	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 }
 
 static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
@@ -530,10 +527,6 @@ static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
 	 */
 	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
-
-	/* TODO: optimize unconditional TLB flush/MMU sync */
-	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
-	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 }
 
 /*
-- 
2.48.1.362.g079036d154-goog



* Re: [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL
  2025-02-05 18:23 ` [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL Yosry Ahmed
@ 2025-02-05 21:45   ` Yosry Ahmed
  2025-03-01  2:06   ` Maxim Levitsky
  1 sibling, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-02-05 21:45 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, Feb 05, 2025 at 06:23:58PM +0000, Yosry Ahmed wrote:
> Handle L1's requests to flush L2's TLB through the TLB_CONTROL field of
> VMCB12. This is currently redundant because a full flush is executed on
> every nested transition, but is a step towards removing that.
> 
> TLB_CONTROL_FLUSH_ALL_ASID flushes all ASIDs from L1's perspective,
> including its own, so do a guest TLB flush on both transitions. Never
> propagate TLB_CONTROL_FLUSH_ALL_ASID from the guest to the actual VMCB,
> as this gives the guest the power to flush the entire physical TLB
> (including translations for the host and other VMs).
> 
> For other cases, the TLB flush is only done when entering L2. The nested
> NPT MMU is also sync'd because TLB_CONTROL also flushes NPT
> guest-physical mappings.
> 
> All TLB_CONTROL values can be handled by KVM regardless of FLUSHBYASID
> support on the underlying CPU, so keep advertising FLUSHBYASID to the
> guest unconditionally.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 42 ++++++++++++++++++++++++++++++++-------
>  arch/x86/kvm/svm/svm.c    |  6 +++---
>  2 files changed, 38 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 0735177b95a1d..e2c59eb2907e8 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -484,19 +484,36 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
>  
>  static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
>  	/* Handle pending Hyper-V TLB flush requests */
>  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
>  
> +	/*
> +	 * If L1 requested a TLB flush for L2, flush L2's TLB on nested entry
> +	 * and sync the nested NPT MMU, as TLB_CONTROL also flushes NPT
> +	 * guest-physical mappings. We technically only need to flush guest_mode
> +	 * page tables.
> +	 *
> +	 * KVM_REQ_TLB_FLUSH_GUEST will flush L2's ASID even if the underlying
> +	 * CPU does not support FLUSHBYASID (by assigning a new ASID), so we
> +	 * can handle all TLB_CONTROL values from L1 regardless.
> +	 *
> +	 * Note that TLB_CONTROL_FLUSH_ASID_LOCAL is handled exactly like
> +	 * TLB_CONTROL_FLUSH_ASID. We can technically flush less TLB entries,
> +	 * but this would require significantly more complexity.
> +	 */
> +	if (svm->nested.ctl.tlb_ctl != TLB_CONTROL_DO_NOTHING) {
> +		if (nested_npt_enabled(svm))
> +			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> +	}
> +
>  	/*
>  	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
>  	 * things to fix before this can be conditional:
>  	 *
> -	 *  - Honor L1's request to flush an ASID on nested VMRUN
> -	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
>  	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> -	 *
> -	 * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
> -	 *     NPT guest-physical mappings on VMRUN.
>  	 */
>  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> @@ -504,9 +521,18 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  
>  static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
>  	/* Handle pending Hyper-V TLB flush requests */
>  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
>  
> +	/*
> +	 * If L1 had requested a full TLB flush when entering L2, also flush its
> +	 * TLB entries when exiting back to L1.
> +	 */
> +	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
> +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> +
>  	/* See nested_svm_entry_tlb_flush() */
>  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> @@ -825,7 +851,8 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
>  	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>  
>  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> -	nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
> +	nested_vmcb02_prepare_control(svm, vmcb12->save.rip,
> +				      vmcb12->save.cs.base);

These unrelated changes were unintentional, please ignore them.


* Re: [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB
  2025-02-05 18:23 ` [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB Yosry Ahmed
@ 2025-03-01  0:03   ` Sean Christopherson
  2025-03-03 17:51     ` Jim Mattson
  2025-03-03 19:18     ` Yosry Ahmed
  2025-03-01  1:23   ` Maxim Levitsky
  1 sibling, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2025-03-01  0:03 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson

+Jim, for his input on VPIDs.

On Wed, Feb 05, 2025, Yosry Ahmed wrote:
> The ASID is currently tracked per-vCPU, because the same ASID is used by
> L1 and L2. That ASID is flushed on every transition between L1 and L2.
> 
> Track the ASID separately for each VMCB (similar to the
> asid_generation), giving L2 a separate ASID. This is in preparation for
> doing fine-grained TLB flushes on nested transitions instead of
> unconditional full flushes.

After having some time to think about this, rather than track ASIDs per VMCB, I
think we should converge on a single approach for nVMX (VPID) and nSVM (ASID).

Per **VM**, one VPID/ASID for L1, and one VPID/ASID for L2.

For SVM, the dynamic ASID crud is a holdover from KVM's support for CPUs that
don't support FLUSHBYASID, i.e. needed to purge the entire TLB in order to flush
guest mappings.  FLUSHBYASID was added in 2010, and AFAIK has been supported by
all AMD CPUs since.

KVM already mostly keeps the same ASID, except for when a vCPU is migrated, in
which case KVM assigns a new ASID.  I suspect that following VMX's lead and
simply doing a TLB flush in this situation would be an improvement for modern
CPUs, as it would flush the entries that need to be flushed, and not pollute the
TLBs with stale, unused entries.

Using a static per-VM ASID would also allow using broadcast invalidations[*],
would simplify the SVM code base, and I think/hope would allow us to move much
of the TLB flushing logic, e.g. for task migration, to common code.
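
Roughly what I have in mind (completely untested, the exact fields and names
are made up):

	struct kvm_svm {
		struct kvm kvm;

		/* Static ASIDs, allocated once at VM creation. */
		u32 asid;		/* for running L1 */
		u32 nested_asid;	/* for running L2 */
	};

and on task migration, instead of grabbing a new ASID:

	/*
	 * In pre_svm_run() or similar: on pCPU migration, flush the current
	 * context instead of assigning a new ASID (like nVMX does today).
	 */
	if (unlikely(svm->current_vmcb->cpu != cpu))
		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, &svm->vcpu);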

For VPIDs, maybe it's because it's Friday afternoon, but for the life of me I
can't think of any reason why KVM needs to assign VPIDs per vCPU.  Especially
since KVM is ridiculously conservative and flushes _all_ EPT/VPID contexts when
running a different vCPU on a pCPU (which I suspect we can trim down?).

Am I forgetting something?

[*] https://lore.kernel.org/all/Z8HdBg3wj8M7a4ts@google.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB
  2025-02-05 18:23 ` [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB Yosry Ahmed
  2025-03-01  0:03   ` Sean Christopherson
@ 2025-03-01  1:23   ` Maxim Levitsky
  2025-03-03 19:31     ` Yosry Ahmed
  1 sibling, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  1:23 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> The ASID is currently tracked per-vCPU, because the same ASID is used by
> L1 and L2. That ASID is flushed on every transition between L1 and L2.
> 
> Track the ASID separately for each VMCB (similar to the
> asid_generation), giving L2 a separate ASID. This is in preparation for
> doing fine-grained TLB flushes on nested transitions instead of
> unconditional full flushes.
> 
> The ASIDs are still not fully maintained (e.g. a remote flush will only
> flush the current ASID), so keep the TLB flush on every transition until
> this is sorted out.
> 
> L1's ASID will be flushed on KVM_REQ_TLB_FLUSH_GUEST if it is the
> active context, so remove the TODO in nested_svm_transition_tlb_flush()
> about it.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c |  1 -
>  arch/x86/kvm/svm/sev.c    |  2 +-
>  arch/x86/kvm/svm/svm.c    | 12 +++++++-----
>  arch/x86/kvm/svm/svm.h    |  2 +-
>  4 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 04c375bf1ac2a..bbe4f3ac9f250 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -495,7 +495,6 @@ static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
>  	 *  - Honor L1's request to flush an ASID on nested VMRUN
>  	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
>  	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> -	 *  - Flush L1's ASID on KVM_REQ_TLB_FLUSH_GUEST
>  	 *
>  	 * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
>  	 *     NPT guest-physical mappings on VMRUN.
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 799f8494b599c..b0adfd0537d00 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3468,7 +3468,7 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu)
>  	unsigned int asid = sev_get_asid(svm->vcpu.kvm);
>  
>  	/* Assign the asid allocated with this SEV guest */
> -	svm->asid = asid;
> +	svm->current_vmcb->asid = asid;
>  
>  	/*
>  	 * Flush guest TLB:
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a6..08340ae57777b 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1335,8 +1335,10 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>  		save->g_pat = vcpu->arch.pat;
>  		save->cr3 = 0;
>  	}
> -	svm->current_vmcb->asid_generation = 0;
> -	svm->asid = 0;
> +	svm->vmcb01.asid_generation = 0;
> +	svm->vmcb01.asid = 0;
> +	svm->nested.vmcb02.asid_generation = 0;
> +	svm->nested.vmcb02.asid = 0;
>  
>  	svm->nested.vmcb12_gpa = INVALID_GPA;
>  	svm->nested.last_vmcb12_gpa = INVALID_GPA;
> @@ -1988,7 +1990,7 @@ static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
>  	}
>  
>  	svm->current_vmcb->asid_generation = sd->asid_generation;
> -	svm->asid = sd->next_asid++;
> +	svm->current_vmcb->asid = sd->next_asid++;
>  }
>  
>  static void svm_set_dr6(struct vcpu_svm *svm, unsigned long value)
> @@ -4235,8 +4237,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>  
>  	sync_lapic_to_cr8(vcpu);
>  
> -	if (unlikely(svm->asid != svm->vmcb->control.asid)) {
> -		svm->vmcb->control.asid = svm->asid;
> +	if (unlikely(svm->current_vmcb->asid != svm->vmcb->control.asid)) {
> +		svm->vmcb->control.asid = svm->current_vmcb->asid;
>  		vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
>  	}
>  	svm->vmcb->save.cr2 = vcpu->arch.cr2;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 9d7cdb8fbf872..ebbb0b1a64676 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -133,6 +133,7 @@ struct kvm_vmcb_info {
>  	unsigned long pa;
>  	int cpu;
>  	uint64_t asid_generation;
> +	u32 asid;
>  };
>  
>  struct vmcb_save_area_cached {
> @@ -247,7 +248,6 @@ struct vcpu_svm {
>  	struct vmcb *vmcb;
>  	struct kvm_vmcb_info vmcb01;
>  	struct kvm_vmcb_info *current_vmcb;
> -	u32 asid;
>  	u32 sysenter_esp_hi;
>  	u32 sysenter_eip_hi;
>  	uint64_t tsc_aux;

Hi,


I think it should be possible to eliminate the separate ASID field
(current_vmcb->asid/svm->asid) completely and instead just use the value
stored in the vmcb.

When there is a need to update it, KVM can also set the corresponding dirty
bit, as is done in svm_vcpu_run() (new_asid() already does this when the
ASID generation increases).

Also, KVM already sets tlb_ctl directly in the vmcb.
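
E.g. something like this (untested, the helper name is made up):

	static inline void svm_assign_asid(struct vcpu_svm *svm, u32 asid)
	{
		if (svm->vmcb->control.asid == asid)
			return;

		svm->vmcb->control.asid = asid;
		vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
	}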

What do you think?

Best regards,
	Maxim Levitsky






^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 02/13] KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB
  2025-02-05 18:23 ` [RFC PATCH 02/13] KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB Yosry Ahmed
@ 2025-03-01  1:29   ` Maxim Levitsky
  2025-03-03 21:58     ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  1:29 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov
  Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> svm_flush_tlb_asid() currently operates on the current VMCB. In
> preparation for properly tracking TLB flushes for L1 and L2 ASIDs,
> refactor it to work on a given VMCB. All existing callers pass the
> current VMCB.
> 
> Create a svm_flush_tlb_guest() wrapper to use as the flush_tlb_guest()
> callback.
> 
> kvm_hv_vcpu_purge_flush_tlb() is only called when the current VMCB is
> passed to maintain current behavior.
> 
> No functional change intended.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/svm.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 08340ae57777b..2108b48ba4959 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3954,7 +3954,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
>  	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
>  }
>  
> -static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
> +static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu, struct kvm_vmcb_info *vmcb)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  
> @@ -3963,7 +3963,8 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
>  	 * A TLB flush for the current ASID flushes both "host" and "guest" TLB
>  	 * entries, and thus is a superset of Hyper-V's fine grained flushing.
>  	 */
> -	kvm_hv_vcpu_purge_flush_tlb(vcpu);
> +	if (vmcb == svm->current_vmcb)
> +		kvm_hv_vcpu_purge_flush_tlb(vcpu);

This is a Hyper-V PV feature that should be looked at very carefully.

To recap, each vCPU has two queues of pending TLB flush requests that target
only a small range of memory pages.

One is for L1 and one for L2, because KVM now supports a mode where L2 can
ask L0 to do a TLB flush on its behalf, and KVM will figure out which L1
vCPUs to send this flush request to.

Requests arrive from other vCPUs.

Here we purge the TLB request queue because we flushed a superset of the
requests, which used to cover both the L1 and L2 TLBs, but soon that won't
be true.

So I think it might make sense to also add a vmcb parameter to
kvm_hv_vcpu_purge_flush_tlb(), and then purge the correct queue depending on
whether it is vmcb01 or vmcb02.
I don't know if this is a theoretical or an actual bug, but it is better to
be safe IMHO.
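
Roughly (untested, and assuming the flush fifo helpers keep their current
signatures):

	static inline void kvm_hv_vcpu_purge_flush_tlb(struct kvm_vcpu *vcpu,
						       bool guest_mode)
	{
		struct kvm_vcpu_hv_tlb_flush_fifo *tlb_flush_fifo;

		if (!to_hv_vcpu(vcpu))
			return;

		/* Purge only the queue the caller asked for. */
		tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(vcpu, guest_mode);
		kfifo_reset_out(&tlb_flush_fifo->entries);
	}

with svm_flush_tlb_asid() passing vmcb == &svm->nested.vmcb02 for guest_mode.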


>  
>  	/*
>  	 * Flush only the current ASID even if the TLB flush was invoked via
> @@ -3973,14 +3974,15 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
>  	 * VM-Exit (via kvm_mmu_reset_context()).
>  	 */
>  	if (static_cpu_has(X86_FEATURE_FLUSHBYASID))
> -		svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
> +		vmcb->ptr->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
>  	else
> -		svm->current_vmcb->asid_generation--;
> +		vmcb->asid_generation--;
>  }
>  
>  static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
>  {
>  	hpa_t root_tdp = vcpu->arch.mmu->root.hpa;
> +	struct vcpu_svm *svm = to_svm(vcpu);
>  
>  	/*
>  	 * When running on Hyper-V with EnlightenedNptTlb enabled, explicitly
> @@ -3991,11 +3993,13 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
>  	if (svm_hv_is_enlightened_tlb_enabled(vcpu) && VALID_PAGE(root_tdp))
>  		hyperv_flush_guest_mapping(root_tdp);
>  
> -	svm_flush_tlb_asid(vcpu);
> +	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
>  }
>  
>  static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
>  	/*
>  	 * When running on Hyper-V with EnlightenedNptTlb enabled, remote TLB
>  	 * flushes should be routed to hv_flush_remote_tlbs() without requesting
> @@ -4006,7 +4010,7 @@ static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
>  	if (WARN_ON_ONCE(svm_hv_is_enlightened_tlb_enabled(vcpu)))
>  		hv_flush_remote_tlbs(vcpu->kvm);
>  
> -	svm_flush_tlb_asid(vcpu);
> +	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
>  }
>  
>  static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
> @@ -4016,6 +4020,13 @@ static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
>  	invlpga(gva, svm->vmcb->control.asid);
>  }
>  
> +static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
> +}
> +

Small nitpick: I think that this change is still unrelated and thus
probably should go to a separate patch.


Best regards,
	Maxim Levitsky

>  static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -5055,7 +5066,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.flush_tlb_all = svm_flush_tlb_all,
>  	.flush_tlb_current = svm_flush_tlb_current,
>  	.flush_tlb_gva = svm_flush_tlb_gva,
> -	.flush_tlb_guest = svm_flush_tlb_asid,
> +	.flush_tlb_guest = svm_flush_tlb_guest,
>  
>  	.vcpu_pre_run = svm_vcpu_pre_run,
>  	.vcpu_run = svm_vcpu_run,



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 03/13] KVM: nSVM: Split nested_svm_transition_tlb_flush() into entry/exit fns
  2025-02-05 18:23 ` [RFC PATCH 03/13] KVM: nSVM: Split nested_svm_transition_tlb_flush() into entry/exit fns Yosry Ahmed
@ 2025-03-01  1:34   ` Maxim Levitsky
  0 siblings, 0 replies; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  1:34 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> The handling for the entry and exit TLB flushes will diverge
> significantly in the following changes. Instead of adding an 'is_vmenter'
> argument like nested_vmx_transition_tlb_flush(), just split the function
> into two variants for 'entry' and 'exit'.
> 
> No functional change intended.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index bbe4f3ac9f250..2eba36af44f22 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -482,7 +482,7 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
>  	vmcb12->control.exit_int_info = exit_int_info;
>  }
>  
> -static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
> +static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  {
>  	/* Handle pending Hyper-V TLB flush requests */
>  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
> @@ -503,6 +503,16 @@ static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
>  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>  }
>  
> +static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> +{
> +	/* Handle pending Hyper-V TLB flush requests */
> +	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
> +
> +	/* See nested_svm_entry_tlb_flush() */
> +	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> +	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> +}
> +
>  /*
>   * Load guest's/host's cr3 on nested vmentry or vmexit. @nested_npt is true
>   * if we are emulating VM-Entry into a guest with NPT enabled.
> @@ -645,7 +655,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>  	u32 pause_count12;
>  	u32 pause_thresh12;
>  
> -	nested_svm_transition_tlb_flush(vcpu);
> +	nested_svm_entry_tlb_flush(vcpu);
>  
>  	/* Enter Guest-Mode */
>  	enter_guest_mode(vcpu);
> @@ -1131,7 +1141,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>  
>  	kvm_vcpu_unmap(vcpu, &map);
>  
> -	nested_svm_transition_tlb_flush(vcpu);
> +	nested_svm_exit_tlb_flush(vcpu);
>  
>  	nested_svm_uninit_mmu_context(vcpu);
>  

Looks reasonable,

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 04/13] KVM: SVM: Introduce helpers for updating TLB_CONTROL
  2025-02-05 18:23 ` [RFC PATCH 04/13] KVM: SVM: Introduce helpers for updating TLB_CONTROL Yosry Ahmed
@ 2025-03-01  1:37   ` Maxim Levitsky
  0 siblings, 0 replies; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  1:37 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> Introduce helpers for updating TLB_CONTROL in the VMCB instead of
> directly setting it. Two helpers are introduced:
> 
> - svm_add_tlb_ctl_flush(): Combines a new TLB_CONTROL value with the
>   existing one.
> 
> - svm_clear_tlb_ctl_flush(): Clears the TLB_CONTROL field.
> 
> The goal is to prevent overwriting a TLB_CONTROL value with something
> that results in less TLB entries being flushed. This does not currently
> happen as KVM only sets TLB_CONTROL_FLUSH_ASID when servicing a flush
> request, and TLB_CONTROL_FLUSH_ALL_ASID when allocating a new ASID. The
> latter always happens after the former so no unsafe overwrite happens.
> 
> However, future changes may result in subtle bugs where the TLB_CONTROL
> field is incorrectly overwritten. The new helpers prevent that.
> 
> A separate helper is used for clearing the TLB flush because it is
> semantically different. In this case, KVM knowingly ignores the existing
> value of TLB_CONTROL. Also, although svm_add_tlb_ctl_flush() would just
> work for TLB_CONTROL_DO_NOTHING, the logic becomes inconsistent (use the
> biggest hammer unless no hammer at all is requested).
> 
> Opportunistically move the TLB_CONTROL_* definitions to
> arch/x86/kvm/svm/svm.h as they are not used outside of
> arch/x86/kvm/svm/.
> 
> No functional change intended.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/include/asm/svm.h |  6 ------
>  arch/x86/kvm/svm/nested.c  |  2 +-
>  arch/x86/kvm/svm/sev.c     |  2 +-
>  arch/x86/kvm/svm/svm.c     |  6 +++---
>  arch/x86/kvm/svm/svm.h     | 29 +++++++++++++++++++++++++++++
>  5 files changed, 34 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 2b59b9951c90e..e6bccf8f90982 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -169,12 +169,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  	};
>  };
>  
> -
> -#define TLB_CONTROL_DO_NOTHING 0
> -#define TLB_CONTROL_FLUSH_ALL_ASID 1
> -#define TLB_CONTROL_FLUSH_ASID 3
> -#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
> -
>  #define V_TPR_MASK 0x0f
>  
>  #define V_IRQ_SHIFT 8
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 2eba36af44f22..0e9b0592c1f83 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -690,7 +690,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>  	/* Done at vmrun: asid.  */
>  
>  	/* Also overwritten later if necessary.  */
> -	vmcb02->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
> +	svm_clear_tlb_ctl_flush(vmcb02);
>  
>  	/* nested_cr3.  */
>  	if (nested_npt_enabled(svm))
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index b0adfd0537d00..3af296d6c04f6 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3481,7 +3481,7 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu)
>  		return;
>  
>  	sd->sev_vmcbs[asid] = svm->vmcb;
> -	svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
> +	svm_add_tlb_ctl_flush(svm->vmcb, TLB_CONTROL_FLUSH_ASID);
>  	vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
>  }
>  
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 2108b48ba4959..a2d601cd4c283 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1985,7 +1985,7 @@ static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
>  	if (sd->next_asid > sd->max_asid) {
>  		++sd->asid_generation;
>  		sd->next_asid = sd->min_asid;
> -		svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;
> +		svm_add_tlb_ctl_flush(svm->vmcb, TLB_CONTROL_FLUSH_ALL_ASID);
>  		vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
>  	}
>  
> @@ -3974,7 +3974,7 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu, struct kvm_vmcb_info *vmcb
>  	 * VM-Exit (via kvm_mmu_reset_context()).
>  	 */
>  	if (static_cpu_has(X86_FEATURE_FLUSHBYASID))
> -		vmcb->ptr->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
> +		svm_add_tlb_ctl_flush(vmcb->ptr, TLB_CONTROL_FLUSH_ASID);
>  	else
>  		vmcb->asid_generation--;
>  }
> @@ -4317,7 +4317,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>  		svm->nested.nested_run_pending = 0;
>  	}
>  
> -	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
> +	svm_clear_tlb_ctl_flush(svm->vmcb);
>  	vmcb_mark_all_clean(svm->vmcb);
>  
>  	/* if exit due to PF check for async PF */
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index ebbb0b1a64676..6a73d6ed1e428 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -611,6 +611,35 @@ void svm_set_x2apic_msr_interception(struct vcpu_svm *svm, bool disable);
>  void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
>  				     int trig_mode, int vec);
>  
> +#define TLB_CONTROL_DO_NOTHING 0
> +#define TLB_CONTROL_FLUSH_ALL_ASID 1
> +#define TLB_CONTROL_FLUSH_ASID 3
> +#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
> +
> +/*
> + * Clearing TLB flushes is done separately because combining
> + * TLB_CONTROL_DO_NOTHING with others is counter-intuitive.
> + */
> +static inline void svm_add_tlb_ctl_flush(struct vmcb *vmcb, u8 tlb_ctl)
> +{
> +	if (WARN_ON_ONCE(tlb_ctl == TLB_CONTROL_DO_NOTHING))
> +		return;
> +
> +	/*
> +	 * Apply the least targeted (most inclusive) TLB flush. Apart from
> +	 * TLB_CONTROL_DO_NOTHING, lower values of tlb_ctl are less targeted.
> +	 */
> +	if (vmcb->control.tlb_ctl == TLB_CONTROL_DO_NOTHING)
> +		vmcb->control.tlb_ctl = tlb_ctl;
> +	else
> +		vmcb->control.tlb_ctl = min(vmcb->control.tlb_ctl, tlb_ctl);
> +}
> +
> +static inline void svm_clear_tlb_ctl_flush(struct vmcb *vmcb)
> +{
> +	vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
> +}
> +
>  /* nested.c */
>  
>  #define NESTED_EXIT_HOST	0	/* Exit handled on host level */


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly
  2025-02-05 18:23 ` [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly Yosry Ahmed
@ 2025-03-01  1:55   ` Maxim Levitsky
  2025-03-03 22:05     ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  1:55 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> Currently, INVLPGA interception handles it like INVLPG, which flushes
> L1's TLB translations for the address. It was implemented in this way
> because L1 and L2 shared an ASID. Now, L1 and L2 have separate ASIDs. It
> is still harmless to flush L1's translations, but it's only correct
> because all translations are flushed on nested transitions anyway.
> 
> In preparation for stopping unconditional flushes on nested transitions,
> handle INVLPGA interception properly. If L1 specified zero as the ASID,
> this is equivalent to INVLPG, so handle it as such. Otherwise, use
> INVLPGA to flush the translations of the appropriate ASID tracked by
> KVM, if any. Sync the shadow MMU as well, as L1 invalidated L2's
> mappings.
> 
> Opportunistically update svm_flush_tlb_gva() to use
> svm->current_vmcb->asid instead of svm->vmcb->control.asid for
> consistency. The two should always be in sync except when KVM allocates
> a new ASID in pre_svm_run(), and they are shortly brought back in sync
> in svm_vcpu_run(). However, if future changes add more code paths where
> KVM allocates a new ASID, flushing the potentially old ASID in
> svm->vmcb->control.asid would be unnecessary overhead (although probably
> not much different from flushing the newly allocated ASID).
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu/mmu.c          |  5 +++--
>  arch/x86/kvm/svm/svm.c          | 40 ++++++++++++++++++++++++++++++---
>  3 files changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5193c3dfbce15..1e147bb2e560f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2213,6 +2213,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>  		       void *insn, int insn_len);
>  void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
>  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
> +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> +			       u64 addr, unsigned long roots, bool gva_flush);
>  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  			     u64 addr, unsigned long roots);
>  void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ac133abc9c173..f5e0d2c8f4bbe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6158,8 +6158,8 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
>  	write_unlock(&vcpu->kvm->mmu_lock);
>  }
>  
> -static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> -				      u64 addr, unsigned long roots, bool gva_flush)
> +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> +			       u64 addr, unsigned long roots, bool gva_flush)
>  {
>  	int i;
>  
> @@ -6185,6 +6185,7 @@ static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
>  			kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
>  	}
>  }
> +EXPORT_SYMBOL_GPL(__kvm_mmu_invalidate_addr);
>  
>  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  			     u64 addr, unsigned long roots)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index a2d601cd4c283..9e29f87d3bd93 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2483,6 +2483,7 @@ static int clgi_interception(struct kvm_vcpu *vcpu)
>  
>  static int invlpga_interception(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_svm *svm = to_svm(vcpu);
>  	gva_t gva = kvm_rax_read(vcpu);
>  	u32 asid = kvm_rcx_read(vcpu);
>  
> @@ -2492,8 +2493,41 @@ static int invlpga_interception(struct kvm_vcpu *vcpu)
>  
>  	trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
>  
> -	/* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
> -	kvm_mmu_invlpg(vcpu, gva);
> +	/*
> +	 * APM is silent about using INVLPGA to flush the host ASID (i.e. 0).
> +	 * Do the logical thing and handle it like INVLPG.
> +	 */
> +	if (asid == 0) {
> +		kvm_mmu_invlpg(vcpu, gva);
> +		return kvm_skip_emulated_instruction(vcpu);
> +	}
> +
> +	/*
> +	 * Check if L1 specified the L2 ASID we are currently tracking. If it
> +	 * isn't, do nothing as we have to handle the TLB flush when switching
> +	 * to the new ASID anyway. APM mentions that INVLPGA is typically only
> +	 * meaningful with shadow paging, so also do nothing if L1 is using
> +	 * nested NPT.
> +	 */
> +	if (!nested_npt_enabled(svm) && asid == svm->nested.last_asid)
> +		invlpga(gva, svm->nested.vmcb02.asid);


Hi, 

IMHO we can't NOP the INVLPGA just because it is not useful in the nested
NPT case.

If I understand the APM correctly, the CPU will honor the INVLPGA request
even when NPT is enabled, so KVM must do the same.

INVLPGA is not very useful for a hypervisor using NPT, because it takes a
GVA, which such a hypervisor won't usually track, but we can't completely
rule out that some hypervisor uses it in some cases.
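
I.e. something like (untested):

	/*
	 * The CPU honors INVLPGA even when NPT is enabled, so emulate it
	 * regardless of whether L1 is using nested NPT.
	 */
	if (asid == svm->nested.last_asid)
		invlpga(gva, svm->nested.vmcb02.asid);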


Also, the series is out of order here: last_asid isn't declared yet; it is
only added in patch 10.

Best regards,
	Maxim Levitsky


> +
> +	/*
> +	 * If NPT is disabled, sync the shadow page tables as L1 is invalidating
> +	 * mappings for L2. Sync all roots as ASIDs are not tracked in the MMU
> +	 * role.
> +	 *
> +	 * As we are not flushing the current context, skip the gva flush from
> +	 * __kvm_mmu_invalidate_addr(), it would flush the wrong ASID anyway.
> +	 * The correct TLB flush was done above (if needed).
> +	 *
> +	 * This always operates on root_mmu because L1 and L2 share an MMU when
> +	 * NPT is disabled. This can be optimized by invalidating guest roots
> +	 * only.
> +	 */
> +	if (!npt_enabled)
> +		__kvm_mmu_invalidate_addr(vcpu, &vcpu->arch.root_mmu, gva,
> +					  KVM_MMU_ROOTS_ALL, false);
>  
>  	return kvm_skip_emulated_instruction(vcpu);
>  }
> @@ -4017,7 +4051,7 @@ static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  
> -	invlpga(gva, svm->vmcb->control.asid);
> +	invlpga(gva, svm->current_vmcb->asid);
>  }
>  
>  static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/13] KVM: nSVM: Flush both L1 and L2 ASIDs on KVM_REQ_TLB_FLUSH
  2025-02-05 18:23 ` [RFC PATCH 08/13] KVM: nSVM: Flush both L1 and L2 ASIDs on KVM_REQ_TLB_FLUSH Yosry Ahmed
@ 2025-03-01  1:58   ` Maxim Levitsky
  2025-03-03 22:06     ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  1:58 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> KVM_REQ_TLB_FLUSH is used to flush all TLB entries for all contexts
> (e.g. in kvm_flush_remote_tlbs()). Flush both L1 and L2 ASIDs in
> svm_flush_tlb_all() to handle it appropriately.
> 
> This is currently not required as nested transitions do unconditional
> TLB flushes, but this is a step toward eliminating that.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 1 -
>  arch/x86/kvm/svm/svm.c    | 4 +++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 0e9b0592c1f83..0735177b95a1d 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -491,7 +491,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
>  	 * things to fix before this can be conditional:
>  	 *
> -	 *  - Flush TLBs for both L1 and L2 remote TLB flush
>  	 *  - Honor L1's request to flush an ASID on nested VMRUN
>  	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
>  	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 9e29f87d3bd93..8342c7eadbba8 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4044,7 +4044,9 @@ static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
>  	if (WARN_ON_ONCE(svm_hv_is_enlightened_tlb_enabled(vcpu)))
>  		hv_flush_remote_tlbs(vcpu->kvm);
>  
> -	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
> +	svm_flush_tlb_asid(vcpu, &svm->vmcb01);
> +	if (svm->nested.initialized)
> +		svm_flush_tlb_asid(vcpu, &svm->nested.vmcb02);
>  }

This makes sense.

Note that this doesn't actually flush the ASID in use, but rather ensures
that we will flush it on the next entry via that vmcb (either because a new
ASID will be picked, or because we set tlb_ctl in that vmcb).

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky

>  
>  static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL
  2025-02-05 18:23 ` [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL Yosry Ahmed
  2025-02-05 21:45   ` Yosry Ahmed
@ 2025-03-01  2:06   ` Maxim Levitsky
  2025-03-03 22:10     ` Yosry Ahmed
  1 sibling, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  2:06 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> Handle L1's requests to flush L2's TLB through the TLB_CONTROL field of
> VMCB12. This is currently redundant because a full flush is executed on
> every nested transition, but is a step towards removing that.
> 
> TLB_CONTROL_FLUSH_ALL_ASID flushes all ASIDs from L1's perspective,
> including its own, so do a guest TLB flush on both transitions. Never
> propagate TLB_CONTROL_FLUSH_ALL_ASID from the guest to the actual VMCB,
> as this gives the guest the power to flush the entire physical TLB
> (including translations for the host and other VMs).
> 
> For other cases, the TLB flush is only done when entering L2. The nested
> NPT MMU is also sync'd because TLB_CONTROL also flushes NPT
> guest-physical mappings.
> 
> All TLB_CONTROL values can be handled by KVM regardless of FLUSHBYASID
> support on the underlying CPU, so keep advertising FLUSHBYASID to the
> guest unconditionally.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 42 ++++++++++++++++++++++++++++++++-------
>  arch/x86/kvm/svm/svm.c    |  6 +++---
>  2 files changed, 38 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 0735177b95a1d..e2c59eb2907e8 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -484,19 +484,36 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
>  
>  static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
>  	/* Handle pending Hyper-V TLB flush requests */
>  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
>  
> +	/*
> +	 * If L1 requested a TLB flush for L2, flush L2's TLB on nested entry
> +	 * and sync the nested NPT MMU, as TLB_CONTROL also flushes NPT
> +	 * guest-physical mappings. We technically only need to flush guest_mode
> +	 * page tables.
> +	 *
> +	 * KVM_REQ_TLB_FLUSH_GUEST will flush L2's ASID even if the underlying
> +	 * CPU does not support FLUSHBYASID (by assigning a new ASID), so we
> +	 * can handle all TLB_CONTROL values from L1 regardless.
> +	 *
> +	 * Note that TLB_CONTROL_FLUSH_ASID_LOCAL is handled exactly like
> +	 * TLB_CONTROL_FLUSH_ASID. We can technically flush fewer TLB entries,
> +	 * but this would require significantly more complexity.
> +	 */

I think it might make sense to note that we in essence support only one
non-zero ASID in L1, the one that it picks for the nested guest.

Thus, when asked to do TLB_CONTROL_FLUSH_ALL_ASID, we only need to flush
L2's real ASID and L1's ASID.
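
I.e. in essence something like (untested, reusing the per-VMCB helper from
patch 2):

	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID) {
		svm_flush_tlb_asid(vcpu, &svm->vmcb01);
		svm_flush_tlb_asid(vcpu, &svm->nested.vmcb02);
	}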


> +	if (svm->nested.ctl.tlb_ctl != TLB_CONTROL_DO_NOTHING) {
> +		if (nested_npt_enabled(svm))
> +			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> +	}
> +
>  	/*
>  	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
>  	 * things to fix before this can be conditional:
>  	 *
> -	 *  - Honor L1's request to flush an ASID on nested VMRUN
> -	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
>  	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> -	 *
> -	 * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
> -	 *     NPT guest-physical mappings on VMRUN.
>  	 */
>  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> @@ -504,9 +521,18 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  
>  static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
>  	/* Handle pending Hyper-V TLB flush requests */
>  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
>  
> +	/*
> +	 * If L1 had requested a full TLB flush when entering L2, also flush its
> +	 * TLB entries when exiting back to L1.
> +	 */
> +	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
> +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);

Makes sense.

> +
>  	/* See nested_svm_entry_tlb_flush() */
>  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> @@ -825,7 +851,8 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
>  	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>  
>  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> -	nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
> +	nested_vmcb02_prepare_control(svm, vmcb12->save.rip,
> +				      vmcb12->save.cs.base);
>  	nested_vmcb02_prepare_save(svm, vmcb12);
>  
>  	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
> @@ -1764,7 +1791,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>  	nested_copy_vmcb_control_to_cache(svm, ctl);
>  
>  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> -	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base);
> +	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip,
> +				      svm->vmcb->save.cs.base);
>  
>  	/*
>  	 * While the nested guest CR3 is already checked and set by
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 8342c7eadbba8..5e7b1c9bfa605 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -5242,9 +5242,9 @@ static __init void svm_set_cpu_caps(void)
>  		kvm_cpu_cap_set(X86_FEATURE_VMCBCLEAN);
>  
>  		/*
> -		 * KVM currently flushes TLBs on *every* nested SVM transition,
> -		 * and so for all intents and purposes KVM supports flushing by
> -		 * ASID, i.e. KVM is guaranteed to honor every L1 ASID flush.
> +		 * KVM currently handles all TLB_CONTROL values set by L1, even
> +		 * if the underlying CPU does not. See
> +		 * nested_svm_transition_tlb_flush().
>  		 */
>  		kvm_cpu_cap_set(X86_FEATURE_FLUSHBYASID);
>  

Patch looks OK, but I can't be 100% sure about this.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 10/13] KVM: nSVM: Flush the TLB if L1 changes L2's ASID
  2025-02-05 18:23 ` [RFC PATCH 10/13] KVM: nSVM: Flush the TLB if L1 changes L2's ASID Yosry Ahmed
@ 2025-03-01  2:13   ` Maxim Levitsky
  0 siblings, 0 replies; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  2:13 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> KVM tracks a single ASID for L2 guests. L1 could change the ASID it has
> assigned L2 due to switching to a different L2 guest or simply to avoid
> flushing L2's existing ASID. Flush L2's TLB when this happens to avoid
> reusing TLB entries from the old ASID (from L1's perspective).
> 
> Remove the comment in __nested_copy_vmcb_control_to_cache() about the
> cached ASID usage, as this change makes it stale by adding another
> usage.
> 
> This is heavily inspired by nVMX's handling of last_vpid.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 5 ++++-
>  arch/x86/kvm/svm/svm.h    | 2 ++
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index e2c59eb2907e8..12bb391884299 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -368,7 +368,6 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
>  	to->pause_filter_count  = from->pause_filter_count;
>  	to->pause_filter_thresh = from->pause_filter_thresh;
>  
> -	/* Copy asid here because nested_vmcb_check_controls will check it.  */
>  	to->asid           = from->asid;
>  	to->msrpm_base_pa &= ~0x0fffULL;
>  	to->iopm_base_pa  &= ~0x0fffULL;
> @@ -509,6 +508,10 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  	}
>  
> +	if (svm->nested.ctl.asid != svm->nested.last_asid) {
> +		svm->nested.last_asid = svm->nested.ctl.asid;
> +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> +	}

>  	/*
>  	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
>  	 * things to fix before this can be conditional:
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 6a73d6ed1e428..f2352135b99d3 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -211,6 +211,8 @@ struct svm_nested_state {
>  	 * on its side.
>  	 */
>  	bool force_msr_bitmap_recalc;
> +
> +	u32 last_asid;
>  };
>  
>  struct vcpu_sev_es_state {


I can't be 100% sure but overall the patch looks correct to me.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry
  2025-02-05 18:24 ` [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry Yosry Ahmed
@ 2025-03-01  2:17   ` Maxim Levitsky
  2025-03-03 22:14     ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  2:17 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> TLB_CONTROL is reset to TLB_CONTROL_DO_NOTHING on nested transitions to
> L2. This is unnecessary because it should always be
> TLB_CONTROL_DO_NOTHING at this point.
> 
> The flow for setting TLB_CONTROL is as follows:
> 1. In vcpu_enter_guest(), servicing a TLB flush request may set it to
> TLB_CONTROL_FLUSH_ASID in svm_flush_tlb_asid().
> 2. In svm_vcpu_run() -> pre_svm_run(), it may get upgraded to
> TLB_CONTROL_FLUSH_ALL_ASID when assigning a new ASID.
> 3. In svm_vcpu_run(), it gets reset to TLB_CONTROL_DO_NOTHING after the
> guest is run.
> 
> Hence, TLB_CONTROL is reset after each run and there is no need to do it
> again on every nested transition to L2.
> 
> There is a TODO in nested_svm_transition_tlb_flush() about this reset
> crushing pending TLB flushes. Remove it, as the reset is not really
> crushing anything as explained above.

I am not sure that we don't crush a pending TLB request:

svm_flush_tlb_asid() can also be called when servicing KVM_REQ_TLB_FLUSH
and set the flush request in both vmcbs, so a later
nested_svm_exit_tlb_flush() can crush this request.

But the patch itself does look OK to me, although I might be mistaken.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>


Best regards,
	Maxim Levitsky


> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 12bb391884299..8e40ff21f7353 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -512,12 +512,7 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  		svm->nested.last_asid = svm->nested.ctl.asid;
>  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  	}
> -	/*
> -	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
> -	 * things to fix before this can be conditional:
> -	 *
> -	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> -	 */
> +	/* TODO: optimize unconditional TLB flush/MMU sync */
>  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>  }
> @@ -536,7 +531,7 @@ static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
>  	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
>  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  
> -	/* See nested_svm_entry_tlb_flush() */
> +	/* TODO: optimize unconditional TLB flush/MMU sync */
>  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>  }
> @@ -717,9 +712,6 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>  
>  	/* Done at vmrun: asid.  */
>  
> -	/* Also overwritten later if necessary.  */
> -	svm_clear_tlb_ctl_flush(vmcb02);
> -
>  	/* nested_cr3.  */
>  	if (nested_npt_enabled(svm))
>  		nested_svm_init_mmu_context(vcpu);



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions
  2025-02-05 18:24 ` [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions Yosry Ahmed
@ 2025-03-01  2:20   ` Maxim Levitsky
  2025-03-03 22:18     ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  2:20 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> KVM does not track TLB flush requests for L1 vs. L2. Hence, service
> local flushes that target the current context before switching to a new
> one. Since ASIDs are tracked per-VMCB, service the flushes before every
> VMCB switch.
> 
> This is conceptually similar to how nVMX calls
> kvm_service_local_tlb_flush_requests() in
> nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(), with the
> following differences:
> 
> 1. nVMX tracks the current VPID based on is_guest_mode(), so local TLB
>    flushes are serviced before enter_guest_mode() and
>    leave_guest_mode(). On the other hand, nSVM tracks the current ASID
>    based on the current VMCB, so the TLB flushes are serviced before a
>    VMCB switch.
> 
> 2. nVMX only enters and leaves guest mode in
>    nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(). Other paths
>    like vmx_set_nested_state() and vmx_leave_nested() call into these
>    two functions. On the other hand, nSVM open codes the switch in
>    functions like svm_set_nested_state() and svm_leave_nested(), so
>    servicing the flush in svm_switch_vmcb() is probably most reliable.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/svm.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 5e7b1c9bfa605..6daa7efa9262b 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1421,6 +1421,12 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  
>  void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
>  {
> +	/*
> +	 * ASIDs are tracked per-VMCB. Perform any pending TLB flushes for the
> +	 * current VMCB before switching to a new one.
> +	 */
> +	kvm_service_local_tlb_flush_requests(&svm->vcpu);
> +
>  	svm->current_vmcb = target_vmcb;
>  	svm->vmcb = target_vmcb->ptr;
>  }


Note that another difference between SVM and VMX is that this code will only
set tlb_ctl in the current vmcb; the actual flush can happen much later, when
we do a VM entry with this vmcb. E.g. if we are now in L2, the flush will
happen when we enter L2 again.

I think that this is correct, but I might be mistaken.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>


Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on nested transitions
  2025-02-05 18:24 ` [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on " Yosry Ahmed
@ 2025-03-01  2:21   ` Maxim Levitsky
  2025-03-03 22:21     ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-01  2:21 UTC (permalink / raw)
  To: Yosry Ahmed, Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> Now that nested TLB flushes are properly tracked with a well-maintained
> separate ASID for L2 and proper handling of L1's TLB flush requests,
> drop the unconditional flushes and syncs on nested transitions.
> 
> On a Milan machine, an L1 and an L2 guest were booted, both with a single
> vCPU, and pinned to a single physical CPU to maximize TLB collisions. In
> this setup, the cpuid_rate microbenchmark [1] showed the following
> changes with this patch:
> 
> +--------+--------+-------------------+----------------------+
> | L0     | L1     | cpuid_rate (base) | cpuid_rate (patched) |
> +========+========+===================+======================+
> | NPT    | NPT    | 256621            | 301113 (+17.3%)      |
> | NPT    | Shadow | 180017            | 203347 (+12.96%)     |
> | Shadow | Shadow | 177006            | 189150 (+6.86%)      |
> +--------+--------+-------------------+----------------------+
> 
> [1]https://lore.kernel.org/kvm/20231109180646.2963718-1-khorenko@virtuozzo.com/
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 8e40ff21f7353..45a187d4c23d1 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -512,9 +512,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
>  		svm->nested.last_asid = svm->nested.ctl.asid;
>  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  	}
> -	/* TODO: optimize unconditional TLB flush/MMU sync */
> -	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> -	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>  }
>  
>  static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> @@ -530,10 +527,6 @@ static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
>  	 */
>  	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
>  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> -
> -	/* TODO: optimize unconditional TLB flush/MMU sync */
> -	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> -	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>  }
>  
>  /*


Assuming that all previous patches are correct this one should work as well.

However, only very heavy stress testing, including Hyper-V and Windows
guests of various types, etc., can give me confidence that there is no ugly
bug lurking somewhere.

TLB management can be very tricky, so I can't be 100% sure that I haven't missed something.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB
  2025-03-01  0:03   ` Sean Christopherson
@ 2025-03-03 17:51     ` Jim Mattson
  2025-03-03 18:53       ` Sean Christopherson
  2025-03-03 19:18     ` Yosry Ahmed
  1 sibling, 1 reply; 48+ messages in thread
From: Jim Mattson @ 2025-03-03 17:51 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yosry Ahmed, Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 28, 2025 at 4:03 PM Sean Christopherson <seanjc@google.com> wrote:
>
> +Jim, for his input on VPIDs.
>
> On Wed, Feb 05, 2025, Yosry Ahmed wrote:
> > The ASID is currently tracked per-vCPU, because the same ASID is used by
> > L1 and L2. That ASID is flushed on every transition between L1 and L2.
> >
> > Track the ASID separately for each VMCB (similar to the
> > asid_generation), giving L2 a separate ASID. This is in preparation for
> > doing fine-grained TLB flushes on nested transitions instead of
> > unconditional full flushes.
>
> After having some time to think about this, rather than track ASIDs per VMCB, I
> think we should converge on a single approach for nVMX (VPID) and nSVM (ASID).
>
> Per **VM**, one VPID/ASID for L1, and one VPID/ASID for L2.

When using EPT on VMX, there is probably no advantage to using one
VPID per VM. The physical ASID is determined by <EPTRTA, VPID, PCID>.
Two different VMs are not going to share an EPTRTA, so they already
have different ASIDs, even if they have the same VPID.

> For SVM, the dynamic ASID crud is a holdover from KVM's support for CPUs that
> don't support FLUSHBYASID, i.e. needed to purge the entire TLB in order to flush
> guest mappings.  FLUSHBYASID was added in 2010, and AFAIK has been supported by
> all AMD CPUs since.

> KVM already mostly keeps the same ASID, except for when a vCPU is migrated, in
> which case KVM assigns a new ASID.  I suspect that following VMX's lead and
> simply doing a TLB flush in this situation would be an improvement for modern
> CPUs, as it would flush the entries that need to be flushed, and not pollute the
> TLBs with stale, unused entries.
>
> Using a static per-VM ASID would also allow using broadcast invalidations[*],
> would simplify the SVM code base, and I think/hope would allow us to move much
> of the TLB flushing logic, e.g. for task migration, to common code.
>
> For VPIDs, maybe it's because it's Friday afternoon, but for the life of me I
> can't think of any reason why KVM needs to assign VPIDs per vCPU.  Especially
> since KVM is ridiculously conservative and flushes _all_ EPT/VPID contexts when
> running a different vCPU on a pCPU (which I suspect we can trim down?).
>
> Am I forgetting something?

TDX? IIRC, TDX requires a unique VPID for each vCPU in a VM.

> [*] https://lore.kernel.org/all/Z8HdBg3wj8M7a4ts@google.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB
  2025-03-03 17:51     ` Jim Mattson
@ 2025-03-03 18:53       ` Sean Christopherson
  0 siblings, 0 replies; 48+ messages in thread
From: Sean Christopherson @ 2025-03-03 18:53 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Yosry Ahmed, Paolo Bonzini, kvm, linux-kernel

On Mon, Mar 03, 2025, Jim Mattson wrote:
> On Fri, Feb 28, 2025 at 4:03 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > +Jim, for his input on VPIDs.
> >
> > On Wed, Feb 05, 2025, Yosry Ahmed wrote:
> > > The ASID is currently tracked per-vCPU, because the same ASID is used by
> > > L1 and L2. That ASID is flushed on every transition between L1 and L2.
> > >
> > > Track the ASID separately for each VMCB (similar to the
> > > asid_generation), giving L2 a separate ASID. This is in preparation for
> > > doing fine-grained TLB flushes on nested transitions instead of
> > > unconditional full flushes.
> >
> > After having some time to think about this, rather than track ASIDs per VMCB, I
> > think we should converge on a single approach for nVMX (VPID) and nSVM (ASID).
> >
> > Per **VM**, one VPID/ASID for L1, and one VPID/ASID for L2.
> 
> When using EPT on VMX, there is probably no advantage to using one
> VPID per VM. The physical ASID is determined by <EPTRTA, VPID, PCID>.
> Two different VMs are not going to share an EPTRTA, so they already
> have different ASIDs, even if they have the same VPID.

For posterity, the SDM says this:

  Linear mappings may be created. They are derived from the paging structures
  referenced (directly or indirectly) by the current value of CR3 and are associated
  with the current VPID and the current PCID.

it explicitly disallows creating or using linear mappings when EPT is enabled:

  No linear mappings are created while EPT is in use.

  no linear mappings are used while EPT is in use.

I think it's still worth assigning a unique VPID though, e.g. it would provide
some amount of defense in depth.  I.e. two different VMs *shouldn't* share an
EPTRTA :-)

> > For SVM, the dynamic ASID crud is a holdover from KVM's support for CPUs that
> > don't support FLUSHBYASID, i.e. needed to purge the entire TLB in order to flush
> > guest mappings.  FLUSHBYASID was added in 2010, and AFAIK has been supported by
> > all AMD CPUs since.
> 
> > KVM already mostly keeps the same ASID, except for when a vCPU is migrated, in
> > which case KVM assigns a new ASID.  I suspect that following VMX's lead and
> > simply doing a TLB flush in this situation would be an improvement for modern
> > CPUs, as it would flush the entries that need to be flushed, and not pollute the
> > TLBs with stale, unused entries.
> >
> > Using a static per-VM ASID would also allow using broadcast invalidations[*],
> > would simplify the SVM code base, and I think/hope would allow us to move much
> > of the TLB flushing logic, e.g. for task migration, to common code.
> >
> > For VPIDs, maybe it's because it's Friday afternoon, but for the life of me I
> > can't think of any reason why KVM needs to assign VPIDs per vCPU.  Especially
> > since KVM is ridiculously conservative and flushes _all_ EPT/VPID contexts when
> > running a different vCPU on a pCPU (which I suspect we can trim down?).
> >
> > Am I forgetting something?
> 
> TDX? IIRC, TDX requires a unique VPID for each vCPU in a VM.

Ha!  Nope, the TDX module actually does what I'm suggesting, and uses a per-VM
VPID.  So if I'm forgetting some TLB edge case, TDX is already hosed.

FWIW, the hypervisor, i.e. KVM, has no control over the VPID used by the TDX
module.  Intel incorporated SEAM mode into the ASID tag to prevent TLB collisions
between the hypervisor and the TDX module, and that also conveniently provides
separation between VPIDs for non-TDX VMs and TDX VMs (and now I'm curious if TDX
enabling does the "right" thing and skips VPID allocation).

FWIW, TDX's scheme would match what I'm proposing almost exactly.  TDX "composes"
the VPID using the HKID (guaranteed unique per VM) and then a "VM identifier",
which at a glance differentiates L1 from L2.

> > [*] https://lore.kernel.org/all/Z8HdBg3wj8M7a4ts@google.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB
  2025-03-01  0:03   ` Sean Christopherson
  2025-03-03 17:51     ` Jim Mattson
@ 2025-03-03 19:18     ` Yosry Ahmed
  1 sibling, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 19:18 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson

On Fri, Feb 28, 2025 at 04:03:08PM -0800, Sean Christopherson wrote:
> +Jim, for his input on VPIDs.
> 
> On Wed, Feb 05, 2025, Yosry Ahmed wrote:
> > The ASID is currently tracked per-vCPU, because the same ASID is used by
> > L1 and L2. That ASID is flushed on every transition between L1 and L2.
> > 
> > Track the ASID separately for each VMCB (similar to the
> > asid_generation), giving L2 a separate ASID. This is in preparation for
> > doing fine-grained TLB flushes on nested transitions instead of
> > unconditional full flushes.
> 
> After having some time to think about this, rather than track ASIDs per VMCB, I
> think we should converge on a single approach for nVMX (VPID) and nSVM (ASID).
> 
> Per **VM**, one VPID/ASID for L1, and one VPID/ASID for L2.
> 
> For SVM, the dynamic ASID crud is a holdover from KVM's support for CPUs that
> don't support FLUSHBYASID, i.e. needed to purge the entire TLB in order to flush
> guest mappings.  FLUSHBYASID was added in 2010, and AFAIK has been supported by
> all AMD CPUs since.

This means that for these old CPUs, every TLB flush done for the guest
will also flush the TLB entries of all other guests and the host, IIUC. I
am not sure which CPUs in use today do not support FLUSHBYASID, but this
sounds like a big regression for them.

I am all for simplifying the code and converging nVMX and nSVM, but I am
a bit worried about this. Sounds like you are not though, so maybe I am
missing something :P

I initially thought that the ASID space was too small, but it turns out I
was confused by the ASID messages from the SEV code. The max number of
ASIDs seems to be (1 << 15) on Rome, Milan, and Genoa CPUs. That's half
of VMX_NR_VPIDS, and probably good enough.

> 
> KVM already mostly keeps the same ASID, except for when a vCPU is migrated, in
> which case KVM assigns a new ASID.  I suspect that following VMX's lead and
> simply doing a TLB flush in this situation would be an improvement for modern
> CPUs, as it would flush the entries that need to be flushed, and not pollute the
> TLBs with stale, unused entries.
> 
> Using a static per-VM ASID would also allow using broadcast invalidations[*],
> would simplify the SVM code base, and I think/hope would allow us to move much
> of the TLB flushing logic, e.g. for task migration, to common code.
> 
> For VPIDs, maybe it's because it's Friday afternoon, but for the life of me I
> can't think of any reason why KVM needs to assign VPIDs per vCPU.  Especially
> since KVM is ridiculously conservative and flushes _all_ EPT/VPID contexts when
> running a different vCPU on a pCPU (which I suspect we can trim down?).

I think for the purpose of this series we can switch SVM to use one ASID
per vCPU to match the current nVMX behavior and simplify things. Moving
both nSVM and nVMX to use a single ASID per VM instead of per vCPU, and
potentially moving some of the logic to the common code, could be a
separate followup effort (maybe something that I can work on later this
year if no one picks it up :) ).

WDYT?

> 
> Am I forgetting something?
> 
> [*] https://lore.kernel.org/all/Z8HdBg3wj8M7a4ts@google.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB
  2025-03-01  1:23   ` Maxim Levitsky
@ 2025-03-03 19:31     ` Yosry Ahmed
  0 siblings, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 19:31 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 28, 2025 at 08:23:48PM -0500, Maxim Levitsky wrote:
> On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > The ASID is currently tracked per-vCPU, because the same ASID is used by
> > L1 and L2. That ASID is flushed on every transition between L1 and L2.
> > 
> > Track the ASID separately for each VMCB (similar to the
> > asid_generation), giving L2 a separate ASID. This is in preparation for
> > doing fine-grained TLB flushes on nested transitions instead of
> > unconditional full flushes.
> > 
> > The ASIDs are still not fully maintained (e.g. a remote flush will only
> > flush the current ASID), so keep the TLB flush on every transition until
> > this is sorted out.
> > 
> > L1's ASID will be flushed on KVM_REQ_TLB_FLUSH_GUEST if it is the
> > active context, so remove the TODO in nested_svm_transition_tlb_flush()
> > about it.
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/nested.c |  1 -
> >  arch/x86/kvm/svm/sev.c    |  2 +-
> >  arch/x86/kvm/svm/svm.c    | 12 +++++++-----
> >  arch/x86/kvm/svm/svm.h    |  2 +-
> >  4 files changed, 9 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 04c375bf1ac2a..bbe4f3ac9f250 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -495,7 +495,6 @@ static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
> >  	 *  - Honor L1's request to flush an ASID on nested VMRUN
> >  	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
> >  	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> > -	 *  - Flush L1's ASID on KVM_REQ_TLB_FLUSH_GUEST
> >  	 *
> >  	 * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
> >  	 *     NPT guest-physical mappings on VMRUN.
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 799f8494b599c..b0adfd0537d00 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -3468,7 +3468,7 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu)
> >  	unsigned int asid = sev_get_asid(svm->vcpu.kvm);
> >  
> >  	/* Assign the asid allocated with this SEV guest */
> > -	svm->asid = asid;
> > +	svm->current_vmcb->asid = asid;
> >  
> >  	/*
> >  	 * Flush guest TLB:
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 7640a84e554a6..08340ae57777b 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -1335,8 +1335,10 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
> >  		save->g_pat = vcpu->arch.pat;
> >  		save->cr3 = 0;
> >  	}
> > -	svm->current_vmcb->asid_generation = 0;
> > -	svm->asid = 0;
> > +	svm->vmcb01.asid_generation = 0;
> > +	svm->vmcb01.asid = 0;
> > +	svm->nested.vmcb02.asid_generation = 0;
> > +	svm->nested.vmcb02.asid = 0;
> >  
> >  	svm->nested.vmcb12_gpa = INVALID_GPA;
> >  	svm->nested.last_vmcb12_gpa = INVALID_GPA;
> > @@ -1988,7 +1990,7 @@ static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
> >  	}
> >  
> >  	svm->current_vmcb->asid_generation = sd->asid_generation;
> > -	svm->asid = sd->next_asid++;
> > +	svm->current_vmcb->asid = sd->next_asid++;
> >  }
> >  
> >  static void svm_set_dr6(struct vcpu_svm *svm, unsigned long value)
> > @@ -4235,8 +4237,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
> >  
> >  	sync_lapic_to_cr8(vcpu);
> >  
> > -	if (unlikely(svm->asid != svm->vmcb->control.asid)) {
> > -		svm->vmcb->control.asid = svm->asid;
> > +	if (unlikely(svm->current_vmcb->asid != svm->vmcb->control.asid)) {
> > +		svm->vmcb->control.asid = svm->current_vmcb->asid;
> >  		vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
> >  	}
> >  	svm->vmcb->save.cr2 = vcpu->arch.cr2;
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index 9d7cdb8fbf872..ebbb0b1a64676 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -133,6 +133,7 @@ struct kvm_vmcb_info {
> >  	unsigned long pa;
> >  	int cpu;
> >  	uint64_t asid_generation;
> > +	u32 asid;
> >  };
> >  
> >  struct vmcb_save_area_cached {
> > @@ -247,7 +248,6 @@ struct vcpu_svm {
> >  	struct vmcb *vmcb;
> >  	struct kvm_vmcb_info vmcb01;
> >  	struct kvm_vmcb_info *current_vmcb;
> > -	u32 asid;
> >  	u32 sysenter_esp_hi;
> >  	u32 sysenter_eip_hi;
> >  	uint64_t tsc_aux;
> 
> Hi,
> 

Hi,

Thanks for taking a look! 

> 
> I think it should be possible to eliminate the separate ASID field (current_vmcb->asid/svm->asid)
> completely and instead just use the value stored in the vmcb.
> 
> When there is a need to update it, KVM can also set the corresponding dirty bit
> as done in svm_vcpu_run (new_asid also already does this when the asid generation increases)
> 
> Also KVM already sets the tlb_ctl directly in the vmcb.
> 
> What do you think?

Yeah I think we can do that, although if we go with Sean's suggestion of
a per-VM or per-vCPU ASID, this will change anyway. If we use a
per-vCPU ASID, I think it would be nice to have it directly in svm->asid and
svm->nested.asid02 to be consistent with VMX.
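
Something like this (sketch only, field placement illustrative; VMX keeps
vmx->vpid and vmx->nested.vpid02):

struct svm_nested_state {
        /* ... */
        u32 asid02;             /* ASID used while running L2 */
};

struct vcpu_svm {
        /* ... */
        u32 asid;               /* ASID used while running L1 */
        struct svm_nested_state nested;
};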

I will see how the code turns out to be after taking Sean's suggestion
and go from there.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 02/13] KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB
  2025-03-01  1:29   ` Maxim Levitsky
@ 2025-03-03 21:58     ` Yosry Ahmed
  2025-03-05  2:52       ` Maxim Levitsky
  0 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 21:58 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, kvm,
	linux-kernel

On Fri, Feb 28, 2025 at 08:29:34PM -0500, Maxim Levitsky wrote:
> On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > svm_flush_tlb_asid() currently operates on the current VMCB. In
> > preparation for properly tracking TLB flushes for L1 and L2 ASIDs,
> > refactor it to work on a given VMCB. All existing callers pass the
> > current VMCB.
> > 
> > Create a svm_flush_tlb_guest() wrapper to use as the flush_tlb_guest()
> > callback.
> > 
> > kvm_hv_vcpu_purge_flush_tlb() is only called when the current VMCB is
> > passed to maintain current behavior.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/svm.c | 25 ++++++++++++++++++-------
> >  1 file changed, 18 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 08340ae57777b..2108b48ba4959 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -3954,7 +3954,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
> >  	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> >  }
> >  
> > -static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
> > +static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu, struct kvm_vmcb_info *vmcb)
> >  {
> >  	struct vcpu_svm *svm = to_svm(vcpu);
> >  
> > @@ -3963,7 +3963,8 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
> >  	 * A TLB flush for the current ASID flushes both "host" and "guest" TLB
> >  	 * entries, and thus is a superset of Hyper-V's fine grained flushing.
> >  	 */
> > -	kvm_hv_vcpu_purge_flush_tlb(vcpu);
> > +	if (vmcb == svm->current_vmcb)
> > +		kvm_hv_vcpu_purge_flush_tlb(vcpu);
> 
> This is hyperv PV feature that should be looked upon very carefully.
> 
> To recap, 
> each vCPU has 2 queues of pending TLB flush requests that target only small range of
> memory pages. 

Thanks for pointing this out, I missed this.

> 
> One is for L1 and one for L2, because now KVM supports a mode where L2
> can ask L0 to do a tlb flush on its behalf, and KVM will figure out to which L1 vCPUs
> to send this flush request.
> 
> Requests arrive from other vCPUs.
> 
> Here we purge the TLB request queue because we flushed a super-set of the requests,
> which used to contain both L1 and L2 TLB, but soon that won't be true.
> 
> So I think it might make sense to also add vmcb to kvm_hv_vcpu_purge_flush_tlb, and then
> depending on whether it is vmcb01 or vmcb02, purge the correct queue.
> I don't know if this is a theoretical or an actual bug, but it is better to be safe IMHO.

But I think we are already purging the right queue here. We purge the
TLB flush requests only if we are flushing the current VMCB. Within
kvm_hv_vcpu_purge_flush_tlb(), we choose the queue based on
is_guest_mode(vcpu).

svm_flush_tlb_asid() is called when servicing a TLB flush request, at
which point IIUC the current VMCB and whether the vCPU is in guest mode
should be in sync. So we will be correctly purging the L1 or L2 queue
based on the current VMCB.

That being said, it's a bit confusing that svm_flush_tlb_asid() uses the
VMCB to differentiate L1 and L2, while kvm_hv_vcpu_purge_flush_tlb()
uses is_guest_mode(). We also miss the opportunity to purge both queues
when called from svm_flush_tlb_all().

However, we cannot pass the VMCB to kvm_hv_vcpu_purge_flush_tlb() as it
is also called from common code. So I think we can make
kvm_hv_vcpu_purge_flush_tlb() take is_guest_mode as a parameter and pass
it here based on which VMCB is passed in.
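
Something like the below (untested sketch; the body is based on the
current helper, with the is_guest_mode() lookup hoisted out to the
caller):

static inline void kvm_hv_vcpu_purge_flush_tlb(struct kvm_vcpu *vcpu,
                                               bool guest_mode)
{
        struct kvm_vcpu_hv_tlb_flush_fifo *tlb_flush_fifo;
        struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);

        if (!hv_vcpu || !kvm_check_request(KVM_REQ_HV_TLB_FLUSH, vcpu))
                return;

        /* Pick the L1 or L2 queue based on what the caller is flushing. */
        tlb_flush_fifo = kvm_hv_tlb_flush_fifo(vcpu, guest_mode);
        kfifo_reset_out(&tlb_flush_fifo->entries);
}

and then svm_flush_tlb_asid() would do:

        kvm_hv_vcpu_purge_flush_tlb(vcpu, vmcb == &svm->nested.vmcb02);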

WDYT?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly
  2025-03-01  1:55   ` Maxim Levitsky
@ 2025-03-03 22:05     ` Yosry Ahmed
  2025-03-05  2:54       ` Maxim Levitsky
  0 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 22:05 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 28, 2025 at 08:55:18PM -0500, Maxim Levitsky wrote:
> On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > Currently, INVLPGA interception handles it like INVLPG, which flushes
> > L1's TLB translations for the address. It was implemented in this way
> > because L1 and L2 shared an ASID. Now, L1 and L2 have separate ASIDs. It
> > is still harmless to flush L1's translations, but it's only correct
> > because all translations are flushed on nested transitions anyway.
> > 
> > In preparation for stopping unconditional flushes on nested transitions,
> > handle INVLPGA interception properly. If L1 specified zero as the ASID,
> > this is equivalent to INVLPG, so handle it as such. Otherwise, use
> > INVLPGA to flush the translations of the appropriate ASID tracked by
> > KVM, if any. Sync the shadow MMU as well, as L1 invalidated L2's
> > mappings.
> > 
> > Opportunistically update svm_flush_tlb_gva() to use
> > svm->current_vmcb->asid instead of svm->vmcb->control.asid for
> > consistency. The two should always be in sync except when KVM allocates
> > a new ASID in pre_svm_run(), and they are shortly brought back in sync
> > in svm_vcpu_run(). However, if future changes add more code paths where
> > KVM allocates a new ASID, flushing the potentially old ASID in
> > svm->vmcb->control.asid would be unnecessary overhead (although probably
> > not much different from flushing the newly allocated ASID).
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/mmu/mmu.c          |  5 +++--
> >  arch/x86/kvm/svm/svm.c          | 40 ++++++++++++++++++++++++++++++---
> >  3 files changed, 42 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 5193c3dfbce15..1e147bb2e560f 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2213,6 +2213,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> >  		       void *insn, int insn_len);
> >  void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
> >  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
> > +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > +			       u64 addr, unsigned long roots, bool gva_flush);
> >  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  			     u64 addr, unsigned long roots);
> >  void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index ac133abc9c173..f5e0d2c8f4bbe 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6158,8 +6158,8 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
> >  	write_unlock(&vcpu->kvm->mmu_lock);
> >  }
> >  
> > -static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > -				      u64 addr, unsigned long roots, bool gva_flush)
> > +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > +			       u64 addr, unsigned long roots, bool gva_flush)
> >  {
> >  	int i;
> >  
> > @@ -6185,6 +6185,7 @@ static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
> >  			kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
> >  	}
> >  }
> > +EXPORT_SYMBOL_GPL(__kvm_mmu_invalidate_addr);
> >  
> >  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  			     u64 addr, unsigned long roots)
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index a2d601cd4c283..9e29f87d3bd93 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2483,6 +2483,7 @@ static int clgi_interception(struct kvm_vcpu *vcpu)
> >  
> >  static int invlpga_interception(struct kvm_vcpu *vcpu)
> >  {
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> >  	gva_t gva = kvm_rax_read(vcpu);
> >  	u32 asid = kvm_rcx_read(vcpu);
> >  
> > @@ -2492,8 +2493,41 @@ static int invlpga_interception(struct kvm_vcpu *vcpu)
> >  
> >  	trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
> >  
> > -	/* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
> > -	kvm_mmu_invlpg(vcpu, gva);
> > +	/*
> > +	 * APM is silent about using INVLPGA to flush the host ASID (i.e. 0).
> > +	 * Do the logical thing and handle it like INVLPG.
> > +	 */
> > +	if (asid == 0) {
> > +		kvm_mmu_invlpg(vcpu, gva);
> > +		return kvm_skip_emulated_instruction(vcpu);
> > +	}
> > +
> > +	/*
> > +	 * Check if L1 specified the L2 ASID we are currently tracking. If it
> > +	 * isn't, do nothing as we have to handle the TLB flush when switching
> > +	 * to the new ASID anyway. APM mentions that INVLPGA is typically only
> > +	 * meaningful with shadow paging, so also do nothing if L1 is using
> > +	 * nested NPT.
> > +	 */
> > +	if (!nested_npt_enabled(svm) && asid == svm->nested.last_asid)
> > +		invlpga(gva, svm->nested.vmcb02.asid);
> 
> 
> Hi, 
> 
> IMHO we can't just NOP the INVLPGA because it is not useful in nested NPT case.
> 
> If I understand the APM correctly, the CPU will honor the INVLPGA
> request, even when NPT is enabled, and so KVM must do this as well.
> 
> It is not useful for the hypervisor because it needs GVA, which in case of NPT,
> the hypervisor won't usually track, but we can't completely rule out that some
> hypervisor uses this in some cases.

Yeah, I knew this was going to be a contention point; I was mainly waiting
to see what others think here.

I guess we can just map the ASID passed by L1 to the actual ASID we use
for L2 and execute the INVLPGA as-is with the gva passed by L1.
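
i.e. keep the invlpga but drop the nested NPT special case, something
like (untested):

        /*
         * L1's ASID maps to the single L2 ASID that KVM tracks, so
         * execute the flush even when L1 is using nested NPT.
         */
        if (asid == svm->nested.last_asid)
                invlpga(gva, svm->nested.vmcb02.asid);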

> 
> 
> Also, there is an out-of-order patch here: last_asid isn't declared yet.
> It is added in patch 10.

Good catch, I will fix that, thanks!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/13] KVM: nSVM: Flush both L1 and L2 ASIDs on KVM_REQ_TLB_FLUSH
  2025-03-01  1:58   ` Maxim Levitsky
@ 2025-03-03 22:06     ` Yosry Ahmed
  0 siblings, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 22:06 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 28, 2025 at 08:58:04PM -0500, Maxim Levitsky wrote:
> On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > KVM_REQ_TLB_FLUSH is used to flush all TLB entries for all contexts
> > (e.g. in kvm_flush_remote_tlbs()). Flush both L1 and L2 ASIDs in
> > svm_flush_tlb_all() to handle it appropriately.
> > 
> > This is currently not required as nested transitions do unconditional
> > TLB flushes, but this is a step toward eliminating that.
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/nested.c | 1 -
> >  arch/x86/kvm/svm/svm.c    | 4 +++-
> >  2 files changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 0e9b0592c1f83..0735177b95a1d 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -491,7 +491,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
> >  	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
> >  	 * things to fix before this can be conditional:
> >  	 *
> > -	 *  - Flush TLBs for both L1 and L2 remote TLB flush
> >  	 *  - Honor L1's request to flush an ASID on nested VMRUN
> >  	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
> >  	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 9e29f87d3bd93..8342c7eadbba8 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4044,7 +4044,9 @@ static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
> >  	if (WARN_ON_ONCE(svm_hv_is_enlightened_tlb_enabled(vcpu)))
> >  		hv_flush_remote_tlbs(vcpu->kvm);
> >  
> > -	svm_flush_tlb_asid(vcpu, svm->current_vmcb);
> > +	svm_flush_tlb_asid(vcpu, &svm->vmcb01);
> > +	if (svm->nested.initialized)
> > +		svm_flush_tlb_asid(vcpu, &svm->nested.vmcb02);
> >  }
> 
> This makes sense.
> 
> Note that this doesn't really flush the ASID used, but rather ensures
> that we will flush it on next entry via that vmcb (because a new asid
> will be picked, or because we set tlb_ctl in that vmcb).

Right, what I mean by 'flush' here is to set up the flush. For SVM all
flushes are done on VM-enter anyway.
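
i.e. svm_flush_tlb_asid() only arms the flush for the next VMRUN with
that VMCB, roughly (abridged sketch of the existing helper after the
rework in patch 2):

        if (static_cpu_has(X86_FEATURE_FLUSHBYASID))
                vmcb->ptr->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
        else
                /* pre_svm_run() will allocate (and thus flush) a new ASID */
                vmcb->asid_generation--;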

> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL
  2025-03-01  2:06   ` Maxim Levitsky
@ 2025-03-03 22:10     ` Yosry Ahmed
  2025-03-05  2:57       ` Maxim Levitsky
  0 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 22:10 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 28, 2025 at 09:06:18PM -0500, Maxim Levitsky wrote:
> On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > Handle L1's requests to flush L2's TLB through the TLB_CONTROL field of
> > VMCB12. This is currently redundant because a full flush is executed on
> > every nested transition, but is a step towards removing that.
> > 
> > TLB_CONTROL_FLUSH_ALL_ASID flushes all ASIDs from L1's perspective,
> > including its own, so do a guest TLB flush on both transitions. Never
> > propagate TLB_CONTROL_FLUSH_ALL_ASID from the guest to the actual VMCB,
> > as this gives the guest the power to flush the entire physical TLB
> > (including translations for the host and other VMs).
> > 
> > For other cases, the TLB flush is only done when entering L2. The nested
> > NPT MMU is also sync'd because TLB_CONTROL also flushes NPT
> > guest-physical mappings.
> > 
> > All TLB_CONTROL values can be handled by KVM regardless of FLUSHBYASID
> > support on the underlying CPU, so keep advertising FLUSHBYASID to the
> > guest unconditionally.
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/nested.c | 42 ++++++++++++++++++++++++++++++++-------
> >  arch/x86/kvm/svm/svm.c    |  6 +++---
> >  2 files changed, 38 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 0735177b95a1d..e2c59eb2907e8 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -484,19 +484,36 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
> >  
> >  static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
> >  {
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +
> >  	/* Handle pending Hyper-V TLB flush requests */
> >  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
> >  
> > +	/*
> > +	 * If L1 requested a TLB flush for L2, flush L2's TLB on nested entry
> > +	 * and sync the nested NPT MMU, as TLB_CONTROL also flushes NPT
> > +	 * guest-physical mappings. We technically only need to flush guest_mode
> > +	 * page tables.
> > +	 *
> > +	 * KVM_REQ_TLB_FLUSH_GUEST will flush L2's ASID even if the underlying
> > +	 * CPU does not support FLUSHBYASID (by assigning a new ASID), so we
> > +	 * can handle all TLB_CONTROL values from L1 regardless.
> > +	 *
> > +	 * Note that TLB_CONTROL_FLUSH_ASID_LOCAL is handled exactly like
> > +	 * TLB_CONTROL_FLUSH_ASID. We can technically flush fewer TLB entries,
> > +	 * but this would require significantly more complexity.
> > +	 */
> 
> I think it might make sense to note that we in essence support only one non-zero ASID
> in L1, the one that it picks for the nested guest.
> 
> 
> Thus, when asked to perform TLB_CONTROL_FLUSH_ALL_ASID,
> we only need to flush L2's real ASID and L1's ASID.

This is described by the comment in nested_svm_exit_tlb_flush(). Do you
mean that we should also mention that here?

I guess one way to make things clearer is to describe the behavior for
all values of TLB_CONTROL here, and in nested_svm_exit_tlb_flush() just
say /* see nested_svm_entry_tlb_flush() */. Would that improve things?
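
Something like this as the consolidated comment (rough sketch, wording
illustrative):

        /*
         *  - TLB_CONTROL_DO_NOTHING: no flush.
         *  - TLB_CONTROL_FLUSH_ASID / TLB_CONTROL_FLUSH_ASID_LOCAL: flush
         *    L2's ASID on nested entry and sync the nested NPT MMU.
         *  - TLB_CONTROL_FLUSH_ALL_ASID: as above, plus flush L1's ASID on
         *    nested exit. Never propagated to the actual VMCB, as that
         *    would flush the entire physical TLB.
         */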

> 
> 
> > +	if (svm->nested.ctl.tlb_ctl != TLB_CONTROL_DO_NOTHING) {
> > +		if (nested_npt_enabled(svm))
> > +			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> > +	}
> > +
> >  	/*
> >  	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
> >  	 * things to fix before this can be conditional:
> >  	 *
> > -	 *  - Honor L1's request to flush an ASID on nested VMRUN
> > -	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
> >  	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> > -	 *
> > -	 * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
> > -	 *     NPT guest-physical mappings on VMRUN.
> >  	 */
> >  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> >  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > @@ -504,9 +521,18 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
> >  
> >  static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> >  {
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +
> >  	/* Handle pending Hyper-V TLB flush requests */
> >  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
> >  
> > +	/*
> > +	 * If L1 had requested a full TLB flush when entering L2, also flush its
> > +	 * TLB entries when exiting back to L1.
> > +	 */
> > +	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
> > +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> 
> Makes sense.
> 
> > +
> >  	/* See nested_svm_entry_tlb_flush() */
> >  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> >  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > @@ -825,7 +851,8 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
> >  	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
> >  
> >  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > -	nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
> > +	nested_vmcb02_prepare_control(svm, vmcb12->save.rip,
> > +				      vmcb12->save.cs.base);
> >  	nested_vmcb02_prepare_save(svm, vmcb12);
> >  
> >  	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
> > @@ -1764,7 +1791,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> >  	nested_copy_vmcb_control_to_cache(svm, ctl);
> >  
> >  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > -	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base);
> > +	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip,
> > +				      svm->vmcb->save.cs.base);
> >  
> >  	/*
> >  	 * While the nested guest CR3 is already checked and set by
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 8342c7eadbba8..5e7b1c9bfa605 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -5242,9 +5242,9 @@ static __init void svm_set_cpu_caps(void)
> >  		kvm_cpu_cap_set(X86_FEATURE_VMCBCLEAN);
> >  
> >  		/*
> > -		 * KVM currently flushes TLBs on *every* nested SVM transition,
> > -		 * and so for all intents and purposes KVM supports flushing by
> > -		 * ASID, i.e. KVM is guaranteed to honor every L1 ASID flush.
> > +		 * KVM currently handles all TLB_CONTROL values set by L1, even
> > +		 * if the underlying CPU does not. See
> > +		 * nested_svm_transition_tlb_flush().
> >  		 */
> >  		kvm_cpu_cap_set(X86_FEATURE_FLUSHBYASID);
> >  
> 
> Patch looks OK, but I can't be 100% sure about this.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Best regards,
> 	Maxim Levitsky
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry
  2025-03-01  2:17   ` Maxim Levitsky
@ 2025-03-03 22:14     ` Yosry Ahmed
  2025-03-05  3:01       ` Maxim Levitsky
  0 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 22:14 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 28, 2025 at 09:17:52PM -0500, Maxim Levitsky wrote:
> On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > TLB_CONTROL is reset to TLB_CONTROL_DO_NOTHING on nested transitions to
> > L2. This is unnecessary because it should always be
> > TLB_CONTROL_DO_NOTHING at this point.
> > 
> > The flow for setting TLB_CONTROL is as follows:
> > 1. In vcpu_enter_guest(), servicing a TLB flush request may set it to
> > TLB_CONTROL_FLUSH_ASID in svm_flush_tlb_asid().
> > 2. In svm_vcpu_run() -> pre_svm_run(), it may get upgraded to
> > TLB_CONTROL_FLUSH_ALL_ASID when assigning a new ASID.
> > 3. In svm_vcpu_run(), it gets reset to TLB_CONTROL_DO_NOTHING after the
> > guest is run.
> > 
> > Hence, TLB_CONTROL is reset after each run and there is no need to do it
> > again on every nested transition to L2.
> > 
> > There is a TODO in nested_svm_transition_tlb_flush() about this reset
> > crushing pending TLB flushes. Remove it, as the reset is not really
> > crushing anything as explained above.
> 
> I am not sure that we don't crush a pending tlb request: 
> 
> svm_flush_tlb_asid can also be called by KVM_REQ_TLB_FLUSH
> and set the flush request in both vmcbs, thus later the nested_svm_exit_tlb_flush
> can crush this request.

How so?

nested_svm_exit_tlb_flush() makes a KVM_REQ_TLB_FLUSH_GUEST request.
svm_flush_tlb_asid() is called when servicing KVM_REQ_TLB_* requests.

So svm_flush_tlb_asid() does not make a request in the sense of
KVM_REQ_*, it sets TLB_CONTROL or invalidates the ASID, which can
more-or-less be described as "requesting" a TLB flush on VM-enter, but
is not the same thing as KVM_REQ_TLB_FLUSH.

So I am not sure there are any requests being crushed here.
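
For reference, the "reset" in question is just the post-run clearing
(abridged from svm_vcpu_run()):

        /* The flush, if any, was consumed by the VMRUN that just completed. */
        svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
        vmcb_mark_all_clean(svm->vmcb);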

> 
> But the patch itself does look OK to me, although I might be mistaken.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Thanks!

> 
> 
> Best regards,
> 	Maxim Levitsky

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions
  2025-03-01  2:20   ` Maxim Levitsky
@ 2025-03-03 22:18     ` Yosry Ahmed
  2025-03-05  3:03       ` Maxim Levitsky
  0 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 22:18 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 28, 2025 at 09:20:18PM -0500, Maxim Levitsky wrote:
> On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > KVM does not track TLB flush requests for L1 vs. L2. Hence, service
> > local flushes that target the current context before switching to a new
> > one. Since ASIDs are tracked per-VMCB, service the flushes before every
> > VMCB switch.
> > 
> > This is conceptually similar to how nVMX calls
> > kvm_service_local_tlb_flush_requests() in
> > nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(), with the
> > following differences:
> > 
> > 1. nVMX tracks the current VPID based on is_guest_mode(), so local TLB
> >    flushes are serviced before enter_guest_mode() and
> >    leave_guest_mode(). On the other hand, nSVM tracks the current ASID
> >    based on the current VMCB, so the TLB flushes are serviced before an
> >    VMCB switch.
> > 
> > 2. nVMX only enters and leaves guest mode in
> >    nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(). Other paths
> >    like vmx_set_nested_state() and vmx_leave_nested() call into these
> >    two functions. On the other hand, nSVM open codes the switch in
> >    functions like svm_set_nested_state() and svm_leave_nested(), so
> >    servicing the flush in svm_switch_vmcb() is probably most reliable.
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/svm.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 5e7b1c9bfa605..6daa7efa9262b 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -1421,6 +1421,12 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >  
> >  void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
> >  {
> > +	/*
> > +	 * ASIDs are tracked per-VMCB. Perform any pending TLB flushes for the
> > +	 * current VMCB before switching to a new one.
> > +	 */
> > +	kvm_service_local_tlb_flush_requests(&svm->vcpu);
> > +
> >  	svm->current_vmcb = target_vmcb;
> >  	svm->vmcb = target_vmcb->ptr;
> >  }
> 
> 
> Note that another difference between SVM and VMX is that this code will only set tlb_ctl
> in the current vmcb; the actual flush can happen much later, when we do VM entry with this vmcb,
> e.g. if we are now in L2, the flush will happen when we enter L2 again.

Right, but I think the internal implementation of the TLB flushes is not
relevant in this specific instance. Do you think it would be useful to
mention that here?

If we were to document the difference in TLB flush handling between VMX
and SVM I think a better place would be at kvm_vcpu_flush_tlb_*(), or
maybe in kvm_host.h where the vendor callbacks are defined? Not sure.

> 
> I think that this is correct but I might be mistaken.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on nested transitions
  2025-03-01  2:21   ` Maxim Levitsky
@ 2025-03-03 22:21     ` Yosry Ahmed
  2025-03-05  3:14       ` Maxim Levitsky
  0 siblings, 1 reply; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-03 22:21 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 28, 2025 at 09:21:54PM -0500, Maxim Levitsky wrote:
> On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > Now that nested TLB flushes are properly tracked with a well-maintained
> > separate ASID for L2 and proper handling of L1's TLB flush requests,
> > drop the unconditional flushes and syncs on nested transitions.
> > 
> > On a Milan machine, an L1 and L2 guests were booted, both with a single
> > vCPU, and pinned to a single physical CPU to maximize TLB collisions. In
> > this setup, the cpuid_rate microbenchmark [1] showed the following
> > changes with this patch:
> > 
> > +--------+--------+-------------------+----------------------+
> > | L0     | L1     | cpuid_rate (base) | cpuid_rate (patched) |
> > +========+========+===================+======================+
> > | NPT    | NPT    | 256621            | 301113 (+17.3%)      |
> > | NPT    | Shadow | 180017            | 203347 (+12.96%)     |
> > | Shadow | Shadow | 177006            | 189150 (+6.86%)      |
> > +--------+--------+-------------------+----------------------+
> > 
> > [1]https://lore.kernel.org/kvm/20231109180646.2963718-1-khorenko@virtuozzo.com/
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/nested.c | 7 -------
> >  1 file changed, 7 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 8e40ff21f7353..45a187d4c23d1 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -512,9 +512,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
> >  		svm->nested.last_asid = svm->nested.ctl.asid;
> >  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> >  	}
> > -	/* TODO: optimize unconditional TLB flush/MMU sync */
> > -	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > -	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> >  }
> >  
> >  static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> > @@ -530,10 +527,6 @@ static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> >  	 */
> >  	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
> >  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> > -
> > -	/* TODO: optimize unconditional TLB flush/MMU sync */
> > -	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > -	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> >  }
> >  
> >  /*
> 
> 
> Assuming that all previous patches are correct this one should work as well.
> 
> However, only very heavy stress testing, including hyperv, Windows guests
> of various types, etc., can give me confidence that there is no ugly bug
> lurking somewhere.

I tried booting an L2 and running some workloads like netperf in there.
I also tried booting an L3.

I am planning to try and run some testing with a windows L2 guest. I am
assuming this exercises the hyper-V emulation in L1, which could be
interesting.

I am not sure if I will be able to test more scenarios though,
especially Windows as an L1 (and something else as an L2).

Let me know if you have something specific in mind.

> 
> TLB management can be very tricky, so I can't be 100% sure that I haven't missed something.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 02/13] KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB
  2025-03-03 21:58     ` Yosry Ahmed
@ 2025-03-05  2:52       ` Maxim Levitsky
  0 siblings, 0 replies; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-05  2:52 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, kvm,
	linux-kernel

On Mon, 2025-03-03 at 21:58 +0000, Yosry Ahmed wrote:
> On Fri, Feb 28, 2025 at 08:29:34PM -0500, Maxim Levitsky wrote:
> > On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > > svm_flush_tlb_asid() currently operates on the current VMCB. In
> > > preparation for properly tracking TLB flushes for L1 and L2 ASIDs,
> > > refactor it to work on a given VMCB. All existing callers pass the
> > > current VMCB.
> > > 
> > > Create a svm_flush_tlb_guest() wrapper to use as the flush_tlb_guest()
> > > callback.
> > > 
> > > kvm_hv_vcpu_purge_flush_tlb() is only called when the current VMCB is
> > > passed to maintain current behavior.
> > > 
> > > No functional change intended.
> > > 
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > >  arch/x86/kvm/svm/svm.c | 25 ++++++++++++++++++-------
> > >  1 file changed, 18 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 08340ae57777b..2108b48ba4959 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -3954,7 +3954,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
> > >  	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> > >  }
> > >  
> > > -static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
> > > +static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu, struct kvm_vmcb_info *vmcb)
> > >  {
> > >  	struct vcpu_svm *svm = to_svm(vcpu);
> > >  
> > > @@ -3963,7 +3963,8 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
> > >  	 * A TLB flush for the current ASID flushes both "host" and "guest" TLB
> > >  	 * entries, and thus is a superset of Hyper-V's fine grained flushing.
> > >  	 */
> > > -	kvm_hv_vcpu_purge_flush_tlb(vcpu);
> > > +	if (vmcb == svm->current_vmcb)
> > > +		kvm_hv_vcpu_purge_flush_tlb(vcpu);
> > 
> > This is hyperv PV feature that should be looked upon very carefully.
> > 
> > To recap, 
> > each vCPU has 2 queues of pending TLB flush requests that target only small ranges of
> > memory pages. 
> 
> Thanks for pointing this out, I missed this.
> 
> > One is for L1 and one for L2, because now KVM supports a mode where L2
> > can ask L0 to do a tlb flush on its behalf, and KVM will figure out to which L1 vCPUs
> > to send this flush request.
> > 
> > Requests arrive from other vCPUs.
> > 
> > Here we purge the TLB request queue because we flushed a super-set of the requests,
> > which used to contain both L1 and L2 TLB, but soon that won't be true.
> > 
> > So I think it might make sense to also add vmcb to kvm_hv_vcpu_purge_flush_tlb, and then
> > depending on whether it is vmcb01 or vmcb02, purge the correct queue.
> > I don't know if this is a theoretical or an actual bug, but it is better to be safe IMHO.
> 
> But I think we are already purging the right queue here. We purge the
> TLB flush requests only if we are flushing the current VMCB. Within
> kvm_hv_vcpu_purge_flush_tlb(), we choose the queue based on
> is_guest_mode(vcpu).
> 
> svm_flush_tlb_asid() is called when servicing a TLB flush request, at
> which point IIUC the current VMCB and whether the vCPU is in guest mode
> should be in sync. So we will be correctly purging the L1 or L2 queue
> based on the current VMCB.

Yes, I also think so, but to harden this code against a potential bug, IMHO
it makes sense to ensure that svm_flush_tlb_asid works only on a given
vmcb without any hidden assumptions.

> 
> That being said, it's a bit confusing that svm_flush_tlb_asid() uses the
> > VMCB to differentiate L1 and L2, while kvm_hv_vcpu_purge_flush_tlb()
> uses is_guest_mode(). We also miss the opportunity to purge both queues
> when called from svm_flush_tlb_all().

Yes, I noticed that too.

> 
> However, we cannot pass the VMCB to kvm_hv_vcpu_purge_flush_tlb() as it
> is also called from common code. So I think we can make
> kvm_hv_vcpu_purge_flush_tlb() take is_guest_mode as a parameter and pass
> it here based on which VMCB is passed in.



> 
> WDYT?
> 


Looking at this again, I see that kvm_hv_vcpu_purge_flush_tlb() can't really work
on a vmcb, so maybe the better solution is to remove the call to
kvm_hv_vcpu_purge_flush_tlb() from svm_flush_tlb_asid_vmcb entirely
and instead let the caller call both svm_flush_tlb_asid() and kvm_hv_vcpu_purge_flush_tlb()?

This might lead to some code duplication, but it will emphasize that svm_flush_tlb_asid_vmcb
can work on any vmcb regardless of which one is loaded.
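
e.g. (sketch, reusing the svm_flush_tlb_guest() wrapper from patch 2):

static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
        struct vcpu_svm *svm = to_svm(vcpu);

        /* Flush the current ASID and purge the matching Hyper-V queue. */
        svm_flush_tlb_asid(vcpu, svm->current_vmcb);
        kvm_hv_vcpu_purge_flush_tlb(vcpu);
}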

As long as it works though, I won't strongly object to whatever approach works.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly
  2025-03-03 22:05     ` Yosry Ahmed
@ 2025-03-05  2:54       ` Maxim Levitsky
  2025-03-05  6:20         ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-05  2:54 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Mon, 2025-03-03 at 22:05 +0000, Yosry Ahmed wrote:
> On Fri, Feb 28, 2025 at 08:55:18PM -0500, Maxim Levitsky wrote:
> > On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > > Currently, INVLPGA interception handles it like INVLPG, which flushes
> > > L1's TLB translations for the address. It was implemented in this way
> > > because L1 and L2 shared an ASID. Now, L1 and L2 have separate ASIDs. It
> > > is still harmless to flush L1's translations, but it's only correct
> > > because all translations are flushed on nested transitions anyway.
> > > 
> > > In preparation for stopping unconditional flushes on nested transitions,
> > > handle INVLPGA interception properly. If L1 specified zero as the ASID,
> > > this is equivalent to INVLPG, so handle it as such. Otherwise, use
> > > INVLPGA to flush the translations of the appropriate ASID tracked by
> > > KVM, if any. Sync the shadow MMU as well, as L1 invalidated L2's
> > > mappings.
> > > 
> > > Opportunistically update svm_flush_tlb_gva() to use
> > > svm->current_vmcb->asid instead of svm->vmcb->control.asid for
> > > consistency. The two should always be in sync except when KVM allocates
> > > a new ASID in pre_svm_run(), and they are shortly brought back in sync
> > > in svm_vcpu_run(). However, if future changes add more code paths where
> > > KVM allocates a new ASID, flushing the potentially old ASID in
> > > svm->vmcb->control.asid would be unnecessary overhead (although probably
> > > not much different from flushing the newly allocated ASID).
> > > 
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  2 ++
> > >  arch/x86/kvm/mmu/mmu.c          |  5 +++--
> > >  arch/x86/kvm/svm/svm.c          | 40 ++++++++++++++++++++++++++++++---
> > >  3 files changed, 42 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 5193c3dfbce15..1e147bb2e560f 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -2213,6 +2213,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> > >  		       void *insn, int insn_len);
> > >  void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
> > >  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
> > > +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > +			       u64 addr, unsigned long roots, bool gva_flush);
> > >  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > >  			     u64 addr, unsigned long roots);
> > >  void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index ac133abc9c173..f5e0d2c8f4bbe 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -6158,8 +6158,8 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
> > >  	write_unlock(&vcpu->kvm->mmu_lock);
> > >  }
> > >  
> > > -static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > -				      u64 addr, unsigned long roots, bool gva_flush)
> > > +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > +			       u64 addr, unsigned long roots, bool gva_flush)
> > >  {
> > >  	int i;
> > >  
> > > @@ -6185,6 +6185,7 @@ static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
> > >  			kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
> > >  	}
> > >  }
> > > +EXPORT_SYMBOL_GPL(__kvm_mmu_invalidate_addr);
> > >  
> > >  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > >  			     u64 addr, unsigned long roots)
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index a2d601cd4c283..9e29f87d3bd93 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -2483,6 +2483,7 @@ static int clgi_interception(struct kvm_vcpu *vcpu)
> > >  
> > >  static int invlpga_interception(struct kvm_vcpu *vcpu)
> > >  {
> > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > >  	gva_t gva = kvm_rax_read(vcpu);
> > >  	u32 asid = kvm_rcx_read(vcpu);
> > >  
> > > @@ -2492,8 +2493,41 @@ static int invlpga_interception(struct kvm_vcpu *vcpu)
> > >  
> > >  	trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
> > >  
> > > -	/* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
> > > -	kvm_mmu_invlpg(vcpu, gva);
> > > +	/*
> > > +	 * APM is silent about using INVLPGA to flush the host ASID (i.e. 0).
> > > +	 * Do the logical thing and handle it like INVLPG.
> > > +	 */
> > > +	if (asid == 0) {
> > > +		kvm_mmu_invlpg(vcpu, gva);
> > > +		return kvm_skip_emulated_instruction(vcpu);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Check if L1 specified the L2 ASID we are currently tracking. If it
> > > +	 * isn't, do nothing as we have to handle the TLB flush when switching
> > > +	 * to the new ASID anyway. APM mentions that INVLPGA is typically only
> > > +	 * meaningful with shadow paging, so also do nothing if L1 is using
> > > +	 * nested NPT.
> > > +	 */
> > > +	if (!nested_npt_enabled(svm) && asid == svm->nested.last_asid)
> > > +		invlpga(gva, svm->nested.vmcb02.asid);
> > 
> > Hi, 
> > 
> > IMHO we can't just NOP the INVLPGA because it is not useful in nested NPT case.
> > 
> > If I understand the APM correctly, the CPU will honor the INVLPGA
> > request, even when NPT is enabled, and so KVM must do this as well.
> > 
> > It is not useful for the hypervisor because it needs GVA, which in case of NPT,
> > the hypervisor won't usually track, but we can't completely rule out that some
> > hypervisor uses this in some cases.
> 
> Yeah, I knew this was going to be a contention point; I was mainly waiting
> to see what others think here.
> 
> I guess we can just map the ASID passed by L1 to the actual ASID we use
> for L2 and execute the INVLPGA as-is with the gva passed by L1.


If I understand correctly, we in essence support only 2 nested ASIDs: 0 and the one that
L1 used last time. Anything else will get flushed on next VM entry.
 
So, if I understand this correctly all we need to do is to drop the 
'nested_npt_enabled(svm)' check above, and it should work.


Best regards,
	Maxim Levitsky

> 
> > 
> > Also, there is an out-of-order patch here: last_asid isn't declared yet.
> > It is added in patch 10.
> 
> Good catch, I will fix that, thanks!
> 



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL
  2025-03-03 22:10     ` Yosry Ahmed
@ 2025-03-05  2:57       ` Maxim Levitsky
  0 siblings, 0 replies; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-05  2:57 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Mon, 2025-03-03 at 22:10 +0000, Yosry Ahmed wrote:
> On Fri, Feb 28, 2025 at 09:06:18PM -0500, Maxim Levitsky wrote:
> > On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > > Handle L1's requests to flush L2's TLB through the TLB_CONTROL field of
> > > VMCB12. This is currently redundant because a full flush is executed on
> > > every nested transition, but is a step towards removing that.
> > > 
> > > TLB_CONTROL_FLUSH_ALL_ASID flushes all ASIDs from L1's perspective,
> > > including its own, so do a guest TLB flush on both transitions. Never
> > > propagate TLB_CONTROL_FLUSH_ALL_ASID from the guest to the actual VMCB,
> > > as this gives the guest the power to flush the entire physical TLB
> > > (including translations for the host and other VMs).
> > > 
> > > For other cases, the TLB flush is only done when entering L2. The nested
> > > NPT MMU is also sync'd because TLB_CONTROL also flushes NPT
> > > guest-physical mappings.
> > > 
> > > All TLB_CONTROL values can be handled by KVM regardless of FLUSHBYASID
> > > support on the underlying CPU, so keep advertising FLUSHBYASID to the
> > > guest unconditionally.
> > > 
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > >  arch/x86/kvm/svm/nested.c | 42 ++++++++++++++++++++++++++++++++-------
> > >  arch/x86/kvm/svm/svm.c    |  6 +++---
> > >  2 files changed, 38 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index 0735177b95a1d..e2c59eb2907e8 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -484,19 +484,36 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
> > >  
> > >  static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
> > >  {
> > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > +
> > >  	/* Handle pending Hyper-V TLB flush requests */
> > >  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
> > >  
> > > +	/*
> > > +	 * If L1 requested a TLB flush for L2, flush L2's TLB on nested entry
> > > +	 * and sync the nested NPT MMU, as TLB_CONTROL also flushes NPT
> > > +	 * guest-physical mappings. We technically only need to flush guest_mode
> > > +	 * page tables.
> > > +	 *
> > > +	 * KVM_REQ_TLB_FLUSH_GUEST will flush L2's ASID even if the underlying
> > > +	 * CPU does not support FLUSHBYASID (by assigning a new ASID), so we
> > > +	 * can handle all TLB_CONTROL values from L1 regardless.
> > > +	 *
> > > +	 * Note that TLB_CONTROL_FLUSH_ASID_LOCAL is handled exactly like
> > > +	 * TLB_CONTROL_FLUSH_ASID. We can technically flush fewer TLB entries,
> > > +	 * but this would require significantly more complexity.
> > > +	 */
> > 
> > I think it might make sense to note that we in essence support only one non-zero ASID
> > in L1, the one that it picks for the nested guest.
> > 
> > 
> > Thus, when asked to perform TLB_CONTROL_FLUSH_ALL_ASID,
> > we only need to flush L2's real ASID and L1's ASID.
> 
> This is described by the comment in nested_svm_exit_tlb_flush(). Do you
> mean that we should also mention that here?
> 
> I guess one way to make things clearer is to describe the behavior for
> all values of TLB_CONTROL here, and in nested_svm_exit_tlb_flush() just
> say /* see nested_svm_entry_tlb_flush() */. Would that improve things?

I guess that this will be a bit better.
This was just a tiny nitpick though, just something I wondered about a bit when
reviewing.


Best regards,
	Maxim Levitsky



> 
> > 
> > > +	if (svm->nested.ctl.tlb_ctl != TLB_CONTROL_DO_NOTHING) {
> > > +		if (nested_npt_enabled(svm))
> > > +			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > > +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> > > +	}
> > > +
> > >  	/*
> > >  	 * TODO: optimize unconditional TLB flush/MMU sync.  A partial list of
> > >  	 * things to fix before this can be conditional:
> > >  	 *
> > > -	 *  - Honor L1's request to flush an ASID on nested VMRUN
> > > -	 *  - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
> > >  	 *  - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
> > > -	 *
> > > -	 * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
> > > -	 *     NPT guest-physical mappings on VMRUN.
> > >  	 */
> > >  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > >  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > > @@ -504,9 +521,18 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
> > >  
> > >  static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> > >  {
> > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > +
> > >  	/* Handle pending Hyper-V TLB flush requests */
> > >  	kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
> > >  
> > > +	/*
> > > +	 * If L1 had requested a full TLB flush when entering L2, also flush its
> > > +	 * TLB entries when exiting back to L1.
> > > +	 */
> > > +	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
> > > +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> > 
> > Makes sense.
> > 
> > > +
> > >  	/* See nested_svm_entry_tlb_flush() */
> > >  	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > >  	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > > @@ -825,7 +851,8 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
> > >  	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
> > >  
> > >  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > > -	nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
> > > +	nested_vmcb02_prepare_control(svm, vmcb12->save.rip,
> > > +				      vmcb12->save.cs.base);
> > >  	nested_vmcb02_prepare_save(svm, vmcb12);
> > >  
> > >  	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
> > > @@ -1764,7 +1791,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> > >  	nested_copy_vmcb_control_to_cache(svm, ctl);
> > >  
> > >  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > > -	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base);
> > > +	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip,
> > > +				      svm->vmcb->save.cs.base);
> > >  
> > >  	/*
> > >  	 * While the nested guest CR3 is already checked and set by
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 8342c7eadbba8..5e7b1c9bfa605 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -5242,9 +5242,9 @@ static __init void svm_set_cpu_caps(void)
> > >  		kvm_cpu_cap_set(X86_FEATURE_VMCBCLEAN);
> > >  
> > >  		/*
> > > -		 * KVM currently flushes TLBs on *every* nested SVM transition,
> > > -		 * and so for all intents and purposes KVM supports flushing by
> > > -		 * ASID, i.e. KVM is guaranteed to honor every L1 ASID flush.
> > > +		 * KVM currently handles all TLB_CONTROL values set by L1, even
> > > +		 * if the underlying CPU does not. See
> > > +		 * nested_svm_transition_tlb_flush().
> > >  		 */
> > >  		kvm_cpu_cap_set(X86_FEATURE_FLUSHBYASID);
> > >  
> > 
> > Patch looks OK, but I can't be 100% sure about this.
> > 
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> > 



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry
  2025-03-03 22:14     ` Yosry Ahmed
@ 2025-03-05  3:01       ` Maxim Levitsky
  2025-03-05  6:34         ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-05  3:01 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Mon, 2025-03-03 at 22:14 +0000, Yosry Ahmed wrote:
> On Fri, Feb 28, 2025 at 09:17:52PM -0500, Maxim Levitsky wrote:
> > On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > > TLB_CONTROL is reset to TLB_CONTROL_DO_NOTHING on nested transitions to
> > > L2. This is unnecessary because it should always be
> > > TLB_CONTROL_DO_NOTHING at this point.
> > > 
> > > The flow for setting TLB_CONTROL is as follows:
> > > 1. In vcpu_enter_guest(), servicing a TLB flush request may set it to
> > > TLB_CONTROL_FLUSH_ASID in svm_flush_tlb_asid().
> > > 2. In svm_vcpu_run() -> pre_svm_run(), it may get upgraded to
> > > TLB_CONTROL_FLUSH_ALL_ASID when assigning a new ASID.
> > > 3. In svm_vcpu_run(), it gets reset to TLB_CONTROL_DO_NOTHING after the
> > > guest is run.
> > > 
> > > Hence, TLB_CONTROL is reset after each run and there is no need to do it
> > > again on every nested transition to L2.
> > > 
> > > There is a TODO in nested_svm_transition_tlb_flush() about this reset
> > > crushing pending TLB flushes. Remove it, as the reset is not really
> > > crushing anything as explained above.
> > 
> > I am not sure that we don't crush a pending tlb request: 
> > 
> > svm_flush_tlb_asid can also be called by KVM_REQ_TLB_FLUSH
> > and set the flush request in both vmcbs, thus later the nested_svm_exit_tlb_flush
> > can crush this request.
> 
> How so?
> 
> nested_svm_exit_tlb_flush() makes a KVM_REQ_TLB_FLUSH_GUEST request.
> svm_flush_tlb_asid() is called when servicing KVM_REQ_TLB_* requests.

I am probably missing something but:

> Suppose KVM_REQ_TLB_FLUSH is raised and then processed while an ordinary L1 entry is happening,
but nested state is allocated.

(KVM_REQ_TLB_FLUSH can be raised anytime MMU wants a 'big hammer flush everything')

In this case svm_flush_tlb_all will call svm_flush_tlb_asid on both vmcbs (see patch 8),
and that will set TLB_CONTROL_FLUSH_ASID in both vmcbs.
In particular it will be set in vmcb02.

> Later, maybe even hours later in theory, L1 issues VMRUN; we reach nested_vmcb02_prepare_control
and crush the value (TLB_CONTROL_FLUSH_ASID) set in vmcb02.

I think that this is what the removed comment referred to.


Best regards,
	Maxim Levitsky

> 
> So svm_flush_tlb_asid() does not make a request in the sense of
> KVM_REQ_*, it sets TLB_CONTROL or invalidates the ASID, which can
> more-or-less be described as "requesting" a TLB flush on VM-enter, but
> is not the same thing as KVM_REQ_TLB_FLUSH.
> 
> So I am not sure there are any requests being crushed here.
> 
> > But the patch itself does look OK to me, although I might be mistaken.
> > 
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Thanks!
> 
> > 
> > Best regards,
> > 	Maxim Levitsky



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions
  2025-03-03 22:18     ` Yosry Ahmed
@ 2025-03-05  3:03       ` Maxim Levitsky
  2025-03-05  6:37         ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-05  3:03 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Mon, 2025-03-03 at 22:18 +0000, Yosry Ahmed wrote:
> On Fri, Feb 28, 2025 at 09:20:18PM -0500, Maxim Levitsky wrote:
> > On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > > KVM does not track TLB flush requests for L1 vs. L2. Hence, service
> > > local flushes that target the current context before switching to a new
> > > one. Since ASIDs are tracked per-VMCB, service the flushes before every
> > > VMCB switch.
> > > 
> > > This is conceptually similar to how nVMX calls
> > > kvm_service_local_tlb_flush_requests() in
> > > nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(), with the
> > > following differences:
> > > 
> > > 1. nVMX tracks the current VPID based on is_guest_mode(), so local TLB
> > >    flushes are serviced before enter_guest_mode() and
> > >    leave_guest_mode(). On the other hand, nSVM tracks the current ASID
> > >    based on the current VMCB, so the TLB flushes are serviced before a
> > >    VMCB switch.
> > > 
> > > 2. nVMX only enters and leaves guest mode in
> > >    nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(). Other paths
> > >    like vmx_set_nested_state() and vmx_leave_nested() call into these
> > >    two functions. On the other hand, nSVM open codes the switch in
> > >    functions like svm_set_nested_state() and svm_leave_nested(), so
> > >    servicing the flush in svm_switch_vmcb() is probably most reliable.
> > > 
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > >  arch/x86/kvm/svm/svm.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 5e7b1c9bfa605..6daa7efa9262b 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -1421,6 +1421,12 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > >  
> > >  void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
> > >  {
> > > +	/*
> > > +	 * ASIDs are tracked per-VMCB. Perform any pending TLB flushes for the
> > > +	 * current VMCB before switching to a new one.
> > > +	 */
> > > +	kvm_service_local_tlb_flush_requests(&svm->vcpu);
> > > +
> > >  	svm->current_vmcb = target_vmcb;
> > >  	svm->vmcb = target_vmcb->ptr;
> > >  }
> > 
> > Note that another difference between SVM and VMX is that this code will only set tlb_ctl
> > in the current vmcb; the actual flush can happen much later, when we do a VM entry with this vmcb,
> > e.g. if we are now in L2, the flush will happen when we enter L2 again.
> 
> Right, but I think the internal implementation of the TLB flushes is not
> relevant in this specific instance. Do you think it would be useful to
> mention that here?

> I am not sure, to be honest. I just mentioned this because in theory there can be a difference:
> we might think that we flushed the TLB while in fact we haven't yet.

I am trying my best to think about what hidden problems might lurk around and surface later.

Not directly related to the above, but I am thinking:
I really like the way SVM flush works because it ensures that redundant flushes don't cost anything.

I wonder if we can make VMX code emulate this,
by having an emulated 'tlb_control' field and then doing the flush (INVEPT) on VM entry.


Best regards,
	Maxim Levitsky

> 
> If we were to document the difference in TLB flush handling between VMX
> and SVM I think a better place would be at kvm_vcpu_flush_tlb_*(), or
> maybe in kvm_host.h where the vendor callbacks are defined? Not sure.
> 
> > I think that this is correct but I might be mistaken.
> > 
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Thanks!
> 



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on nested transitions
  2025-03-03 22:21     ` Yosry Ahmed
@ 2025-03-05  3:14       ` Maxim Levitsky
  2025-03-05  6:45         ` Yosry Ahmed
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2025-03-05  3:14 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Mon, 2025-03-03 at 22:21 +0000, Yosry Ahmed wrote:
> On Fri, Feb 28, 2025 at 09:21:54PM -0500, Maxim Levitsky wrote:
> > On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > > Now that nested TLB flushes are properly tracked with a well-maintained
> > > separate ASID for L2 and proper handling of L1's TLB flush requests,
> > > drop the unconditional flushes and syncs on nested transitions.
> > > 
> > > On a Milan machine, L1 and L2 guests were booted, each with a single
> > > vCPU and pinned to a single physical CPU to maximize TLB collisions. In
> > > this setup, the cpuid_rate microbenchmark [1] showed the following
> > > changes with this patch:
> > > 
> > > +--------+--------+-------------------+----------------------+
> > > | L0     | L1     | cpuid_rate (base) | cpuid_rate (patched) |
> > > +========+========+===================+======================+
> > > | NPT    | NPT    | 256621            | 301113 (+17.3%)      |
> > > | NPT    | Shadow | 180017            | 203347 (+12.96%)     |
> > > | Shadow | Shadow | 177006            | 189150 (+6.86%)      |
> > > +--------+--------+-------------------+----------------------+
> > > 
> > > [1]https://lore.kernel.org/kvm/20231109180646.2963718-1-khorenko@virtuozzo.com/
> > > 
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > >  arch/x86/kvm/svm/nested.c | 7 -------
> > >  1 file changed, 7 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index 8e40ff21f7353..45a187d4c23d1 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -512,9 +512,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
> > >  		svm->nested.last_asid = svm->nested.ctl.asid;
> > >  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> > >  	}
> > > -	/* TODO: optimize unconditional TLB flush/MMU sync */
> > > -	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > > -	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > >  }
> > >  
> > >  static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> > > @@ -530,10 +527,6 @@ static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> > >  	 */
> > >  	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
> > >  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> > > -
> > > -	/* TODO: optimize unconditional TLB flush/MMU sync */
> > > -	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > > -	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > >  }
> > >  
> > >  /*
> > 
> > Assuming that all previous patches are correct, this one should work as well.
> > 
> > However, only very heavy stress testing, including Hyper-V, Windows guests
> > of various types, etc., can give me confidence that there is no ugly bug lurking
> > somewhere.
> 
> I tried booting an L2 and running some workloads like netperf in there.
> I also tried booting an L3.
> 
> I am planning to try and run some testing with a Windows L2 guest. I am
> assuming this exercises the Hyper-V emulation in L1, which could be
> interesting.
> 
> I am not sure if I will be able to test more scenarios though,
> especially Windows as an L1 (and something else as an L2).
> 
> Let me know if you have something specific in mind.


KVM can run itself 'under' Hyper-V (although in this case, when it runs a guest,
the guest will be L3 overall, so not really something supported, but still something that might
reveal bugs).
In this case KVM/L1 can take advantage of L0's TLB flush interface.

Stress testing L3s can also be nice, although in this case, from L0's POV, it doesn't see L3 at all.
Instead, it sees that L1 runs two different L2s back to back, so the current code will
likely flush everything all the time.


The direct TLB flush that Hyper-V does, especially from L2 to L0, should also be tested;
it's a relatively new feature, so we need to check that L2 actually uses it.

KVM also has its own TLB flush paravirtualization, which can in theory interfere with this.


It's also nice to run a Hyper-V-enabled Windows as a KVM guest, and run a guest in it (can be Windows or Linux or anything else).
Such a guest will run two L2 VMs: Windows itself and the VM you run inside.


You can also try other L1s, like VirtualBox or VMware, running in a Windows or Linux L1,
which can themselves run a Windows or Linux L2.

You can also test other OSes, like the BSDs, as L1; they might have a different TLB access pattern and
might reveal something, who knows. These can also run L2s using their own hypervisors.

Running a very old OS (say Windows XP, or some very old Linux) as L2 might also reveal something.

(But don't try to run Win95/98; this OS is known to not flush the TLB properly (it doesn't use INVLPG when it should),
so it doesn't work well on AMD at all because of this).

Finally, it might be worth it to develop a TLB stress test if one doesn't exist yet.

Best regards,
   Maxim Levitsky


> 
> > TLB management can be very tricky, so I can't be 100% sure that I haven't missed something.
> > 
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Thanks!
> 



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly
  2025-03-05  2:54       ` Maxim Levitsky
@ 2025-03-05  6:20         ` Yosry Ahmed
  0 siblings, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-05  6:20 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Tue, Mar 04, 2025 at 09:54:54PM -0500, Maxim Levitsky wrote:
> On Mon, 2025-03-03 at 22:05 +0000, Yosry Ahmed wrote:
> > On Fri, Feb 28, 2025 at 08:55:18PM -0500, Maxim Levitsky wrote:
> > > On Wed, 2025-02-05 at 18:23 +0000, Yosry Ahmed wrote:
> > > > Currently, INVLPGA interception handles it like INVLPG, which flushes
> > > > L1's TLB translations for the address. It was implemented in this way
> > > > because L1 and L2 shared an ASID. Now, L1 and L2 have separate ASIDs. It
> > > > is still harmless to flush L1's translations, but it's only correct
> > > > because all translations are flushed on nested transitions anyway.
> > > > 
> > > > In preparation for stopping unconditional flushes on nested transitions,
> > > > handle INVLPGA interception properly. If L1 specified zero as the ASID,
> > > > this is equivalent to INVLPG, so handle it as such. Otherwise, use
> > > > INVLPGA to flush the translations of the appropriate ASID tracked by
> > > > KVM, if any. Sync the shadow MMU as well, as L1 invalidated L2's
> > > > mappings.
> > > > 
> > > > Opportunistically update svm_flush_tlb_gva() to use
> > > > svm->current_vmcb->asid instead of svm->vmcb->control.asid for
> > > > consistency. The two should always be in sync except when KVM allocates
> > > > a new ASID in pre_svm_run(), and they are shortly brought back in sync
> > > > in svm_vcpu_run(). However, if future changes add more code paths where
> > > > KVM allocates a new ASID, flushing the potentially old ASID in
> > > > svm->vmcb->control.asid would be unnecessary overhead (although probably
> > > > not much different from flushing the newly allocated ASID).
> > > > 
> > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |  2 ++
> > > >  arch/x86/kvm/mmu/mmu.c          |  5 +++--
> > > >  arch/x86/kvm/svm/svm.c          | 40 ++++++++++++++++++++++++++++++---
> > > >  3 files changed, 42 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 5193c3dfbce15..1e147bb2e560f 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -2213,6 +2213,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> > > >  		       void *insn, int insn_len);
> > > >  void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
> > > >  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
> > > > +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > > +			       u64 addr, unsigned long roots, bool gva_flush);
> > > >  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > >  			     u64 addr, unsigned long roots);
> > > >  void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index ac133abc9c173..f5e0d2c8f4bbe 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -6158,8 +6158,8 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
> > > >  	write_unlock(&vcpu->kvm->mmu_lock);
> > > >  }
> > > >  
> > > > -static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > > -				      u64 addr, unsigned long roots, bool gva_flush)
> > > > +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > > +			       u64 addr, unsigned long roots, bool gva_flush)
> > > >  {
> > > >  	int i;
> > > >  
> > > > @@ -6185,6 +6185,7 @@ static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
> > > >  			kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
> > > >  	}
> > > >  }
> > > > +EXPORT_SYMBOL_GPL(__kvm_mmu_invalidate_addr);
> > > >  
> > > >  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > >  			     u64 addr, unsigned long roots)
> > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > > index a2d601cd4c283..9e29f87d3bd93 100644
> > > > --- a/arch/x86/kvm/svm/svm.c
> > > > +++ b/arch/x86/kvm/svm/svm.c
> > > > @@ -2483,6 +2483,7 @@ static int clgi_interception(struct kvm_vcpu *vcpu)
> > > >  
> > > >  static int invlpga_interception(struct kvm_vcpu *vcpu)
> > > >  {
> > > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > >  	gva_t gva = kvm_rax_read(vcpu);
> > > >  	u32 asid = kvm_rcx_read(vcpu);
> > > >  
> > > > @@ -2492,8 +2493,41 @@ static int invlpga_interception(struct kvm_vcpu *vcpu)
> > > >  
> > > >  	trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
> > > >  
> > > > -	/* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
> > > > -	kvm_mmu_invlpg(vcpu, gva);
> > > > +	/*
> > > > +	 * APM is silent about using INVLPGA to flush the host ASID (i.e. 0).
> > > > +	 * Do the logical thing and handle it like INVLPG.
> > > > +	 */
> > > > +	if (asid == 0) {
> > > > +		kvm_mmu_invlpg(vcpu, gva);
> > > > +		return kvm_skip_emulated_instruction(vcpu);
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Check if L1 specified the L2 ASID we are currently tracking. If
> > > > +	 * not, do nothing, as we have to handle the TLB flush when switching
> > > > +	 * to the new ASID anyway. APM mentions that INVLPGA is typically only
> > > > +	 * meaningful with shadow paging, so also do nothing if L1 is using
> > > > +	 * nested NPT.
> > > > +	 */
> > > > +	if (!nested_npt_enabled(svm) && asid == svm->nested.last_asid)
> > > > +		invlpga(gva, svm->nested.vmcb02.asid);
> > > 
> > > Hi, 
> > > 
> > > IMHO we can't just NOP the INVLPGA because it is not useful in the nested NPT case.
> > > 
> > > If I understand the APM correctly, the CPU will honor the INVLPGA
> > > request, even when NPT is enabled, and so KVM must do this as well.
> > > 
> > > It is not useful for the hypervisor because it needs a GVA, which in the case of NPT,
> > > the hypervisor won't usually track, but we can't completely rule out that some
> > > hypervisor uses this in some cases.
> > 
> > Yeah, I knew this was going to be a contention point; I was mainly waiting
> > to see what others think here.
> > 
> > I guess we can just map the ASID passed by L1 to the actual ASID we use
> > for L2 and execute the INVLPGA as-is with the gva passed by L1.
> 
> 
> If I understand correctly, we in essence support only 2 nested ASIDs: 0 and the one that
> L1 used last time. Anything else will get flushed on next VM entry.
>  
> So, if I understand this correctly, all we need to do is to drop the 
> 'nested_npt_enabled(svm)' check above, and it should work.

That's exactly what I did in my local tree.
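
For reference, with the nested_npt_enabled() check dropped, the
interception would presumably reduce to something like the following
(a sketch of the change discussed above, not the code actually posted):

	/*
	 * Flush the L2 translation for GVA if L1 passed the ASID that KVM
	 * is currently tracking for L2. Any other non-zero ASID is flushed
	 * on the next nested VMRUN anyway, so there is nothing to do.
	 */
	if (asid == svm->nested.last_asid)
		invlpga(gva, svm->nested.vmcb02.asid);

	return kvm_skip_emulated_instruction(vcpu);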

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry
  2025-03-05  3:01       ` Maxim Levitsky
@ 2025-03-05  6:34         ` Yosry Ahmed
  0 siblings, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-05  6:34 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Tue, Mar 04, 2025 at 10:01:25PM -0500, Maxim Levitsky wrote:
> On Mon, 2025-03-03 at 22:14 +0000, Yosry Ahmed wrote:
> > On Fri, Feb 28, 2025 at 09:17:52PM -0500, Maxim Levitsky wrote:
> > > On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > > > TLB_CONTROL is reset to TLB_CONTROL_DO_NOTHING on nested transitions to
> > > > L2. This is unnecessary because it should always be
> > > > TLB_CONTROL_DO_NOTHING at this point.
> > > > 
> > > > The flow for setting TLB_CONTROL is as follows:
> > > > 1. In vcpu_enter_guest(), servicing a TLB flush request may set it to
> > > > TLB_CONTROL_FLUSH_ASID in svm_flush_tlb_asid().
> > > > 2. In svm_vcpu_run() -> pre_svm_run(), it may get upgraded to
> > > > TLB_CONTROL_FLUSH_ALL_ASID when assigning a new ASID.
> > > > 3. In svm_vcpu_run(), it gets reset to TLB_CONTROL_DO_NOTHING after the
> > > > guest is run.
> > > > 
> > > > Hence, TLB_CONTROL is reset after each run and there is no need to do it
> > > > again on every nested transition to L2.
> > > > 
> > > > There is a TODO in nested_svm_transition_tlb_flush() about this reset
> > > > crushing pending TLB flushes. Remove it, as the reset is not really
> > > > crushing anything as explained above.
> > > 
> > > I am not sure that we don't crush a pending tlb request: 
> > > 
> > > svm_flush_tlb_asid can also be called by KVM_REQ_TLB_FLUSH
> > > and set the flush request in both vmcbs, thus later the nested_svm_exit_tlb_flush
> > > can crush this request.
> > 
> > How so?
> > 
> > nested_svm_exit_tlb_flush() makes a KVM_REQ_TLB_FLUSH_GUEST request.
> > svm_flush_tlb_asid() is called when servicing KVM_REQ_TLB_* requests.
> 
> I am probably missing something but:
> 
> > Suppose KVM_REQ_TLB_FLUSH is raised and then processed while an ordinary L1 entry is happening,
> but nested state is allocated.
> 
> (KVM_REQ_TLB_FLUSH can be raised anytime MMU wants a 'big hammer flush everything')
> 
> In this case svm_flush_tlb_all will call svm_flush_tlb_asid on both vmcbs (see patch 8),
> and that will set TLB_CONTROL_FLUSH_ASID in both vmcbs.
> In particular it will be set in vmcb02.
> 
> > Later, maybe even hours later in theory, L1 issues VMRUN; we reach nested_vmcb02_prepare_control
> and crush the value (TLB_CONTROL_FLUSH_ASID) set in vmcb02.
> 
> I think that this is what the removed comment referred to.

When KVM_REQ_TLB_FLUSH is raised, we do not call svm_flush_tlb_all()
immediately. We only call svm_flush_tlb_all() when the request is
serviced in vcpu_enter_guest():

	/*
	 * Note, the order matters here, as flushing "all" TLB entries
	 * also flushes the "current" TLB entries, i.e. servicing the
	 * flush "all" will clear any request to flush "current".
	 */
	if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
		kvm_vcpu_flush_tlb_all(vcpu);

	kvm_service_local_tlb_flush_requests(vcpu);

IIUC, vcpu_enter_guest() will be called for L2 after
nested_vmcb02_prepare_control() is called. My understanding is that the
sequence of events is as follows:
- L1 executes VMRUN, which is trapped and emulated by L0.

- KVM handles the VM-exit in L0 by calling
  vmrun_interception() to set up VMCB02 in preparation for running L2.
  This will call nested_svm_vmrun() -> enter_svm_guest_mode() ->
  nested_vmcb02_prepare_control() (setting tlb_ctl to
  TLB_CONTROL_DO_NOTHING).

- Execution will pick up after the VMRUN instruction in L0, eventually
  getting to the loop in vcpu_run(), and calling vcpu_enter_guest()
  for L2. At this point any pending TLB flush requests (e.g.
  KVM_REQ_TLB_FLUSH) will be handled, and svm_flush_tlb_*() functions
  may be called to set tlb_ctl to a non-zero value (e.g.
  TLB_CONTROL_FLUSH_ASID).

- A little bit later in svm_vcpu_run() -> pre_svm_run(), tlb_ctl may be
  upgraded to TLB_CONTROL_FLUSH_ALL_ASID if a new ASID is allocated.
 
- After the guest is run, svm_vcpu_run() resets tlb_ctl to TLB_CONTROL_DO_NOTHING.

So nested_vmcb02_prepare_control() setting tlb_ctl to
TLB_CONTROL_DO_NOTHING should have no effect because tlb_ctl is set
after that anyway before L2 is run, and reset back to
TLB_CONTROL_DO_NOTHING after L2 is run.
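
To make the ordering concrete, here is a tiny standalone C model of the
tlb_ctl lifecycle argued above (the function names mirror the KVM flow,
but this is only an illustration, not KVM code):

	#include <assert.h>
	#include <stdio.h>

	enum tlb_ctl { DO_NOTHING, FLUSH_ASID, FLUSH_ALL_ASID };

	static enum tlb_ctl tlb_ctl = DO_NOTHING;

	static void nested_vmcb02_prepare_control(void)
	{
		tlb_ctl = DO_NOTHING;		/* the reset in question */
	}

	static void svm_flush_tlb_asid(void)
	{
		tlb_ctl = FLUSH_ASID;		/* servicing a KVM_REQ_* flush */
	}

	static void svm_vcpu_run(int new_asid)
	{
		if (new_asid)			/* pre_svm_run() upgrade */
			tlb_ctl = FLUSH_ALL_ASID;
		/* hardware VMRUN consumes tlb_ctl here ... */
		tlb_ctl = DO_NOTHING;		/* ... and resets it after the run */
	}

	int main(void)
	{
		/* VMRUN emulation runs first, with tlb_ctl already clear ... */
		nested_vmcb02_prepare_control();
		assert(tlb_ctl == DO_NOTHING);	/* nothing was crushed */

		/* ... then vcpu_enter_guest() services flush requests ... */
		svm_flush_tlb_asid();

		/* ... and the run consumes and resets tlb_ctl. */
		svm_vcpu_run(0);
		assert(tlb_ctl == DO_NOTHING);
		printf("tlb_ctl is only non-zero between request servicing and VMRUN\n");
		return 0;
	}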

I tried to clarify this in the commit log, but I don't think it is clear
enough. I will try to add more details in the next version.

Please correct me if I am wrong.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions
  2025-03-05  3:03       ` Maxim Levitsky
@ 2025-03-05  6:37         ` Yosry Ahmed
  0 siblings, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-05  6:37 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Tue, Mar 04, 2025 at 10:03:51PM -0500, Maxim Levitsky wrote:
> On Mon, 2025-03-03 at 22:18 +0000, Yosry Ahmed wrote:
> > On Fri, Feb 28, 2025 at 09:20:18PM -0500, Maxim Levitsky wrote:
> > > On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > > > KVM does not track TLB flush requests for L1 vs. L2. Hence, service
> > > > local flushes that target the current context before switching to a new
> > > > one. Since ASIDs are tracked per-VMCB, service the flushes before every
> > > > VMCB switch.
> > > > 
> > > > This is conceptually similar to how nVMX calls
> > > > kvm_service_local_tlb_flush_requests() in
> > > > nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(), with the
> > > > following differences:
> > > > 
> > > > 1. nVMX tracks the current VPID based on is_guest_mode(), so local TLB
> > > >    flushes are serviced before enter_guest_mode() and
> > > >    leave_guest_mode(). On the other hand, nSVM tracks the current ASID
> > > >    based on the current VMCB, so the TLB flushes are serviced before a
> > > >    VMCB switch.
> > > > 
> > > > 2. nVMX only enters and leaves guest mode in
> > > >    nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(). Other paths
> > > >    like vmx_set_nested_state() and vmx_leave_nested() call into these
> > > >    two functions. On the other hand, nSVM open codes the switch in
> > > >    functions like svm_set_nested_state() and svm_leave_nested(), so
> > > >    servicing the flush in svm_switch_vmcb() is probably most reliable.
> > > > 
> > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > ---
> > > >  arch/x86/kvm/svm/svm.c | 6 ++++++
> > > >  1 file changed, 6 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > > index 5e7b1c9bfa605..6daa7efa9262b 100644
> > > > --- a/arch/x86/kvm/svm/svm.c
> > > > +++ b/arch/x86/kvm/svm/svm.c
> > > > @@ -1421,6 +1421,12 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > >  
> > > >  void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
> > > >  {
> > > > +	/*
> > > > +	 * ASIDs are tracked per-VMCB. Perform any pending TLB flushes for the
> > > > +	 * current VMCB before switching to a new one.
> > > > +	 */
> > > > +	kvm_service_local_tlb_flush_requests(&svm->vcpu);
> > > > +
> > > >  	svm->current_vmcb = target_vmcb;
> > > >  	svm->vmcb = target_vmcb->ptr;
> > > >  }
> > > 
> > > Note that another difference between SVM and VMX is that this code will only set tlb_ctl
> > > in the current vmcb; the actual flush can happen much later, when we do a VM entry with this vmcb,
> > > e.g. if we are now in L2, the flush will happen when we enter L2 again.
> > 
> > Right, but I think the internal implementation of the TLB flushes is not
> > relevant in this specific instance. Do you think it would be useful to
> > mention that here?
> 
> > I am not sure, to be honest. I just mentioned this because in theory there can be a difference:
> > we might think that we flushed the TLB while in fact we haven't yet.
> 
> I am trying my best to think about what hidden problems might lurk around and surface later.
> 
> Not directly related to the above, but I am thinking:
> I really like the way SVM flush works because it ensures that redundant flushes don't cost anything.
> 
> I wonder if we can make VMX code emulate this,
> by having an emulated 'tlb_control' field and then doing the flush (INVEPT) on VM entry.

I briefly thought about this while working on this series. One way to do
this is to make TLB flush requests (e.g. KVM_REQ_TLB_FLUSH) when
emulating INVVPID and INVEPT instead of doing the flush right away. Then
we do service the flushes once before running the guest. Redundant
flushes are coalesced.

But I did not look too closely at whether the existing TLB flush requests
satisfy all the use cases or whether we would need to create new ones. Not
sure if the complexity would be worth it, but something to think about.
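
A rough sketch of that idea, with a hypothetical helper name (the real
INVEPT/INVVPID emulation does considerably more work, so this only
illustrates the deferral itself):

	/*
	 * Hypothetical: instead of flushing in the INVEPT/INVVPID
	 * emulation itself, only record a request and let the next entry
	 * service it, so back-to-back emulated flushes coalesce into one
	 * real flush.
	 */
	static int nested_vmx_defer_flush(struct kvm_vcpu *vcpu)
	{
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
		return nested_vmx_succeed(vcpu);
	}

	/* ... later, in vcpu_enter_guest(), before the actual VM entry: */
	kvm_service_local_tlb_flush_requests(vcpu);	/* one real flush */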

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on nested transitions
  2025-03-05  3:14       ` Maxim Levitsky
@ 2025-03-05  6:45         ` Yosry Ahmed
  0 siblings, 0 replies; 48+ messages in thread
From: Yosry Ahmed @ 2025-03-05  6:45 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Tue, Mar 04, 2025 at 10:14:40PM -0500, Maxim Levitsky wrote:
> On Mon, 2025-03-03 at 22:21 +0000, Yosry Ahmed wrote:
> > On Fri, Feb 28, 2025 at 09:21:54PM -0500, Maxim Levitsky wrote:
> > > On Wed, 2025-02-05 at 18:24 +0000, Yosry Ahmed wrote:
> > > > Now that nested TLB flushes are properly tracked with a well-maintained
> > > > separate ASID for L2 and proper handling of L1's TLB flush requests,
> > > > drop the unconditional flushes and syncs on nested transitions.
> > > > 
> > > > On a Milan machine, L1 and L2 guests were booted, each with a single
> > > > vCPU and pinned to a single physical CPU to maximize TLB collisions. In
> > > > this setup, the cpuid_rate microbenchmark [1] showed the following
> > > > changes with this patch:
> > > > 
> > > > +--------+--------+-------------------+----------------------+
> > > > | L0     | L1     | cpuid_rate (base) | cpuid_rate (patched) |
> > > > +========+========+===================+======================+
> > > > | NPT    | NPT    | 256621            | 301113 (+17.3%)      |
> > > > | NPT    | Shadow | 180017            | 203347 (+12.96%)     |
> > > > | Shadow | Shadow | 177006            | 189150 (+6.86%)      |
> > > > +--------+--------+-------------------+----------------------+
> > > > 
> > > > [1]https://lore.kernel.org/kvm/20231109180646.2963718-1-khorenko@virtuozzo.com/
> > > > 
> > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > ---
> > > >  arch/x86/kvm/svm/nested.c | 7 -------
> > > >  1 file changed, 7 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > > index 8e40ff21f7353..45a187d4c23d1 100644
> > > > --- a/arch/x86/kvm/svm/nested.c
> > > > +++ b/arch/x86/kvm/svm/nested.c
> > > > @@ -512,9 +512,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
> > > >  		svm->nested.last_asid = svm->nested.ctl.asid;
> > > >  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> > > >  	}
> > > > -	/* TODO: optimize unconditional TLB flush/MMU sync */
> > > > -	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > > > -	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > > >  }
> > > >  
> > > >  static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> > > > @@ -530,10 +527,6 @@ static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
> > > >  	 */
> > > >  	if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
> > > >  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> > > > -
> > > > -	/* TODO: optimize unconditional TLB flush/MMU sync */
> > > > -	kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > > > -	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > > >  }
> > > >  
> > > >  /*
> > > 
> > > Assuming that all previous patches are correct, this one should work as well.
> > > 
> > > However, only very heavy stress testing, including Hyper-V, Windows guests
> > > of various types, etc., can give me confidence that there is no ugly bug lurking
> > > somewhere.
> > 
> > I tried booting an L2 and running some workloads like netperf in there.
> > I also tried booting an L3.
> > 
> > I am planning to try and run some testing with a Windows L2 guest. I am
> > assuming this exercises the Hyper-V emulation in L1, which could be
> > interesting.
> > 
> > I am not sure if I will be able to test more scenarios though,
> > especially Windows as an L1 (and something else as an L2).
> > 
> > Let me know if you have something specific in mind.
> 
> 
> KVM can run itself 'under' Hyper-V (although in this case, when it runs a guest,
> the guest will be L3 overall, so not really something supported, but still something that might
> reveal bugs).
> In this case KVM/L1 can take advantage of L0's TLB flush interface.

I don't think I will be able to test on Hyper-V.

> 
> Stress testing L3s can also be nice, although in this case, from L0's POV, it doesn't see L3 at all.
> Instead, it sees that L1 runs two different L2s back to back, so the current code will
> likely flush everything all the time.

I did run an L3 in an attempt to shake out any bugs.

> 
> 
> The direct TLB flush that Hyper-V does, especially from L2 to L0, should also be tested;
> it's a relatively new feature, so we need to check that L2 actually uses it.

Is this when KVM is emulating Hyper-V for nested guests, or when KVM is
running on top of Hyper-V? If the latter, as I said earlier I am not
sure if I will be able to test that.

> 
> KVM also has its own TLB flush paravirtualization, which can in theory interfere with this.
> 
> 
> It's also nice to run a Hyper-V-enabled Windows as a KVM guest, and run a guest in it (can be Windows or Linux or anything else).
> Such a guest will run two L2 VMs: Windows itself and the VM you run inside.

Yeah, that's something I intend to do. Sean mentioned that recent
Windows versions run the OS in L1 on top of the hypervisor in L0, so I
think if I run a Windows VM I automatically get both L1 and L2. So just
running a Windows VM should exercise the TLB flushes. I will also try to
run WSL to have multiple L2 VMs. I believe that's what you are talking
about here.

> 
> 
> You can also try other L1s, like VirtualBox or VMware, running in a Windows or Linux L1,
> which can themselves run a Windows or Linux L2.
> 
> You can also test other OSes, like the BSDs, as L1; they might have a different TLB access pattern and
> might reveal something, who knows. These can also run L2s using their own hypervisors.
> 
> Running a very old OS (say Windows XP, or some very old Linux) as L2 might also reveal something.

Honestly, I don't think I have the time or resources to test other
operating systems or L1s. Currently my plan is to try and exercise
more scenarios in a Linux L2 guest, and run a Windows guest as I
mentioned earlier.

> 
> (But don't try to run Win95/98; this OS is known to not flush the TLB properly (it doesn't use INVLPG when it should),
> so it doesn't work well on AMD at all because of this).

Good to know :)

> 
> Finally, it might be worth it to develop a TLB stress test if one doesn't exist yet.

I also thought about this, but I think it would be very tricky to cover
all the cases, and we'd need the test to create an L1 that is
sophisticated enough to exercise different TLB flushing scenarios. I
think running an actual OS as L1 is probably exercising the TLB code
more than any test.

That being said, Sean did mention the 'access' tests in KUT, and I plan
to check how relevant they are and if they can be easily extended to add
some coverage for this.
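
As a starting point, even a tiny kvm-unit-tests-style check of the basic
invariant could help. A minimal, untested sketch, assuming the usual KUT
helpers (setup_vm(), alloc_page(), alloc_vpage(), install_page() and
invlpg()):

	#include "libcflat.h"
	#include "vm.h"
	#include "processor.h"

	int main(void)
	{
		u8 *p1, *p2;
		volatile u8 *va;

		setup_vm();
		p1 = alloc_page();
		p2 = alloc_page();
		va = alloc_vpage();

		/* Map va -> p1 and populate the TLB entry. */
		install_page((pgd_t *)read_cr3(), virt_to_phys(p1), (void *)va);
		*va = 0x11;

		/* Remap va -> p2, then invalidate the old translation. */
		install_page((pgd_t *)read_cr3(), virt_to_phys(p2), (void *)va);
		invlpg((void *)va);

		p2[0] = 0x22;
		report(*va == 0x22, "no stale translation after INVLPG");
		return report_summary();
	}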

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2025-03-05  6:45 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-05 18:23 [RFC PATCH 00/13] Optimize nSVM TLB flushes Yosry Ahmed
2025-02-05 18:23 ` [RFC PATCH 01/13] KVM: nSVM: Track the ASID per-VMCB Yosry Ahmed
2025-03-01  0:03   ` Sean Christopherson
2025-03-03 17:51     ` Jim Mattson
2025-03-03 18:53       ` Sean Christopherson
2025-03-03 19:18     ` Yosry Ahmed
2025-03-01  1:23   ` Maxim Levitsky
2025-03-03 19:31     ` Yosry Ahmed
2025-02-05 18:23 ` [RFC PATCH 02/13] KVM: nSVM: Rework svm_flush_tlb_asid() to operate on a given VMCB Yosry Ahmed
2025-03-01  1:29   ` Maxim Levitsky
2025-03-03 21:58     ` Yosry Ahmed
2025-03-05  2:52       ` Maxim Levitsky
2025-02-05 18:23 ` [RFC PATCH 03/13] KVM: nSVM: Split nested_svm_transition_tlb_flush() into entry/exit fns Yosry Ahmed
2025-03-01  1:34   ` Maxim Levitsky
2025-02-05 18:23 ` [RFC PATCH 04/13] KVM: SVM: Introduce helpers for updating TLB_CONTROL Yosry Ahmed
2025-03-01  1:37   ` Maxim Levitsky
2025-02-05 18:23 ` [RFC PATCH 05/13] KVM: x86/mmu: rename __kvm_mmu_invalidate_addr() Yosry Ahmed
2025-02-05 18:23 ` [RFC PATCH 06/13] KVM: x86/mmu: Allow skipping the gva flush in kvm_mmu_invalidate_addr() Yosry Ahmed
2025-02-05 18:23 ` [RFC PATCH 07/13] KVM: nSVM: Handle INVLPGA interception correctly Yosry Ahmed
2025-03-01  1:55   ` Maxim Levitsky
2025-03-03 22:05     ` Yosry Ahmed
2025-03-05  2:54       ` Maxim Levitsky
2025-03-05  6:20         ` Yosry Ahmed
2025-02-05 18:23 ` [RFC PATCH 08/13] KVM: nSVM: Flush both L1 and L2 ASIDs on KVM_REQ_TLB_FLUSH Yosry Ahmed
2025-03-01  1:58   ` Maxim Levitsky
2025-03-03 22:06     ` Yosry Ahmed
2025-02-05 18:23 ` [RFC PATCH 09/13] KVM: nSVM: Handle nested TLB flush requests through TLB_CONTROL Yosry Ahmed
2025-02-05 21:45   ` Yosry Ahmed
2025-03-01  2:06   ` Maxim Levitsky
2025-03-03 22:10     ` Yosry Ahmed
2025-03-05  2:57       ` Maxim Levitsky
2025-02-05 18:23 ` [RFC PATCH 10/13] KVM: nSVM: Flush the TLB if L1 changes L2's ASID Yosry Ahmed
2025-03-01  2:13   ` Maxim Levitsky
2025-02-05 18:24 ` [RFC PATCH 11/13] KVM: nSVM: Do not reset TLB_CONTROL in VMCB02 on nested entry Yosry Ahmed
2025-03-01  2:17   ` Maxim Levitsky
2025-03-03 22:14     ` Yosry Ahmed
2025-03-05  3:01       ` Maxim Levitsky
2025-03-05  6:34         ` Yosry Ahmed
2025-02-05 18:24 ` [RFC PATCH 12/13] KVM: nSVM: Service local TLB flushes before nested transitions Yosry Ahmed
2025-03-01  2:20   ` Maxim Levitsky
2025-03-03 22:18     ` Yosry Ahmed
2025-03-05  3:03       ` Maxim Levitsky
2025-03-05  6:37         ` Yosry Ahmed
2025-02-05 18:24 ` [RFC PATCH 13/13] KVM: nSVM: Stop bombing the TLB on " Yosry Ahmed
2025-03-01  2:21   ` Maxim Levitsky
2025-03-03 22:21     ` Yosry Ahmed
2025-03-05  3:14       ` Maxim Levitsky
2025-03-05  6:45         ` Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).