* [PATCH 0/3] KVM: x86: allow DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM passthrough
@ 2025-04-16  0:25 Maxim Levitsky
  2025-04-16  0:25 ` [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions Maxim Levitsky
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Maxim Levitsky @ 2025-04-16  0:25 UTC (permalink / raw)
  To: kvm
  Cc: Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Sean Christopherson, Dave Hansen, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Maxim Levitsky

Currently KVM allows the guest to set IA32_DEBUGCTL to whatever value the
guest wants, capped only by a bitmask of allowed bits.

(The exception is nested VM entry, where KVM apparently doesn't even check
this set of allowed bits - this patch series also fixes that.)

However, some IA32_DEBUGCTL bits can be useful for the host, e.g.
IA32_DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM, which isolates the PMU from the
influence of the host's SMM.

Reshuffle some of the code to allow the host's value of such bits
(currently only this one bit) to be passed through to the guest.

Note that the host value of this bit can be toggled by writing 0 or 1 to
/sys/devices/cpu/freeze_on_smi.
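
For reference, a userspace helper along these lines (a sketch, not part of
this series) can be used to flip that knob:

	#include <fcntl.h>
	#include <unistd.h>

	/* Sketch: toggle the host's freeze_on_smi knob; requires root. */
	static int set_freeze_on_smi(int enable)
	{
		char c = enable ? '1' : '0';
		int fd = open("/sys/devices/cpu/freeze_on_smi", O_WRONLY);
		int ret = -1;

		if (fd < 0)
			return -1;
		if (write(fd, &c, 1) == 1)
			ret = 0;
		close(fd);
		return ret;
	}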

This was tested on an Intel(R) Xeon(R) Silver 4410Y with KVM unit tests and
KVM selftests running in parallel with a tight loop writing to I/O port
0xB2, which on this machine generates SMIs.

SMI generation was also verified by reading MSR 0x34, which shows the
current count of SMIs received.

Despite the flood of SMIs, the tests survived with this patch series applied.
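
For reference, the SMI flood was generated with a userspace loop roughly
like the following (a sketch, not the exact test harness; requires root):

	#include <stdio.h>
	#include <sys/io.h>

	int main(void)
	{
		/* Sketch of the SMI-flood loop: port 0xB2 is the APM/SMI
		 * command port on this machine, any write triggers an SMI. */
		if (ioperm(0xb2, 1, 1)) {
			perror("ioperm");
			return 1;
		}

		for (;;)
			outb(0, 0xb2);
	}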

Best regards,
     Maxim Levitsky

Maxim Levitsky (3):
  x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access
    functions
  x86: KVM: VMX: cache guest written value of MSR_IA32_DEBUGCTL
  x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the
    guest mode

 arch/x86/kvm/svm/svm.c       |  2 +
 arch/x86/kvm/vmx/nested.c    | 15 +++++--
 arch/x86/kvm/vmx/pmu_intel.c |  9 ++--
 arch/x86/kvm/vmx/vmx.c       | 87 +++++++++++++++++++++++++++---------
 arch/x86/kvm/vmx/vmx.h       |  4 ++
 arch/x86/kvm/x86.c           |  2 -
 6 files changed, 89 insertions(+), 30 deletions(-)

-- 
2.26.3




* [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-04-16  0:25 [PATCH 0/3] KVM: x86: allow DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM passthrough Maxim Levitsky
@ 2025-04-16  0:25 ` Maxim Levitsky
  2025-04-22 23:33   ` Sean Christopherson
  2025-04-23  9:51   ` Mi, Dapeng
  2025-04-16  0:25 ` [PATCH 2/3] x86: KVM: VMX: cache guest written value of MSR_IA32_DEBUGCTL Maxim Levitsky
  2025-04-16  0:25 ` [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode Maxim Levitsky
  2 siblings, 2 replies; 21+ messages in thread
From: Maxim Levitsky @ 2025-04-16  0:25 UTC (permalink / raw)
  To: kvm
  Cc: Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Sean Christopherson, Dave Hansen, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Maxim Levitsky

Instead of reading and writing the GUEST_IA32_DEBUGCTL VMCS field directly,
wrap the logic with get/set functions.

Also move the checks that validate the guest-supplied value into the new
'set' function.

In particular, the above change fixes a minor security issue in which the
L1 hypervisor could set GUEST_IA32_DEBUGCTL, and eventually the host's
MSR_IA32_DEBUGCTL, to any value by performing a VM entry to L2 with
VM_ENTRY_LOAD_DEBUG_CONTROLS set.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/nested.c    | 15 +++++++---
 arch/x86/kvm/vmx/pmu_intel.c |  9 +++---
 arch/x86/kvm/vmx/vmx.c       | 58 +++++++++++++++++++++++-------------
 arch/x86/kvm/vmx/vmx.h       |  3 ++
 4 files changed, 57 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index e073e3008b16..b7686569ee09 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2641,6 +2641,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx);
 	bool load_guest_pdptrs_vmcs12 = false;
+	u64 new_debugctl;
 
 	if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) {
 		prepare_vmcs02_rare(vmx, vmcs12);
@@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	if (vmx->nested.nested_run_pending &&
 	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
 		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
-		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
+		new_debugctl = vmcs12->guest_ia32_debugctl;
 	} else {
 		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
-		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
+		new_debugctl = vmx->nested.pre_vmenter_debugctl;
 	}
+
+	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
+		*entry_failure_code = ENTRY_FAIL_DEFAULT;
+		return -EINVAL;
+	}
+
 	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
 		vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs);
@@ -3520,7 +3527,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 
 	if (!vmx->nested.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
-		vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+		vmx->nested.pre_vmenter_debugctl = vmx_get_guest_debugctl(vcpu);
 	if (kvm_mpx_supported() &&
 	    (!vmx->nested.nested_run_pending ||
 	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
@@ -4788,7 +4795,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	__vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
 
 	kvm_set_dr(vcpu, 7, 0x400);
-	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
+	vmx_set_guest_debugctl(vcpu, 0, false);
 
 	if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
 				vmcs12->vm_exit_msr_load_count))
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 8a94b52c5731..f6f448adfb80 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -19,6 +19,7 @@
 #include "lapic.h"
 #include "nested.h"
 #include "pmu.h"
+#include "vmx.h"
 #include "tdx.h"
 
 /*
@@ -652,11 +653,11 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
  */
 static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
 {
-	u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	u64 data = vmx_get_guest_debugctl(vcpu);
 
 	if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) {
 		data &= ~DEBUGCTLMSR_LBR;
-		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+		vmx_set_guest_debugctl(vcpu, data, true);
 	}
 }
 
@@ -729,7 +730,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
 
 	if (!lbr_desc->event) {
 		vmx_disable_lbr_msrs_passthrough(vcpu);
-		if (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR)
+		if (vmx_get_guest_debugctl(vcpu) & DEBUGCTLMSR_LBR)
 			goto warn;
 		if (test_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use))
 			goto warn;
@@ -751,7 +752,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
 
 static void intel_pmu_cleanup(struct kvm_vcpu *vcpu)
 {
-	if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
+	if (!(vmx_get_guest_debugctl(vcpu) & DEBUGCTLMSR_LBR))
 		intel_pmu_release_guest_lbr_event(vcpu);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ef2d7208dd20..4237422dc4ed 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2154,7 +2154,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
 		break;
 	case MSR_IA32_DEBUGCTLMSR:
-		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+		msr_info->data = vmx_get_guest_debugctl(vcpu);
 		break;
 	default:
 	find_uret_msr:
@@ -2194,6 +2194,41 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
 	return debugctl;
 }
 
+u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
+{
+	return vmcs_read64(GUEST_IA32_DEBUGCTL);
+}
+
+static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
+{
+	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+}
+
+bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
+{
+	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
+
+	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
+		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
+		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
+		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
+	}
+
+	if (invalid)
+		return false;
+
+	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
+					VM_EXIT_SAVE_DEBUG_CONTROLS))
+		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
+
+	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
+	    (data & DEBUGCTLMSR_LBR))
+		intel_pmu_create_guest_lbr_event(vcpu);
+
+	__vmx_set_guest_debugctl(vcpu, data);
+	return true;
+}
+
 /*
  * Writes msr value into the appropriate "register".
  * Returns 0 on success, non-0 otherwise.
@@ -2263,26 +2298,9 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vmcs_writel(GUEST_SYSENTER_ESP, data);
 		break;
 	case MSR_IA32_DEBUGCTLMSR: {
-		u64 invalid;
-
-		invalid = data & ~vmx_get_supported_debugctl(vcpu, msr_info->host_initiated);
-		if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
-			kvm_pr_unimpl_wrmsr(vcpu, msr_index, data);
-			data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
-			invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
-		}
-
-		if (invalid)
+		if (!vmx_set_guest_debugctl(vcpu, data, msr_info->host_initiated))
 			return 1;
 
-		if (is_guest_mode(vcpu) && get_vmcs12(vcpu)->vm_exit_controls &
-						VM_EXIT_SAVE_DEBUG_CONTROLS)
-			get_vmcs12(vcpu)->guest_ia32_debugctl = data;
-
-		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
-		if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
-		    (data & DEBUGCTLMSR_LBR))
-			intel_pmu_create_guest_lbr_event(vcpu);
 		return 0;
 	}
 	case MSR_IA32_BNDCFGS:
@@ -4795,7 +4813,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	vmcs_write32(GUEST_SYSENTER_CS, 0);
 	vmcs_writel(GUEST_SYSENTER_ESP, 0);
 	vmcs_writel(GUEST_SYSENTER_EIP, 0);
-	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
+	__vmx_set_guest_debugctl(&vmx->vcpu, 0);
 
 	if (cpu_has_vmx_tpr_shadow()) {
 		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 6d1e40ecc024..8ac46fb47abd 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -404,6 +404,9 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
 
 gva_t vmx_get_untagged_addr(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
 
+bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 value, bool host_initiated);
+u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu);
+
 static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
 					     int type, bool value)
 {
-- 
2.26.3



* [PATCH 2/3] x86: KVM: VMX: cache guest written value of MSR_IA32_DEBUGCTL
  2025-04-16  0:25 [PATCH 0/3] KVM: x86: allow DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM passthrough Maxim Levitsky
  2025-04-16  0:25 ` [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions Maxim Levitsky
@ 2025-04-16  0:25 ` Maxim Levitsky
  2025-04-16  0:25 ` [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode Maxim Levitsky
  2 siblings, 0 replies; 21+ messages in thread
From: Maxim Levitsky @ 2025-04-16  0:25 UTC (permalink / raw)
  To: kvm
  Cc: Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Sean Christopherson, Dave Hansen, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Maxim Levitsky

Store the guest's written value of MSR_IA32_DEBUGCTL in a software field,
instead of always reading/writing the GUEST_IA32_DEBUGCTL VMCS field.

This will make it possible, in the future, for the actual
GUEST_IA32_DEBUGCTL value to differ from the value that the guest has set.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 5 ++++-
 arch/x86/kvm/vmx/vmx.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4237422dc4ed..c9208a4acda4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2196,11 +2196,14 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
 
 u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
 {
-	return vmcs_read64(GUEST_IA32_DEBUGCTL);
+	return to_vmx(vcpu)->msr_ia32_debugctl;
 }
 
 static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
 {
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->msr_ia32_debugctl = data;
 	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 8ac46fb47abd..699da6d2bc66 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -291,6 +291,7 @@ struct vcpu_vmx {
 	/* SGX Launch Control public key hash */
 	u64 msr_ia32_sgxlepubkeyhash[4];
 	u64 msr_ia32_mcu_opt_ctrl;
+	u64 msr_ia32_debugctl;
 	bool disable_fb_clear;
 
 	struct pt_desc pt_desc;
-- 
2.26.3



* [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-04-16  0:25 [PATCH 0/3] KVM: x86: allow DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM passthrough Maxim Levitsky
  2025-04-16  0:25 ` [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions Maxim Levitsky
  2025-04-16  0:25 ` [PATCH 2/3] x86: KVM: VMX: cache guest written value of MSR_IA32_DEBUGCTL Maxim Levitsky
@ 2025-04-16  0:25 ` Maxim Levitsky
  2025-04-22 23:41   ` Sean Christopherson
  2025-04-23 10:10   ` Mi, Dapeng
  2 siblings, 2 replies; 21+ messages in thread
From: Maxim Levitsky @ 2025-04-16  0:25 UTC (permalink / raw)
  To: kvm
  Cc: Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Sean Christopherson, Dave Hansen, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Maxim Levitsky

Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest
GUEST_IA32_DEBUGCTL without the guest seeing this value.

Note that in the future we might allow the guest to set this bit as well,
when we implement PMU freezing on the VM's own, virtual SMM entry.

Since the value of the host DEBUGCTL can in theory change between VM runs,
check if it has changed, and if so, reload GUEST_IA32_DEBUGCTL with the new
value of its host-owned portion (currently only the
DEBUGCTLMSR_FREEZE_IN_SMM bit).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/svm.c |  2 ++
 arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c     |  2 --
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index cc1c721ba067..fda0660236d8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4271,6 +4271,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
 	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
 
+	vcpu->arch.host_debugctl = get_debugctlmsr();
+
 	/*
 	 * Disable singlestep if we're injecting an interrupt/exception.
 	 * We don't want our modified rflags to be pushed on the stack where
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c9208a4acda4..e0bc31598d60 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2194,6 +2194,17 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
 	return debugctl;
 }
 
+static u64 vmx_get_host_preserved_debugctl(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Bits of host's DEBUGCTL that we should preserve while the guest is
+	 * running.
+	 *
+	 * Some of those bits might still be emulated for the guest own use.
+	 */
+	return DEBUGCTLMSR_FREEZE_IN_SMM;
+}
+
 u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
 {
 	return to_vmx(vcpu)->msr_ia32_debugctl;
@@ -2202,9 +2213,11 @@ u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
 static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u64 host_mask = vmx_get_host_preserved_debugctl(vcpu);
 
 	vmx->msr_ia32_debugctl = data;
-	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+	vmcs_write64(GUEST_IA32_DEBUGCTL,
+		     (vcpu->arch.host_debugctl & host_mask) | (data & ~host_mask));
 }
 
 bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
@@ -2232,6 +2245,7 @@ bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated
 	return true;
 }
 
+
 /*
  * Writes msr value into the appropriate "register".
  * Returns 0 on success, non-0 otherwise.
@@ -7349,6 +7363,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long cr3, cr4;
+	u64 old_debugctl;
 
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!enable_vnmi &&
@@ -7379,6 +7394,17 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 		vmcs_write32(PLE_WINDOW, vmx->ple_window);
 	}
 
+	old_debugctl = vcpu->arch.host_debugctl;
+	vcpu->arch.host_debugctl = get_debugctlmsr();
+
+	/*
+	 * In case the host DEBUGCTL had changed since the last time we
+	 * read it, update the guest's GUEST_IA32_DEBUGCTL with
+	 * the host's bits.
+	 */
+	if (old_debugctl != vcpu->arch.host_debugctl)
+		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);
+
 	/*
 	 * We did this in prepare_switch_to_guest, because it needs to
 	 * be within srcu_read_lock.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 844e81ee1d96..05e866ed345d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11020,8 +11020,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(0, 7);
 	}
 
-	vcpu->arch.host_debugctl = get_debugctlmsr();
-
 	guest_timing_enter_irqoff();
 
 	for (;;) {
-- 
2.26.3



* Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-04-16  0:25 ` [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions Maxim Levitsky
@ 2025-04-22 23:33   ` Sean Christopherson
  2025-05-01 20:35     ` mlevitsk
  2025-04-23  9:51   ` Mi, Dapeng
  1 sibling, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-04-22 23:33 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Tue, Apr 15, 2025, Maxim Levitsky wrote:
> Instead of reading and writing GUEST_IA32_DEBUGCTL vmcs field directly,
> wrap the logic with get/set functions.

Why?  I know why the "set" helper is being added, but it needs to be called out.

Please omit the getter entirely, it does nothing more than obfuscate a very
simple line of code.

> Also move the checks that the guest's supplied value is valid to the new
> 'set' function.

Please do this in a separate patch.  There's no need to mix refactoring and
functional changes.

> In particular, the above change fixes a minor security issue in which L1

Bug, yes.  Not sure it constitutes a meaningful security issue though.

> hypervisor could set the GUEST_IA32_DEBUGCTL, and eventually the host's
> MSR_IA32_DEBUGCTL

No, the lack of a consistency check allows the guest to set the MSR in hardware,
but that is not the host's value.

> to any value by performing a VM entry to L2 with VM_ENTRY_LOAD_DEBUG_CONTROLS
> set.

Any *legal* value.  Setting completely unsupported bits will result in VM-Enter
failing with a consistency check VM-Exit.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c    | 15 +++++++---
>  arch/x86/kvm/vmx/pmu_intel.c |  9 +++---
>  arch/x86/kvm/vmx/vmx.c       | 58 +++++++++++++++++++++++-------------
>  arch/x86/kvm/vmx/vmx.h       |  3 ++
>  4 files changed, 57 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index e073e3008b16..b7686569ee09 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2641,6 +2641,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx);
>  	bool load_guest_pdptrs_vmcs12 = false;
> +	u64 new_debugctl;
>  
>  	if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) {
>  		prepare_vmcs02_rare(vmx, vmcs12);
> @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  	if (vmx->nested.nested_run_pending &&
>  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
>  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
> -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> +		new_debugctl = vmcs12->guest_ia32_debugctl;
>  	} else {
>  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
> -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
> +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
>  	}
> +
> +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {

The consistency check belongs in nested_vmx_check_guest_state(), only needs to
check the VM_ENTRY_LOAD_DEBUG_CONTROLS case, and should be posted as a separate
patch.
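
E.g. something like this (untested sketch; vmx_get_supported_debugctl()
would need to be exposed to nested.c):

	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) &&
	    CC(vmcs12->guest_ia32_debugctl &
	       ~vmx_get_supported_debugctl(vcpu, false)))
		return -EINVAL;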

> +		*entry_failure_code = ENTRY_FAIL_DEFAULT;
> +		return -EINVAL;
> +	}
> +
> +static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> +}
> +
> +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> +{
> +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
> +
> +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
> +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> +	}
> +
> +	if (invalid)
> +		return false;
> +
> +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
> +					VM_EXIT_SAVE_DEBUG_CONTROLS))
> +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> +
> +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> +	    (data & DEBUGCTLMSR_LBR))
> +		intel_pmu_create_guest_lbr_event(vcpu);
> +
> +	__vmx_set_guest_debugctl(vcpu, data);
> +	return true;

Return 0/-errno, not true/false.


* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-04-16  0:25 ` [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode Maxim Levitsky
@ 2025-04-22 23:41   ` Sean Christopherson
  2025-05-01 20:41     ` mlevitsk
  2025-04-23 10:10   ` Mi, Dapeng
  1 sibling, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-04-22 23:41 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Tue, Apr 15, 2025, Maxim Levitsky wrote:
> Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest
> GUEST_IA32_DEBUGCTL without the guest seeing this value.
> 
> Note that in the future we might allow the guest to set this bit as well,
> when we implement PMU freezing on VM own, virtual SMM entry.
> 
> Since the value of the host DEBUGCTL can in theory change between VM runs,
> check if has changed, and if yes, then reload the GUEST_IA32_DEBUGCTL with
> the new value of the host portion of it (currently only the
> DEBUGCTLMSR_FREEZE_IN_SMM bit)

No, it can't.  DEBUGCTLMSR_FREEZE_IN_SMM can be toggled via IPI callback, but
IRQs are disabled for the entirety of the inner run loop.  And if I'm somehow
wrong, this code movement absolutely belongs in a separate patch.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/svm/svm.c |  2 ++
>  arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c     |  2 --
>  3 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index cc1c721ba067..fda0660236d8 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4271,6 +4271,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>  	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
>  	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
>  
> +	vcpu->arch.host_debugctl = get_debugctlmsr();
> +
>  	/*
>  	 * Disable singlestep if we're injecting an interrupt/exception.
>  	 * We don't want our modified rflags to be pushed on the stack where
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c9208a4acda4..e0bc31598d60 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2194,6 +2194,17 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
>  	return debugctl;
>  }
>  
> +static u64 vmx_get_host_preserved_debugctl(struct kvm_vcpu *vcpu)

No, just open code handling DEBUGCTLMSR_FREEZE_IN_SMM, or make it a #define.
I'm not remotely convinced that we'll ever want to emulate DEBUGCTLMSR_FREEZE_IN_SMM,
and trying to plan for that possibility adds complexity for no immediate value.

> +{
> +	/*
> +	 * Bits of host's DEBUGCTL that we should preserve while the guest is
> +	 * running.
> +	 *
> +	 * Some of those bits might still be emulated for the guest own use.
> +	 */
> +	return DEBUGCTLMSR_FREEZE_IN_SMM;
>
>  u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
>  {
>  	return to_vmx(vcpu)->msr_ia32_debugctl;
> @@ -2202,9 +2213,11 @@ u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
>  static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	u64 host_mask = vmx_get_host_preserved_debugctl(vcpu);
>  
>  	vmx->msr_ia32_debugctl = data;
> -	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> +	vmcs_write64(GUEST_IA32_DEBUGCTL,
> +		     (vcpu->arch.host_debugctl & host_mask) | (data & ~host_mask));
>  }
>  
>  bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> @@ -2232,6 +2245,7 @@ bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated
>  	return true;
>  }
>  
> +

Spurious newline.

>  /*
>   * Writes msr value into the appropriate "register".
>   * Returns 0 on success, non-0 otherwise.
> @@ -7349,6 +7363,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	unsigned long cr3, cr4;
> +	u64 old_debugctl;
>  
>  	/* Record the guest's net vcpu time for enforced NMI injections. */
>  	if (unlikely(!enable_vnmi &&
> @@ -7379,6 +7394,17 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
>  	}
>  
> +	old_debugctl = vcpu->arch.host_debugctl;
> +	vcpu->arch.host_debugctl = get_debugctlmsr();
> +
> +	/*
> +	 * In case the host DEBUGCTL had changed since the last time we
> +	 * read it, update the guest's GUEST_IA32_DEBUGCTL with
> +	 * the host's bits.
> +	 */
> +	if (old_debugctl != vcpu->arch.host_debugctl)

This can and should be optimized to only do an update if a host-preserved bit
is toggled.
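
E.g. something like (untested):

	if ((old_debugctl ^ vcpu->arch.host_debugctl) & DEBUGCTLMSR_FREEZE_IN_SMM)
		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);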

> +		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);

I would rather have a helper that explicitly writes the VMCS field, not one that
sets the guest value *and* writes the VMCS field.

The usage in init_vmcs() doesn't need to write vmx->msr_ia32_debugctl because the
vCPU is zero allocated, and this usage doesn't change vmx->msr_ia32_debugctl.
So the only path that actually needs to modify vmx->msr_ia32_debugctl is
vmx_set_guest_debugctl().
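
Something like this (untested sketch, name made up):

	/* Hypothetical helper: only writes the VMCS field, doesn't touch
	 * vmx->msr_ia32_debugctl. */
	static void vmx_reload_guest_debugctl(struct kvm_vcpu *vcpu)
	{
		u64 val = to_vmx(vcpu)->msr_ia32_debugctl & ~DEBUGCTLMSR_FREEZE_IN_SMM;

		val |= vcpu->arch.host_debugctl & DEBUGCTLMSR_FREEZE_IN_SMM;
		vmcs_write64(GUEST_IA32_DEBUGCTL, val);
	}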

> +
>  	/*
>  	 * We did this in prepare_switch_to_guest, because it needs to
>  	 * be within srcu_read_lock.
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 844e81ee1d96..05e866ed345d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11020,8 +11020,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		set_debugreg(0, 7);
>  	}
>  
> -	vcpu->arch.host_debugctl = get_debugctlmsr();
> -
>  	guest_timing_enter_irqoff();
>  
>  	for (;;) {
> -- 
> 2.26.3
> 


* Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-04-16  0:25 ` [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions Maxim Levitsky
  2025-04-22 23:33   ` Sean Christopherson
@ 2025-04-23  9:51   ` Mi, Dapeng
  2025-05-01 20:34     ` mlevitsk
  1 sibling, 1 reply; 21+ messages in thread
From: Mi, Dapeng @ 2025-04-23  9:51 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Sean Christopherson, Dave Hansen, Ingo Molnar, linux-kernel,
	H. Peter Anvin

The shortlog "x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with
access functions" doesn't follow Sean's suggestion
(https://github.com/kvm-x86/linux/blob/next/Documentation/process/maintainer-kvm-x86.rst#shortlog).
Please modify. Thanks.


On 4/16/2025 8:25 AM, Maxim Levitsky wrote:
> Instead of reading and writing GUEST_IA32_DEBUGCTL vmcs field directly,
> wrap the logic with get/set functions.
>
> Also move the checks that the guest's supplied value is valid to the new
> 'set' function.
>
> In particular, the above change fixes a minor security issue in which L1
> hypervisor could set the GUEST_IA32_DEBUGCTL, and eventually the host's
> MSR_IA32_DEBUGCTL to any value by performing a VM entry to L2 with
> VM_ENTRY_LOAD_DEBUG_CONTROLS set.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c    | 15 +++++++---
>  arch/x86/kvm/vmx/pmu_intel.c |  9 +++---
>  arch/x86/kvm/vmx/vmx.c       | 58 +++++++++++++++++++++++-------------
>  arch/x86/kvm/vmx/vmx.h       |  3 ++
>  4 files changed, 57 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index e073e3008b16..b7686569ee09 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2641,6 +2641,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx);
>  	bool load_guest_pdptrs_vmcs12 = false;
> +	u64 new_debugctl;
>  
>  	if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) {
>  		prepare_vmcs02_rare(vmx, vmcs12);
> @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  	if (vmx->nested.nested_run_pending &&
>  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
>  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
> -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> +		new_debugctl = vmcs12->guest_ia32_debugctl;
>  	} else {
>  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
> -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
> +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
>  	}
> +
> +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
> +		*entry_failure_code = ENTRY_FAIL_DEFAULT;
> +		return -EINVAL;
> +	}
> +
>  	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
>  	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>  		vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs);
> @@ -3520,7 +3527,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  
>  	if (!vmx->nested.nested_run_pending ||
>  	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
> -		vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
> +		vmx->nested.pre_vmenter_debugctl = vmx_get_guest_debugctl(vcpu);
>  	if (kvm_mpx_supported() &&
>  	    (!vmx->nested.nested_run_pending ||
>  	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
> @@ -4788,7 +4795,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  	__vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
>  
>  	kvm_set_dr(vcpu, 7, 0x400);
> -	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
> +	vmx_set_guest_debugctl(vcpu, 0, false);
>  
>  	if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
>  				vmcs12->vm_exit_msr_load_count))
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 8a94b52c5731..f6f448adfb80 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -19,6 +19,7 @@
>  #include "lapic.h"
>  #include "nested.h"
>  #include "pmu.h"
> +#include "vmx.h"
>  #include "tdx.h"
>  
>  /*
> @@ -652,11 +653,11 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
>   */
>  static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
>  {
> -	u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL);
> +	u64 data = vmx_get_guest_debugctl(vcpu);
>  
>  	if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) {
>  		data &= ~DEBUGCTLMSR_LBR;
> -		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> +		vmx_set_guest_debugctl(vcpu, data, true);

Two questions.

1. Why call vmx_set_guest_debugctl() to do the extra check? Currently the
IA32_DEBUGCTL MSR is always intercepted and the value is already checked in
vmx_set_msr(), so it seems unnecessary to check it here again.

2. Why is the "host_initiated" argument true? It looks like the data is not
from the host.


>  	}
>  }
>  
> @@ -729,7 +730,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
>  
>  	if (!lbr_desc->event) {
>  		vmx_disable_lbr_msrs_passthrough(vcpu);
> -		if (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR)
> +		if (vmx_get_guest_debugctl(vcpu) & DEBUGCTLMSR_LBR)
>  			goto warn;
>  		if (test_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use))
>  			goto warn;
> @@ -751,7 +752,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
>  
>  static void intel_pmu_cleanup(struct kvm_vcpu *vcpu)
>  {
> -	if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
> +	if (!(vmx_get_guest_debugctl(vcpu) & DEBUGCTLMSR_LBR))
>  		intel_pmu_release_guest_lbr_event(vcpu);
>  }
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef2d7208dd20..4237422dc4ed 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2154,7 +2154,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
>  		break;
>  	case MSR_IA32_DEBUGCTLMSR:
> -		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
> +		msr_info->data = vmx_get_guest_debugctl(vcpu);
>  		break;
>  	default:
>  	find_uret_msr:
> @@ -2194,6 +2194,41 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
>  	return debugctl;
>  }
>  
> +u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
> +{
> +	return vmcs_read64(GUEST_IA32_DEBUGCTL);
> +}
> +
> +static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> +}

IMO, it seems unnecessary to add these 2 wrappers since the original code
is already simple and intuitive enough. But if you want to keep them, please
add "inline" before these 2 wrappers.


> +
> +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)

Since most of the code in this function checks the guest's debugctl value,
it would be better to rename it to "vmx_check_and_set_guest_debugctl".


> +{
> +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
> +
> +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
> +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);

Please add spaces around the 3 "|" operators above.


> +	}
> +
> +	if (invalid)
> +		return false;
> +
> +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
> +					VM_EXIT_SAVE_DEBUG_CONTROLS))
> +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> +
> +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> +	    (data & DEBUGCTLMSR_LBR))
> +		intel_pmu_create_guest_lbr_event(vcpu);
> +
> +	__vmx_set_guest_debugctl(vcpu, data);
> +	return true;
> +}
> +
>  /*
>   * Writes msr value into the appropriate "register".
>   * Returns 0 on success, non-0 otherwise.
> @@ -2263,26 +2298,9 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		vmcs_writel(GUEST_SYSENTER_ESP, data);
>  		break;
>  	case MSR_IA32_DEBUGCTLMSR: {
> -		u64 invalid;
> -
> -		invalid = data & ~vmx_get_supported_debugctl(vcpu, msr_info->host_initiated);
> -		if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> -			kvm_pr_unimpl_wrmsr(vcpu, msr_index, data);
> -			data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> -			invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> -		}
> -
> -		if (invalid)
> +		if (!vmx_set_guest_debugctl(vcpu, data, msr_info->host_initiated))
>  			return 1;
>  
> -		if (is_guest_mode(vcpu) && get_vmcs12(vcpu)->vm_exit_controls &
> -						VM_EXIT_SAVE_DEBUG_CONTROLS)
> -			get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> -
> -		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> -		if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> -		    (data & DEBUGCTLMSR_LBR))
> -			intel_pmu_create_guest_lbr_event(vcpu);
>  		return 0;
>  	}
>  	case MSR_IA32_BNDCFGS:
> @@ -4795,7 +4813,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  	vmcs_write32(GUEST_SYSENTER_CS, 0);
>  	vmcs_writel(GUEST_SYSENTER_ESP, 0);
>  	vmcs_writel(GUEST_SYSENTER_EIP, 0);
> -	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
> +	__vmx_set_guest_debugctl(&vmx->vcpu, 0);
>  
>  	if (cpu_has_vmx_tpr_shadow()) {
>  		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 6d1e40ecc024..8ac46fb47abd 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -404,6 +404,9 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
>  
>  gva_t vmx_get_untagged_addr(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
>  
> +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 value, bool host_initiated);
> +u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu);
> +
>  static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
>  					     int type, bool value)
>  {


* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-04-16  0:25 ` [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode Maxim Levitsky
  2025-04-22 23:41   ` Sean Christopherson
@ 2025-04-23 10:10   ` Mi, Dapeng
  1 sibling, 0 replies; 21+ messages in thread
From: Mi, Dapeng @ 2025-04-23 10:10 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Sean Christopherson, Dave Hansen, Ingo Molnar, linux-kernel,
	H. Peter Anvin


On 4/16/2025 8:25 AM, Maxim Levitsky wrote:
> Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest
> GUEST_IA32_DEBUGCTL without the guest seeing this value.
>
> Note that in the future we might allow the guest to set this bit as well,
> when we implement PMU freezing on VM own, virtual SMM entry.
>
> Since the value of the host DEBUGCTL can in theory change between VM runs,
> check if has changed, and if yes, then reload the GUEST_IA32_DEBUGCTL with
> the new value of the host portion of it (currently only the
> DEBUGCTLMSR_FREEZE_IN_SMM bit)
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/svm/svm.c |  2 ++
>  arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c     |  2 --
>  3 files changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index cc1c721ba067..fda0660236d8 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4271,6 +4271,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>  	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
>  	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
>  
> +	vcpu->arch.host_debugctl = get_debugctlmsr();
> +
>  	/*
>  	 * Disable singlestep if we're injecting an interrupt/exception.
>  	 * We don't want our modified rflags to be pushed on the stack where
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c9208a4acda4..e0bc31598d60 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2194,6 +2194,17 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
>  	return debugctl;
>  }
>  
> +static u64 vmx_get_host_preserved_debugctl(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * Bits of host's DEBUGCTL that we should preserve while the guest is
> +	 * running.
> +	 *
> +	 * Some of those bits might still be emulated for the guest own use.
> +	 */
> +	return DEBUGCTLMSR_FREEZE_IN_SMM;
> +}

It seems unnecessary to define a function here; a macro is enough.

#define HOST_DEBUGCTL_PRESERVE_BITS        (DEBUGCTLMSR_FREEZE_IN_SMM)


> +
>  u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
>  {
>  	return to_vmx(vcpu)->msr_ia32_debugctl;
> @@ -2202,9 +2213,11 @@ u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
>  static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	u64 host_mask = vmx_get_host_preserved_debugctl(vcpu);
>  
>  	vmx->msr_ia32_debugctl = data;
> -	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> +	vmcs_write64(GUEST_IA32_DEBUGCTL,
> +		     (vcpu->arch.host_debugctl & host_mask) | (data & ~host_mask));
>  }
>  
>  bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> @@ -2232,6 +2245,7 @@ bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated
>  	return true;
>  }
>  
> +
>  /*
>   * Writes msr value into the appropriate "register".
>   * Returns 0 on success, non-0 otherwise.
> @@ -7349,6 +7363,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	unsigned long cr3, cr4;
> +	u64 old_debugctl;
>  
>  	/* Record the guest's net vcpu time for enforced NMI injections. */
>  	if (unlikely(!enable_vnmi &&
> @@ -7379,6 +7394,17 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
>  	}
>  
> +	old_debugctl = vcpu->arch.host_debugctl;
> +	vcpu->arch.host_debugctl = get_debugctlmsr();
> +
> +	/*
> +	 * In case the host DEBUGCTL had changed since the last time we
> +	 * read it, update the guest's GUEST_IA32_DEBUGCTL with
> +	 * the host's bits.
> +	 */
> +	if (old_debugctl != vcpu->arch.host_debugctl)
> +		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);
> +
>  	/*
>  	 * We did this in prepare_switch_to_guest, because it needs to
>  	 * be within srcu_read_lock.
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 844e81ee1d96..05e866ed345d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11020,8 +11020,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		set_debugreg(0, 7);
>  	}
>  
> -	vcpu->arch.host_debugctl = get_debugctlmsr();
> -
>  	guest_timing_enter_irqoff();
>  
>  	for (;;) {


* Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-04-23  9:51   ` Mi, Dapeng
@ 2025-05-01 20:34     ` mlevitsk
  2025-05-07  5:17       ` Mi, Dapeng
  0 siblings, 1 reply; 21+ messages in thread
From: mlevitsk @ 2025-05-01 20:34 UTC (permalink / raw)
  To: Mi, Dapeng, kvm
  Cc: Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Sean Christopherson, Dave Hansen, Ingo Molnar, linux-kernel,
	H. Peter Anvin

On Wed, 2025-04-23 at 17:51 +0800, Mi, Dapeng wrote:
> The shortlog "x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with
> access functions" doesn't follow Sean's suggestion
> (https://github.com/kvm-x86/linux/blob/next/Documentation/process/maintainer-kvm-x86.rst#shortlog).
> Please modify. Thanks.
> 
> 
> On 4/16/2025 8:25 AM, Maxim Levitsky wrote:
> > Instead of reading and writing GUEST_IA32_DEBUGCTL vmcs field directly,
> > wrap the logic with get/set functions.
> > 
> > Also move the checks that the guest's supplied value is valid to the new
> > 'set' function.
> > 
> > In particular, the above change fixes a minor security issue in which L1
> > hypervisor could set the GUEST_IA32_DEBUGCTL, and eventually the host's
> > MSR_IA32_DEBUGCTL to any value by performing a VM entry to L2 with
> > VM_ENTRY_LOAD_DEBUG_CONTROLS set.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/vmx/nested.c    | 15 +++++++---
> >  arch/x86/kvm/vmx/pmu_intel.c |  9 +++---
> >  arch/x86/kvm/vmx/vmx.c       | 58 +++++++++++++++++++++++-------------
> >  arch/x86/kvm/vmx/vmx.h       |  3 ++
> >  4 files changed, 57 insertions(+), 28 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index e073e3008b16..b7686569ee09 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -2641,6 +2641,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx);
> >  	bool load_guest_pdptrs_vmcs12 = false;
> > +	u64 new_debugctl;
> >  
> >  	if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) {
> >  		prepare_vmcs02_rare(vmx, vmcs12);
> > @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >  	if (vmx->nested.nested_run_pending &&
> >  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
> >  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
> > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> > +		new_debugctl = vmcs12->guest_ia32_debugctl;
> >  	} else {
> >  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
> > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
> > +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
> >  	}
> > +
> > +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
> > +		*entry_failure_code = ENTRY_FAIL_DEFAULT;
> > +		return -EINVAL;
> > +	}
> > +
> >  	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
> >  	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
> >  		vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs);
> > @@ -3520,7 +3527,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> >  
> >  	if (!vmx->nested.nested_run_pending ||
> >  	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
> > -		vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
> > +		vmx->nested.pre_vmenter_debugctl = vmx_get_guest_debugctl(vcpu);
> >  	if (kvm_mpx_supported() &&
> >  	    (!vmx->nested.nested_run_pending ||
> >  	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
> > @@ -4788,7 +4795,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >  	__vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
> >  
> >  	kvm_set_dr(vcpu, 7, 0x400);
> > -	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
> > +	vmx_set_guest_debugctl(vcpu, 0, false);
> >  
> >  	if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
> >  				vmcs12->vm_exit_msr_load_count))
> > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > index 8a94b52c5731..f6f448adfb80 100644
> > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > @@ -19,6 +19,7 @@
> >  #include "lapic.h"
> >  #include "nested.h"
> >  #include "pmu.h"
> > +#include "vmx.h"
> >  #include "tdx.h"
> >  
> >  /*
> > @@ -652,11 +653,11 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
> >   */
> >  static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
> >  {
> > -	u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL);
> > +	u64 data = vmx_get_guest_debugctl(vcpu);
> >  
> >  	if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) {
> >  		data &= ~DEBUGCTLMSR_LBR;
> > -		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> > +		vmx_set_guest_debugctl(vcpu, data, true);
> 
> Two questions. 
> 
> 1. why to call vmx_set_guest_debugctl() to do the extra check? currently
> IA32_DEBUGCTL MSR is always intercepted and it's already checked at
> vmx_set_msr() and seems unnecessary to check here again.

Hi,


I wanted this to be consistent. KVM has plenty of functions that can be both
guest triggered and internally triggered, for example kvm_set_cr4().

Besides, vmx_set_guest_debugctl also records the value the guest wrote, so
that it can be returned to the guest if we choose to override some bits of
the MSR, so it made sense to have one common function to set the MSR.

Do you think that can affect performance? 


> 
> 2. why the argument "host_initiated" is true? It looks the data is not from
> host.

This is my mistake.

> 
> 
> >  	}
> >  }
> >  
> > @@ -729,7 +730,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
> >  
> >  	if (!lbr_desc->event) {
> >  		vmx_disable_lbr_msrs_passthrough(vcpu);
> > -		if (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR)
> > +		if (vmx_get_guest_debugctl(vcpu) & DEBUGCTLMSR_LBR)
> >  			goto warn;
> >  		if (test_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use))
> >  			goto warn;
> > @@ -751,7 +752,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
> >  
> >  static void intel_pmu_cleanup(struct kvm_vcpu *vcpu)
> >  {
> > -	if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
> > +	if (!(vmx_get_guest_debugctl(vcpu) & DEBUGCTLMSR_LBR))
> >  		intel_pmu_release_guest_lbr_event(vcpu);
> >  }
> >  
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index ef2d7208dd20..4237422dc4ed 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -2154,7 +2154,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  			msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
> >  		break;
> >  	case MSR_IA32_DEBUGCTLMSR:
> > -		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
> > +		msr_info->data = vmx_get_guest_debugctl(vcpu);
> >  		break;
> >  	default:
> >  	find_uret_msr:
> > @@ -2194,6 +2194,41 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
> >  	return debugctl;
> >  }
> >  
> > +u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
> > +{
> > +	return vmcs_read64(GUEST_IA32_DEBUGCTL);
> > +}
> > +
> > +static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
> > +{
> > +	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> > +}
> 
> IMO,  it seems unnecessary to add these 2  wrappers since the original code
> is already intuitive enough and simple. But if you want, please add
> "inline" before these 2 wrappers.

In the next patch, __vmx_set_guest_debugctl will store the written value in
a field, which is why I did it this way.

vmx_get_guest_debugctl will read this value instead, also in the next patch.

I thought it would be cleaner to first introduce the trivial wrappers and then
extend them.

> 
> 
> > +
> > +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> 
> Since most of code in this function checks guest debugctl, better to rename
> it to "vmx_check_and_set_guest_debugctl".

I don't mind doing so.

> 
> 
> > +{
> > +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
> > +
> > +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> > +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
> > +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> 
> Add space around above 3 "|".

I copied this code "as is" from the wrmsr code. I can add this though.

Best regards,
	Maxim Levitsky

> 
> 
> > +	}
> > +
> > +	if (invalid)
> > +		return false;
> > +
> > +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
> > +					VM_EXIT_SAVE_DEBUG_CONTROLS))
> > +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> > +
> > +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> > +	    (data & DEBUGCTLMSR_LBR))
> > +		intel_pmu_create_guest_lbr_event(vcpu);
> > +
> > +	__vmx_set_guest_debugctl(vcpu, data);
> > +	return true;
> > +}
> > +
> >  /*
> >   * Writes msr value into the appropriate "register".
> >   * Returns 0 on success, non-0 otherwise.
> > @@ -2263,26 +2298,9 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  		vmcs_writel(GUEST_SYSENTER_ESP, data);
> >  		break;
> >  	case MSR_IA32_DEBUGCTLMSR: {
> > -		u64 invalid;
> > -
> > -		invalid = data & ~vmx_get_supported_debugctl(vcpu, msr_info->host_initiated);
> > -		if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> > -			kvm_pr_unimpl_wrmsr(vcpu, msr_index, data);
> > -			data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > -			invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > -		}
> > -
> > -		if (invalid)
> > +		if (!vmx_set_guest_debugctl(vcpu, data, msr_info->host_initiated))
> >  			return 1;
> >  
> > -		if (is_guest_mode(vcpu) && get_vmcs12(vcpu)->vm_exit_controls &
> > -						VM_EXIT_SAVE_DEBUG_CONTROLS)
> > -			get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> > -
> > -		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> > -		if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> > -		    (data & DEBUGCTLMSR_LBR))
> > -			intel_pmu_create_guest_lbr_event(vcpu);
> >  		return 0;
> >  	}
> >  	case MSR_IA32_BNDCFGS:
> > @@ -4795,7 +4813,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> >  	vmcs_write32(GUEST_SYSENTER_CS, 0);
> >  	vmcs_writel(GUEST_SYSENTER_ESP, 0);
> >  	vmcs_writel(GUEST_SYSENTER_EIP, 0);
> > -	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
> > +	__vmx_set_guest_debugctl(&vmx->vcpu, 0);
> >  
> >  	if (cpu_has_vmx_tpr_shadow()) {
> >  		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
> > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > index 6d1e40ecc024..8ac46fb47abd 100644
> > --- a/arch/x86/kvm/vmx/vmx.h
> > +++ b/arch/x86/kvm/vmx/vmx.h
> > @@ -404,6 +404,9 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
> >  
> >  gva_t vmx_get_untagged_addr(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> >  
> > +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 value, bool host_initiated);
> > +u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu);
> > +
> >  static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
> >  					     int type, bool value)
> >  {



* Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-04-22 23:33   ` Sean Christopherson
@ 2025-05-01 20:35     ` mlevitsk
  2025-05-07 17:18       ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: mlevitsk @ 2025-05-01 20:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Tue, 2025-04-22 at 16:33 -0700, Sean Christopherson wrote:
> On Tue, Apr 15, 2025, Maxim Levitsky wrote:
> > Instead of reading and writing GUEST_IA32_DEBUGCTL vmcs field directly,
> > wrap the logic with get/set functions.
> 
> Why?  I know why the "set" helper is being added, but it needs to called out.
> 
> Please omit the getter entirely, it does nothing more than obfuscate a very
> simple line of code.

In this patch, yes. But in the next patch I switch to reading from
'vmx->msr_ia32_debugctl'. Do you want me to open code this access? I don't
mind if you insist.

> 
> > Also move the checks that the guest's supplied value is valid to the new
> > 'set' function.
> 
> Please do this in a separate patch.  There's no need to mix refactoring and
> functional changes.

I thought that it was natural to do this in the same patch. In this patch I
introduce 'vmx_set_guest_debugctl', which should be used any time we set the
MSR given a guest value, and VM entry is one of those cases.

I can split this if you want.

> 
> > In particular, the above change fixes a minor security issue in which L1
> 
> Bug, yes.  Not sure it constitutes a meaningful security issue though.

I also think so, but I wanted to mention this just in case.

> 
> > hypervisor could set the GUEST_IA32_DEBUGCTL, and eventually the host's
> > MSR_IA32_DEBUGCTL
> 
> No, the lack of a consistency check allows the guest to set the MSR in hardware,
> but that is not the host's value.

That's what I meant - the guest can set the real hardware MSR. Yes, after the
guest exits, the OS value is restored. I'll rephrase this in v2.

> 
> > to any value by performing a VM entry to L2 with VM_ENTRY_LOAD_DEBUG_CONTROLS
> > set.
> 
> Any *legal* value.  Setting completely unsupported bits will result in VM-Enter
> failing with a consistency check VM-Exit.

True.

> 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/vmx/nested.c    | 15 +++++++---
> >  arch/x86/kvm/vmx/pmu_intel.c |  9 +++---
> >  arch/x86/kvm/vmx/vmx.c       | 58 +++++++++++++++++++++++-------------
> >  arch/x86/kvm/vmx/vmx.h       |  3 ++
> >  4 files changed, 57 insertions(+), 28 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index e073e3008b16..b7686569ee09 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -2641,6 +2641,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx);
> >  	bool load_guest_pdptrs_vmcs12 = false;
> > +	u64 new_debugctl;
> >  
> >  	if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) {
> >  		prepare_vmcs02_rare(vmx, vmcs12);
> > @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >  	if (vmx->nested.nested_run_pending &&
> >  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
> >  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
> > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> > +		new_debugctl = vmcs12->guest_ia32_debugctl;
> >  	} else {
> >  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
> > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
> > +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
> >  	}
> > +
> > +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
> 
> The consistency check belongs in nested_vmx_check_guest_state(), only needs to
> check the VM_ENTRY_LOAD_DEBUG_CONTROLS case, and should be posted as a separate
> patch.

I can move it there. Can you explain why you want this, though? Is it because of the
order of checks specified in the PRM?

Currently GUEST_IA32_DEBUGCTL of the host is *written* in prepare_vmcs02. 
Should I also move this write to nested_vmx_check_guest_state?

Or should I write the value blindly in prepare_vmcs02 and then check the value
of 'vmx->msr_ia32_debugctl' in nested_vmx_check_guest_state and fail if the value
contains reserved bits? 
I don't like that idea that much IMHO.


> 
> > +		*entry_failure_code = ENTRY_FAIL_DEFAULT;
> > +		return -EINVAL;
> > +	}
> > +
> > +static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
> > +{
> > +	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> > +}
> > +
> > +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> > +{
> > +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
> > +
> > +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> > +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
> > +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > +	}
> > +
> > +	if (invalid)
> > +		return false;
> > +
> > +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
> > +					VM_EXIT_SAVE_DEBUG_CONTROLS))
> > +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> > +
> > +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> > +	    (data & DEBUGCTLMSR_LBR))
> > +		intel_pmu_create_guest_lbr_event(vcpu);
> > +
> > +	__vmx_set_guest_debugctl(vcpu, data);
> > +	return true;
> 
> Return 0/-errno, not true/false.

There are plenty of functions in this file and in KVM that return a boolean.

e.g: 

static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
static bool nested_evmcs_handle_vmclear(struct kvm_vcpu *vcpu, gpa_t vmptr)

static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
						 struct vmcs12 *vmcs12)


static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)

...


I personally think that functions that emulate hardware should return boolean values
or some hardware-specific status code (e.g. a VMX failure code), because real hardware
never returns -EINVAL and the like.


Best regards,
	Maxim Levitsky




> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-04-22 23:41   ` Sean Christopherson
@ 2025-05-01 20:41     ` mlevitsk
  2025-05-01 20:53       ` mlevitsk
  0 siblings, 1 reply; 21+ messages in thread
From: mlevitsk @ 2025-05-01 20:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Tue, 2025-04-22 at 16:41 -0700, Sean Christopherson wrote:
> On Tue, Apr 15, 2025, Maxim Levitsky wrote:
> > Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest
> > GUEST_IA32_DEBUGCTL without the guest seeing this value.
> > 
> > Note that in the future we might allow the guest to set this bit as well,
> > when we implement PMU freezing on VM own, virtual SMM entry.
> > 
> > Since the value of the host DEBUGCTL can in theory change between VM runs,
> > check if has changed, and if yes, then reload the GUEST_IA32_DEBUGCTL with
> > the new value of the host portion of it (currently only the
> > DEBUGCTLMSR_FREEZE_IN_SMM bit)
> 
> No, it can't.  DEBUGCTLMSR_FREEZE_IN_SMM can be toggled via IPI callback, but
> IRQs are disabled for the entirety of the inner run loop.  And if I'm somehow
> wrong, this change movement absolutely belongs in a separate patch.
> 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/svm/svm.c |  2 ++
> >  arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++++-
> >  arch/x86/kvm/x86.c     |  2 --
> >  3 files changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index cc1c721ba067..fda0660236d8 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4271,6 +4271,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
> >  	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
> >  	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
> >  
> > +	vcpu->arch.host_debugctl = get_debugctlmsr();
> > +
> >  	/*
> >  	 * Disable singlestep if we're injecting an interrupt/exception.
> >  	 * We don't want our modified rflags to be pushed on the stack where
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index c9208a4acda4..e0bc31598d60 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -2194,6 +2194,17 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
> >  	return debugctl;
> >  }
> >  
> > +static u64 vmx_get_host_preserved_debugctl(struct kvm_vcpu *vcpu)
> 
> No, just open code handling DEBUGCTLMSR_FREEZE_IN_SMM, or make it a #define.
> I'm not remotely convinced that we'll ever want to emulate DEBUGCTLMSR_FREEZE_IN_SMM,
> and trying to plan for that possibility and adds complexity for no immediate value.

Hi,

The problem here is a bit different: we are indeed very unlikely to emulate
DEBUGCTLMSR_FREEZE_IN_SMM. However, when I wrote this patch I was sure that this bit is
mandatory with PMU version 2 or higher, but it looks like it is optional after all:

"
Note that system software must check if the processor supports the IA32_DEBUGCTL.FREEZE_WHILE_SMM
control bit. IA32_DEBUGCTL.FREEZE_WHILE_SMM is supported if
IA32_PERF_CAPABILITIES.FREEZE_WHILE_SMM[Bit 12] is reporting 1. See Section 20.8 for details of detecting the presence of
IA32_PERF_CAPABILITIES MSR."

KVM indeed doesn't set the bit 12 of IA32_PERF_CAPABILITIES.

However, note that the Linux kernel silently sets this bit without checking the aforementioned capability 
bit and ends up with a #GP exception, which it silently ignores.... (I checked this with a trace...)

This led me to believe that this bit should be unconditionally supported,
meaning that KVM should at least fake setting it without triggering a #GP.

Since that is not the case, I can revert to the simpler model of exclusively using GUEST_IA32_DEBUGCTL
while hiding the bit from the guest; however, I do vote to keep the guest/host separation.
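
For clarity, the read side of that "simpler model" would be roughly this - just
mask the host-owned bit out of the raw VMCS value. Sketch only, the helper name
is made up:

static u64 vmx_guest_visible_debugctl(struct kvm_vcpu *vcpu)
{
	/*
	 * The VMCS field also carries the host-owned FREEZE_IN_SMM bit,
	 * so hide it from the guest on MSR reads.
	 */
	return vmcs_read64(GUEST_IA32_DEBUGCTL) & ~DEBUGCTLMSR_FREEZE_IN_SMM;
}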

> 
> > +{
> > +	/*
> > +	 * Bits of host's DEBUGCTL that we should preserve while the guest is
> > +	 * running.
> > +	 *
> > +	 * Some of those bits might still be emulated for the guest own use.
> > +	 */
> > +	return DEBUGCTLMSR_FREEZE_IN_SMM;
> > 
> >  u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
> >  {
> >  	return to_vmx(vcpu)->msr_ia32_debugctl;
> > @@ -2202,9 +2213,11 @@ u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
> >  static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
> >  {
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	u64 host_mask = vmx_get_host_preserved_debugctl(vcpu);
> >  
> >  	vmx->msr_ia32_debugctl = data;
> > -	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> > +	vmcs_write64(GUEST_IA32_DEBUGCTL,
> > +		     (vcpu->arch.host_debugctl & host_mask) | (data & ~host_mask));
> >  }
> >  
> >  bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> > @@ -2232,6 +2245,7 @@ bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated
> >  	return true;
> >  }
> >  
> > +
> 
> Spurious newline.
> 
> >  /*
> >   * Writes msr value into the appropriate "register".
> >   * Returns 0 on success, non-0 otherwise.
> > @@ -7349,6 +7363,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> >  {
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	unsigned long cr3, cr4;
> > +	u64 old_debugctl;
> >  
> >  	/* Record the guest's net vcpu time for enforced NMI injections. */
> >  	if (unlikely(!enable_vnmi &&
> > @@ -7379,6 +7394,17 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> >  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
> >  	}
> >  
> > +	old_debugctl = vcpu->arch.host_debugctl;
> > +	vcpu->arch.host_debugctl = get_debugctlmsr();
> > +
> > +	/*
> > +	 * In case the host DEBUGCTL had changed since the last time we
> > +	 * read it, update the guest's GUEST_IA32_DEBUGCTL with
> > +	 * the host's bits.
> > +	 */
> > +	if (old_debugctl != vcpu->arch.host_debugctl)
> 
> This can and should be optimized to only do an update if a host-preserved bit
> is toggled.

True, I will do this in the next version.

> 
> > +		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);
> 
> I would rather have a helper that explicitly writes the VMCS field, not one that
> sets the guest value *and* writes the VMCS field.

> 
> The usage in init_vmcs() doesn't need to write vmx->msr_ia32_debugctl because the
> vCPU is zero allocated, and this usage doesn't change vmx->msr_ia32_debugctl.
> So the only path that actually needs to modify vmx->msr_ia32_debugctl is
> vmx_set_guest_debugctl().


But what about nested entry? Nested entry pretty much sets the MSR to a value given by the guest.

Also, technically intel_pmu_legacy_freezing_lbrs_on_pmi() changes the guest value by emulating what the real hardware does.

Best regards,
	Maxim Levitsky


> 
> > +
> >  	/*
> >  	 * We did this in prepare_switch_to_guest, because it needs to
> >  	 * be within srcu_read_lock.
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 844e81ee1d96..05e866ed345d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11020,8 +11020,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  		set_debugreg(0, 7);
> >  	}
> >  
> > -	vcpu->arch.host_debugctl = get_debugctlmsr();
> > -
> >  	guest_timing_enter_irqoff();
> >  
> >  	for (;;) {
> > -- 
> > 2.26.3
> > 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-05-01 20:41     ` mlevitsk
@ 2025-05-01 20:53       ` mlevitsk
  2025-05-07  5:27         ` Mi, Dapeng
  2025-05-07 23:03         ` Sean Christopherson
  0 siblings, 2 replies; 21+ messages in thread
From: mlevitsk @ 2025-05-01 20:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Thu, 2025-05-01 at 16:41 -0400, mlevitsk@redhat.com wrote:
> On Tue, 2025-04-22 at 16:41 -0700, Sean Christopherson wrote:
> > On Tue, Apr 15, 2025, Maxim Levitsky wrote:
> > > Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest
> > > GUEST_IA32_DEBUGCTL without the guest seeing this value.
> > > 
> > > Note that in the future we might allow the guest to set this bit as well,
> > > when we implement PMU freezing on VM own, virtual SMM entry.
> > > 
> > > Since the value of the host DEBUGCTL can in theory change between VM runs,
> > > check if has changed, and if yes, then reload the GUEST_IA32_DEBUGCTL with
> > > the new value of the host portion of it (currently only the
> > > DEBUGCTLMSR_FREEZE_IN_SMM bit)
> > 
> > No, it can't.  DEBUGCTLMSR_FREEZE_IN_SMM can be toggled via IPI callback, but
> > IRQs are disabled for the entirety of the inner run loop.  And if I'm somehow
> > wrong, this change movement absolutely belongs in a separate patch.


Hi,

You are right here - reading MSR_IA32_DEBUGCTLMSR in the inner loop is a performance
regression.


Any ideas on how to solve this, then? Since currently it's the common code that
reads the current value of MSR_IA32_DEBUGCTLMSR, and it doesn't leave any indication
of whether the value changed, I can do one of the following:

1. Store the old value as well, something like 'vcpu->arch.host_debugctl_old'. Ugly IMHO.

2. Add DEBUG_CTL to the set of 'dirty' registers, e.g. add a new bit for kvm_register_mark_dirty().
It looks like a bit of overkill to me.

3. Add a new x86 callback, something like .sync_debugctl(). I vote for this option.

What do you think/prefer?

Best regards,
	Maxim Levitsky

> > 
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > ---
> > >  arch/x86/kvm/svm/svm.c |  2 ++
> > >  arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++++-
> > >  arch/x86/kvm/x86.c     |  2 --
> > >  3 files changed, 29 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index cc1c721ba067..fda0660236d8 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -4271,6 +4271,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
> > >  	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
> > >  	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
> > >  
> > > +	vcpu->arch.host_debugctl = get_debugctlmsr();
> > > +
> > >  	/*
> > >  	 * Disable singlestep if we're injecting an interrupt/exception.
> > >  	 * We don't want our modified rflags to be pushed on the stack where
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index c9208a4acda4..e0bc31598d60 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -2194,6 +2194,17 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
> > >  	return debugctl;
> > >  }
> > >  
> > > +static u64 vmx_get_host_preserved_debugctl(struct kvm_vcpu *vcpu)
> > 
> > No, just open code handling DEBUGCTLMSR_FREEZE_IN_SMM, or make it a #define.
> > I'm not remotely convinced that we'll ever want to emulate DEBUGCTLMSR_FREEZE_IN_SMM,
> > and trying to plan for that possibility and adds complexity for no immediate value.
> 
> Hi,
> 
> The problem here is a bit different: we indeed are very unlikely to emulate the
> DEBUGCTLMSR_FREEZE_IN_SMM but however, when I wrote this patch I was sure that this bit is 
> mandatory with PMU version of 2 or more,  but looks like it is optional after all:
> 
> "
> Note that system software must check if the processor supports the IA32_DEBUGCTL.FREEZE_WHILE_SMM
> control bit. IA32_DEBUGCTL.FREEZE_WHILE_SMM is supported if IA32_PERF_CAPABIL-
> ITIES.FREEZE_WHILE_SMM[Bit 12] is reporting 1. See Section 20.8 for details of detecting the presence of
> IA32_PERF_CAPABILITIES MSR."
> 
> KVM indeed doesn't set the bit 12 of IA32_PERF_CAPABILITIES.
> 
> However, note that the Linux kernel silently sets this bit without checking the aforementioned capability 
> bit and ends up with a #GP exception, which it silently ignores.... (I checked this with a trace...)
> 
> This led me to believe that this bit should be unconditionally supported,
> meaning that KVM should at least fake setting it without triggering a #GP.
> 
> Since that is not the case, I can revert to the simpler model of exclusively using GUEST_IA32_DEBUGCTL 
> while hiding the bit from the guest, however I do vote to keep the guest/host separation.
> 
> > 
> > > +{
> > > +	/*
> > > +	 * Bits of host's DEBUGCTL that we should preserve while the guest is
> > > +	 * running.
> > > +	 *
> > > +	 * Some of those bits might still be emulated for the guest own use.
> > > +	 */
> > > +	return DEBUGCTLMSR_FREEZE_IN_SMM;
> > > 
> > >  u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
> > >  {
> > >  	return to_vmx(vcpu)->msr_ia32_debugctl;
> > > @@ -2202,9 +2213,11 @@ u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
> > >  static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
> > >  {
> > >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > +	u64 host_mask = vmx_get_host_preserved_debugctl(vcpu);
> > >  
> > >  	vmx->msr_ia32_debugctl = data;
> > > -	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> > > +	vmcs_write64(GUEST_IA32_DEBUGCTL,
> > > +		     (vcpu->arch.host_debugctl & host_mask) | (data & ~host_mask));
> > >  }
> > >  
> > >  bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> > > @@ -2232,6 +2245,7 @@ bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated
> > >  	return true;
> > >  }
> > >  
> > > +
> > 
> > Spurious newline.
> > 
> > >  /*
> > >   * Writes msr value into the appropriate "register".
> > >   * Returns 0 on success, non-0 otherwise.
> > > @@ -7349,6 +7363,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > >  {
> > >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > >  	unsigned long cr3, cr4;
> > > +	u64 old_debugctl;
> > >  
> > >  	/* Record the guest's net vcpu time for enforced NMI injections. */
> > >  	if (unlikely(!enable_vnmi &&
> > > @@ -7379,6 +7394,17 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > >  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
> > >  	}
> > >  
> > > +	old_debugctl = vcpu->arch.host_debugctl;
> > > +	vcpu->arch.host_debugctl = get_debugctlmsr();
> > > +
> > > +	/*
> > > +	 * In case the host DEBUGCTL had changed since the last time we
> > > +	 * read it, update the guest's GUEST_IA32_DEBUGCTL with
> > > +	 * the host's bits.
> > > +	 */
> > > +	if (old_debugctl != vcpu->arch.host_debugctl)
> > 
> > This can and should be optimized to only do an update if a host-preserved bit
> > is toggled.
> 
> True, I will do this in the next version.
> 
> > 
> > > +		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);
> > 
> > I would rather have a helper that explicitly writes the VMCS field, not one that
> > sets the guest value *and* writes the VMCS field.
> 
> > 
> > The usage in init_vmcs() doesn't need to write vmx->msr_ia32_debugctl because the
> > vCPU is zero allocated, and this usage doesn't change vmx->msr_ia32_debugctl.
> > So the only path that actually needs to modify vmx->msr_ia32_debugctl is
> > vmx_set_guest_debugctl().
> 
> 
> But what about nested entry? nested entry pretty much sets the MSR to a value given by the guest.
> 
> Also technically the intel_pmu_legacy_freezing_lbrs_on_pmi also changes the guest value by emulating what the real hardware does.
> 
> Best regards,
> 	Maxim Levitsky
> 
> 
> > 
> > > +
> > >  	/*
> > >  	 * We did this in prepare_switch_to_guest, because it needs to
> > >  	 * be within srcu_read_lock.
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 844e81ee1d96..05e866ed345d 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -11020,8 +11020,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >  		set_debugreg(0, 7);
> > >  	}
> > >  
> > > -	vcpu->arch.host_debugctl = get_debugctlmsr();
> > > -
> > >  	guest_timing_enter_irqoff();
> > >  
> > >  	for (;;) {
> > > -- 
> > > 2.26.3
> > > 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-05-01 20:34     ` mlevitsk
@ 2025-05-07  5:17       ` Mi, Dapeng
  0 siblings, 0 replies; 21+ messages in thread
From: Mi, Dapeng @ 2025-05-07  5:17 UTC (permalink / raw)
  To: mlevitsk, kvm
  Cc: Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Sean Christopherson, Dave Hansen, Ingo Molnar, linux-kernel,
	H. Peter Anvin


On 5/2/2025 4:34 AM, mlevitsk@redhat.com wrote:
> On Wed, 2025-04-23 at 17:51 +0800, Mi, Dapeng wrote:
>> The shortlog "x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with
>> access functions" doesn't follow Sean's suggestion
>> (https://github.com/kvm-x86/linux/blob/next/Documentation/process/maintainer-kvm-x86.rst#shortlog).
>> Please modify. Thanks.
>>
>>
>> On 4/16/2025 8:25 AM, Maxim Levitsky wrote:
>>> Instead of reading and writing GUEST_IA32_DEBUGCTL vmcs field directly,
>>> wrap the logic with get/set functions.
>>>
>>> Also move the checks that the guest's supplied value is valid to the new
>>> 'set' function.
>>>
>>> In particular, the above change fixes a minor security issue in which L1
>>> hypervisor could set the GUEST_IA32_DEBUGCTL, and eventually the host's
>>> MSR_IA32_DEBUGCTL to any value by performing a VM entry to L2 with
>>> VM_ENTRY_LOAD_DEBUG_CONTROLS set.
>>>
>>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>>> ---
>>>  arch/x86/kvm/vmx/nested.c    | 15 +++++++---
>>>  arch/x86/kvm/vmx/pmu_intel.c |  9 +++---
>>>  arch/x86/kvm/vmx/vmx.c       | 58 +++++++++++++++++++++++-------------
>>>  arch/x86/kvm/vmx/vmx.h       |  3 ++
>>>  4 files changed, 57 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>> index e073e3008b16..b7686569ee09 100644
>>> --- a/arch/x86/kvm/vmx/nested.c
>>> +++ b/arch/x86/kvm/vmx/nested.c
>>> @@ -2641,6 +2641,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>  	struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx);
>>>  	bool load_guest_pdptrs_vmcs12 = false;
>>> +	u64 new_debugctl;
>>>  
>>>  	if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) {
>>>  		prepare_vmcs02_rare(vmx, vmcs12);
>>> @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>>  	if (vmx->nested.nested_run_pending &&
>>>  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
>>>  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
>>> -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
>>> +		new_debugctl = vmcs12->guest_ia32_debugctl;
>>>  	} else {
>>>  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
>>> -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
>>> +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
>>>  	}
>>> +
>>> +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
>>> +		*entry_failure_code = ENTRY_FAIL_DEFAULT;
>>> +		return -EINVAL;
>>> +	}
>>> +
>>>  	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
>>>  	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>>>  		vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs);
>>> @@ -3520,7 +3527,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>>>  
>>>  	if (!vmx->nested.nested_run_pending ||
>>>  	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
>>> -		vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
>>> +		vmx->nested.pre_vmenter_debugctl = vmx_get_guest_debugctl(vcpu);
>>>  	if (kvm_mpx_supported() &&
>>>  	    (!vmx->nested.nested_run_pending ||
>>>  	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>>> @@ -4788,7 +4795,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>>>  	__vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
>>>  
>>>  	kvm_set_dr(vcpu, 7, 0x400);
>>> -	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
>>> +	vmx_set_guest_debugctl(vcpu, 0, false);
>>>  
>>>  	if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
>>>  				vmcs12->vm_exit_msr_load_count))
>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>> index 8a94b52c5731..f6f448adfb80 100644
>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>> @@ -19,6 +19,7 @@
>>>  #include "lapic.h"
>>>  #include "nested.h"
>>>  #include "pmu.h"
>>> +#include "vmx.h"
>>>  #include "tdx.h"
>>>  
>>>  /*
>>> @@ -652,11 +653,11 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
>>>   */
>>>  static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
>>>  {
>>> -	u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL);
>>> +	u64 data = vmx_get_guest_debugctl(vcpu);
>>>  
>>>  	if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) {
>>>  		data &= ~DEBUGCTLMSR_LBR;
>>> -		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
>>> +		vmx_set_guest_debugctl(vcpu, data, true);
>> Two questions. 
>>
>> 1. why to call vmx_set_guest_debugctl() to do the extra check? currently
>> IA32_DEBUGCTL MSR is always intercepted and it's already checked at
>> vmx_set_msr() and seems unnecessary to check here again.
> Hi,
>
>
> I wanted this to be consistent. KVM has plenty of functions that can be both
> guest triggered and internally triggered. For example kvm_set_cr4()
>
> Besides the vmx_set_guest_debugctl also notes the value the guest wrote
> to be able to return it back to the guest if we choose to overide some
> bits of the MSR, so it made sense to have one common function to set the msr.
>
> Do you think that can affect performance? 

Hmm, since only the DEBUGCTLMSR_LBR bit is changed here, it's safe to skip this
check and write the guest debugctl directly. I have no idea how much
performance impact this check would have at a high sampling frequency, but
why not eliminate it if we can?
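
I.e. something like this (rough sketch; it assumes __vmx_set_guest_debugctl()
from this series is made visible to pmu_intel.c, or an equivalent raw setter):

static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
{
	u64 data = vmx_get_guest_debugctl(vcpu);

	/* Only DEBUGCTLMSR_LBR is cleared, so no need to re-validate. */
	if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)
		__vmx_set_guest_debugctl(vcpu, data & ~DEBUGCTLMSR_LBR);
}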


>
>
>> 2. why the argument "host_initiated" is true? It looks the data is not from
>> host.
> This is my mistake.
>
>>
>>>  	}
>>>  }
>>>  
>>> @@ -729,7 +730,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
>>>  
>>>  	if (!lbr_desc->event) {
>>>  		vmx_disable_lbr_msrs_passthrough(vcpu);
>>> -		if (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR)
>>> +		if (vmx_get_guest_debugctl(vcpu) & DEBUGCTLMSR_LBR)
>>>  			goto warn;
>>>  		if (test_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use))
>>>  			goto warn;
>>> @@ -751,7 +752,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
>>>  
>>>  static void intel_pmu_cleanup(struct kvm_vcpu *vcpu)
>>>  {
>>> -	if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
>>> +	if (!(vmx_get_guest_debugctl(vcpu) & DEBUGCTLMSR_LBR))
>>>  		intel_pmu_release_guest_lbr_event(vcpu);
>>>  }
>>>  
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index ef2d7208dd20..4237422dc4ed 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -2154,7 +2154,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>  			msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
>>>  		break;
>>>  	case MSR_IA32_DEBUGCTLMSR:
>>> -		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
>>> +		msr_info->data = vmx_get_guest_debugctl(vcpu);
>>>  		break;
>>>  	default:
>>>  	find_uret_msr:
>>> @@ -2194,6 +2194,41 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
>>>  	return debugctl;
>>>  }
>>>  
>>> +u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
>>> +{
>>> +	return vmcs_read64(GUEST_IA32_DEBUGCTL);
>>> +}
>>> +
>>> +static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
>>> +{
>>> +	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
>>> +}
>> IMO,  it seems unnecessary to add these 2  wrappers since the original code
>> is already intuitive enough and simple. But if you want, please add
>> "inline" before these 2 wrappers.
> The __vmx_set_guest_debugctl in the next patch will store the written value in
> a field, this is why I did it this way.
>
> The vmx_get_guest_debugctl will read this value instead, also in the next patch.
>
> I thought it would be cleaner to first introduce the trivial wrappers and then
> extend them.
>
>>
>>> +
>>> +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
>> Since most of code in this function checks guest debugctl, better to rename
>> it to "vmx_check_and_set_guest_debugctl".
> I don't mind doing so.
>
>>
>>> +{
>>> +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
>>> +
>>> +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
>>> +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
>>> +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
>>> +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
>> Add space around above 3 "|".
> I copied this code "as is" from the wrmsr code. I can add this though.
>
> Best regards,
> 	Maxim Levitsky
>
>>
>>> +	}
>>> +
>>> +	if (invalid)
>>> +		return false;
>>> +
>>> +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
>>> +					VM_EXIT_SAVE_DEBUG_CONTROLS))
>>> +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
>>> +
>>> +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
>>> +	    (data & DEBUGCTLMSR_LBR))
>>> +		intel_pmu_create_guest_lbr_event(vcpu);
>>> +
>>> +	__vmx_set_guest_debugctl(vcpu, data);
>>> +	return true;
>>> +}
>>> +
>>>  /*
>>>   * Writes msr value into the appropriate "register".
>>>   * Returns 0 on success, non-0 otherwise.
>>> @@ -2263,26 +2298,9 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>  		vmcs_writel(GUEST_SYSENTER_ESP, data);
>>>  		break;
>>>  	case MSR_IA32_DEBUGCTLMSR: {
>>> -		u64 invalid;
>>> -
>>> -		invalid = data & ~vmx_get_supported_debugctl(vcpu, msr_info->host_initiated);
>>> -		if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
>>> -			kvm_pr_unimpl_wrmsr(vcpu, msr_index, data);
>>> -			data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
>>> -			invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
>>> -		}
>>> -
>>> -		if (invalid)
>>> +		if (!vmx_set_guest_debugctl(vcpu, data, msr_info->host_initiated))
>>>  			return 1;
>>>  
>>> -		if (is_guest_mode(vcpu) && get_vmcs12(vcpu)->vm_exit_controls &
>>> -						VM_EXIT_SAVE_DEBUG_CONTROLS)
>>> -			get_vmcs12(vcpu)->guest_ia32_debugctl = data;
>>> -
>>> -		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
>>> -		if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
>>> -		    (data & DEBUGCTLMSR_LBR))
>>> -			intel_pmu_create_guest_lbr_event(vcpu);
>>>  		return 0;
>>>  	}
>>>  	case MSR_IA32_BNDCFGS:
>>> @@ -4795,7 +4813,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>>  	vmcs_write32(GUEST_SYSENTER_CS, 0);
>>>  	vmcs_writel(GUEST_SYSENTER_ESP, 0);
>>>  	vmcs_writel(GUEST_SYSENTER_EIP, 0);
>>> -	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
>>> +	__vmx_set_guest_debugctl(&vmx->vcpu, 0);
>>>  
>>>  	if (cpu_has_vmx_tpr_shadow()) {
>>>  		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
>>> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>>> index 6d1e40ecc024..8ac46fb47abd 100644
>>> --- a/arch/x86/kvm/vmx/vmx.h
>>> +++ b/arch/x86/kvm/vmx/vmx.h
>>> @@ -404,6 +404,9 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
>>>  
>>>  gva_t vmx_get_untagged_addr(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
>>>  
>>> +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 value, bool host_initiated);
>>> +u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu);
>>> +
>>>  static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
>>>  					     int type, bool value)
>>>  {

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-05-01 20:53       ` mlevitsk
@ 2025-05-07  5:27         ` Mi, Dapeng
  2025-05-07 14:31           ` mlevitsk
  2025-05-07 23:03         ` Sean Christopherson
  1 sibling, 1 reply; 21+ messages in thread
From: Mi, Dapeng @ 2025-05-07  5:27 UTC (permalink / raw)
  To: mlevitsk, Sean Christopherson
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin


On 5/2/2025 4:53 AM, mlevitsk@redhat.com wrote:
> On Thu, 2025-05-01 at 16:41 -0400, mlevitsk@redhat.com wrote:
>> On Tue, 2025-04-22 at 16:41 -0700, Sean Christopherson wrote:
>>> On Tue, Apr 15, 2025, Maxim Levitsky wrote:
>>>> Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest
>>>> GUEST_IA32_DEBUGCTL without the guest seeing this value.
>>>>
>>>> Note that in the future we might allow the guest to set this bit as well,
>>>> when we implement PMU freezing on VM own, virtual SMM entry.
>>>>
>>>> Since the value of the host DEBUGCTL can in theory change between VM runs,
>>>> check if has changed, and if yes, then reload the GUEST_IA32_DEBUGCTL with
>>>> the new value of the host portion of it (currently only the
>>>> DEBUGCTLMSR_FREEZE_IN_SMM bit)
>>> No, it can't.  DEBUGCTLMSR_FREEZE_IN_SMM can be toggled via IPI callback, but
>>> IRQs are disabled for the entirety of the inner run loop.  And if I'm somehow
>>> wrong, this change movement absolutely belongs in a separate patch.
>
> Hi,
>
> You are right here - reading MSR_IA32_DEBUGCTLMSR in the inner loop is a performance
> regression.
>
>
> Any ideas on how to solve this then? Since currently its the common code that
> reads the current value of the MSR_IA32_DEBUGCTLMSR and it doesn't leave any indication
> about if it changed I can do either
>
> 1. store old value as well, something like 'vcpu->arch.host_debugctl_old' Ugly IMHO.
>
> 2. add DEBUG_CTL to the set of the 'dirty' registers, e.g add new bit for kvm_register_mark_dirty
> It looks a bit overkill to me
>
> 3. Add new x86 callback for something like .sync_debugctl(). I vote for this option.
>
> What do you think/prefer?

Hmm, not sure if I missed something, but why move the read of the host
debugctl MSR from its original place into the inner loop? In the original code,
interrupts are already disabled before the host debugctl is read, so I suppose
the host debugctl won't change after it has been read?


>
> Best regards,
> 	Maxim Levitsky
>
>>>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>>>> ---
>>>>  arch/x86/kvm/svm/svm.c |  2 ++
>>>>  arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++++-
>>>>  arch/x86/kvm/x86.c     |  2 --
>>>>  3 files changed, 29 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>>> index cc1c721ba067..fda0660236d8 100644
>>>> --- a/arch/x86/kvm/svm/svm.c
>>>> +++ b/arch/x86/kvm/svm/svm.c
>>>> @@ -4271,6 +4271,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
>>>>  	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
>>>>  	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
>>>>  
>>>> +	vcpu->arch.host_debugctl = get_debugctlmsr();
>>>> +
>>>>  	/*
>>>>  	 * Disable singlestep if we're injecting an interrupt/exception.
>>>>  	 * We don't want our modified rflags to be pushed on the stack where
>>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>>> index c9208a4acda4..e0bc31598d60 100644
>>>> --- a/arch/x86/kvm/vmx/vmx.c
>>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>>> @@ -2194,6 +2194,17 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
>>>>  	return debugctl;
>>>>  }
>>>>  
>>>> +static u64 vmx_get_host_preserved_debugctl(struct kvm_vcpu *vcpu)
>>> No, just open code handling DEBUGCTLMSR_FREEZE_IN_SMM, or make it a #define.
>>> I'm not remotely convinced that we'll ever want to emulate DEBUGCTLMSR_FREEZE_IN_SMM,
>>> and trying to plan for that possibility and adds complexity for no immediate value.
>> Hi,
>>
>> The problem here is a bit different: we indeed are very unlikely to emulate the
>> DEBUGCTLMSR_FREEZE_IN_SMM but however, when I wrote this patch I was sure that this bit is 
>> mandatory with PMU version of 2 or more,  but looks like it is optional after all:
>>
>> "
>> Note that system software must check if the processor supports the IA32_DEBUGCTL.FREEZE_WHILE_SMM
>> control bit. IA32_DEBUGCTL.FREEZE_WHILE_SMM is supported if IA32_PERF_CAPABIL-
>> ITIES.FREEZE_WHILE_SMM[Bit 12] is reporting 1. See Section 20.8 for details of detecting the presence of
>> IA32_PERF_CAPABILITIES MSR."
>>
>> KVM indeed doesn't set the bit 12 of IA32_PERF_CAPABILITIES.
>>
>> However, note that the Linux kernel silently sets this bit without checking the aforementioned capability 
>> bit and ends up with a #GP exception, which it silently ignores.... (I checked this with a trace...)
>>
>> This led me to believe that this bit should be unconditionally supported,
>> meaning that KVM should at least fake setting it without triggering a #GP.
>>
>> Since that is not the case, I can revert to the simpler model of exclusively using GUEST_IA32_DEBUGCTL 
>> while hiding the bit from the guest, however I do vote to keep the guest/host separation.
>>
>>>> +{
>>>> +	/*
>>>> +	 * Bits of host's DEBUGCTL that we should preserve while the guest is
>>>> +	 * running.
>>>> +	 *
>>>> +	 * Some of those bits might still be emulated for the guest own use.
>>>> +	 */
>>>> +	return DEBUGCTLMSR_FREEZE_IN_SMM;
>>>>
>>>>  u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
>>>>  {
>>>>  	return to_vmx(vcpu)->msr_ia32_debugctl;
>>>> @@ -2202,9 +2213,11 @@ u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
>>>>  static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
>>>>  {
>>>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>> +	u64 host_mask = vmx_get_host_preserved_debugctl(vcpu);
>>>>  
>>>>  	vmx->msr_ia32_debugctl = data;
>>>> -	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
>>>> +	vmcs_write64(GUEST_IA32_DEBUGCTL,
>>>> +		     (vcpu->arch.host_debugctl & host_mask) | (data & ~host_mask));
>>>>  }
>>>>  
>>>>  bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
>>>> @@ -2232,6 +2245,7 @@ bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated
>>>>  	return true;
>>>>  }
>>>>  
>>>> +
>>> Spurious newline.
>>>
>>>>  /*
>>>>   * Writes msr value into the appropriate "register".
>>>>   * Returns 0 on success, non-0 otherwise.
>>>> @@ -7349,6 +7363,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>>>  {
>>>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>>  	unsigned long cr3, cr4;
>>>> +	u64 old_debugctl;
>>>>  
>>>>  	/* Record the guest's net vcpu time for enforced NMI injections. */
>>>>  	if (unlikely(!enable_vnmi &&
>>>> @@ -7379,6 +7394,17 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>>>  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
>>>>  	}
>>>>  
>>>> +	old_debugctl = vcpu->arch.host_debugctl;
>>>> +	vcpu->arch.host_debugctl = get_debugctlmsr();
>>>> +
>>>> +	/*
>>>> +	 * In case the host DEBUGCTL had changed since the last time we
>>>> +	 * read it, update the guest's GUEST_IA32_DEBUGCTL with
>>>> +	 * the host's bits.
>>>> +	 */
>>>> +	if (old_debugctl != vcpu->arch.host_debugctl)
>>> This can and should be optimized to only do an update if a host-preserved bit
>>> is toggled.
>> True, I will do this in the next version.
>>
>>>> +		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);
>>> I would rather have a helper that explicitly writes the VMCS field, not one that
>>> sets the guest value *and* writes the VMCS field.
>>> The usage in init_vmcs() doesn't need to write vmx->msr_ia32_debugctl because the
>>> vCPU is zero allocated, and this usage doesn't change vmx->msr_ia32_debugctl.
>>> So the only path that actually needs to modify vmx->msr_ia32_debugctl is
>>> vmx_set_guest_debugctl().
>>
>> But what about nested entry? nested entry pretty much sets the MSR to a value given by the guest.
>>
>> Also technically the intel_pmu_legacy_freezing_lbrs_on_pmi also changes the guest value by emulating what the real hardware does.
>>
>> Best regards,
>> 	Maxim Levitsky
>>
>>
>>>> +
>>>>  	/*
>>>>  	 * We did this in prepare_switch_to_guest, because it needs to
>>>>  	 * be within srcu_read_lock.
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 844e81ee1d96..05e866ed345d 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -11020,8 +11020,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>>>  		set_debugreg(0, 7);
>>>>  	}
>>>>  
>>>> -	vcpu->arch.host_debugctl = get_debugctlmsr();
>>>> -
>>>>  	guest_timing_enter_irqoff();
>>>>  
>>>>  	for (;;) {
>>>> -- 
>>>> 2.26.3
>>>>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-05-07  5:27         ` Mi, Dapeng
@ 2025-05-07 14:31           ` mlevitsk
  0 siblings, 0 replies; 21+ messages in thread
From: mlevitsk @ 2025-05-07 14:31 UTC (permalink / raw)
  To: Mi, Dapeng, Sean Christopherson
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Wed, 2025-05-07 at 13:27 +0800, Mi, Dapeng wrote:
> 
> On 5/2/2025 4:53 AM, mlevitsk@redhat.com wrote:
> > On Thu, 2025-05-01 at 16:41 -0400, mlevitsk@redhat.com wrote:
> > > On Tue, 2025-04-22 at 16:41 -0700, Sean Christopherson wrote:
> > > > On Tue, Apr 15, 2025, Maxim Levitsky wrote:
> > > > > Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest
> > > > > GUEST_IA32_DEBUGCTL without the guest seeing this value.
> > > > > 
> > > > > Note that in the future we might allow the guest to set this bit as well,
> > > > > when we implement PMU freezing on VM own, virtual SMM entry.
> > > > > 
> > > > > Since the value of the host DEBUGCTL can in theory change between VM runs,
> > > > > check if has changed, and if yes, then reload the GUEST_IA32_DEBUGCTL with
> > > > > the new value of the host portion of it (currently only the
> > > > > DEBUGCTLMSR_FREEZE_IN_SMM bit)
> > > > No, it can't.  DEBUGCTLMSR_FREEZE_IN_SMM can be toggled via IPI callback, but
> > > > IRQs are disabled for the entirety of the inner run loop.  And if I'm somehow
> > > > wrong, this change movement absolutely belongs in a separate patch.
> > 
> > Hi,
> > 
> > You are right here - reading MSR_IA32_DEBUGCTLMSR in the inner loop is a performance
> > regression.
> > 
> > 
> > Any ideas on how to solve this then? Since currently its the common code that
> > reads the current value of the MSR_IA32_DEBUGCTLMSR and it doesn't leave any indication
> > about if it changed I can do either
> > 
> > 1. store old value as well, something like 'vcpu->arch.host_debugctl_old' Ugly IMHO.
> > 
> > 2. add DEBUG_CTL to the set of the 'dirty' registers, e.g add new bit for kvm_register_mark_dirty
> > It looks a bit overkill to me
> > 
> > 3. Add new x86 callback for something like .sync_debugctl(). I vote for this option.
> > 
> > What do you think/prefer?
> 
> Hmm, not sure if I missed something, but why to move the reading host
> debug_ctrl MSR from the original place into inner loop? The interrupt has
> been disabled before reading host debug_ctrl for original code, suppose
> host debug_ctrl won't changed after reading it?

No, no, I propose to keep the call in the same place it is now,
just to wrap it with a callback which can compare the new value with the
previous value and then update the VMCS field if needed.
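
Roughly like this (sketch only; the name is made up and it relies on the cached
vmx->msr_ia32_debugctl from patch 2):

static void vmx_sync_host_debugctl(struct kvm_vcpu *vcpu)
{
	u64 old = vcpu->arch.host_debugctl;
	u64 cur = get_debugctlmsr();

	vcpu->arch.host_debugctl = cur;

	/* Refresh GUEST_IA32_DEBUGCTL only if a host-preserved bit toggled. */
	if ((old ^ cur) & DEBUGCTLMSR_FREEZE_IN_SMM)
		__vmx_set_guest_debugctl(vcpu, to_vmx(vcpu)->msr_ia32_debugctl);
}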


Best regards,
	Maxim Levitsky

> 
> 
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > > > ---
> > > > >  arch/x86/kvm/svm/svm.c |  2 ++
> > > > >  arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++++-
> > > > >  arch/x86/kvm/x86.c     |  2 --
> > > > >  3 files changed, 29 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > > > index cc1c721ba067..fda0660236d8 100644
> > > > > --- a/arch/x86/kvm/svm/svm.c
> > > > > +++ b/arch/x86/kvm/svm/svm.c
> > > > > @@ -4271,6 +4271,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
> > > > >  	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
> > > > >  	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
> > > > >  
> > > > > +	vcpu->arch.host_debugctl = get_debugctlmsr();
> > > > > +
> > > > >  	/*
> > > > >  	 * Disable singlestep if we're injecting an interrupt/exception.
> > > > >  	 * We don't want our modified rflags to be pushed on the stack where
> > > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > > index c9208a4acda4..e0bc31598d60 100644
> > > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > > @@ -2194,6 +2194,17 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
> > > > >  	return debugctl;
> > > > >  }
> > > > >  
> > > > > +static u64 vmx_get_host_preserved_debugctl(struct kvm_vcpu *vcpu)
> > > > No, just open code handling DEBUGCTLMSR_FREEZE_IN_SMM, or make it a #define.
> > > > I'm not remotely convinced that we'll ever want to emulate DEBUGCTLMSR_FREEZE_IN_SMM,
> > > > and trying to plan for that possibility and adds complexity for no immediate value.
> > > Hi,
> > > 
> > > The problem here is a bit different: we indeed are very unlikely to emulate the
> > > DEBUGCTLMSR_FREEZE_IN_SMM but however, when I wrote this patch I was sure that this bit is 
> > > mandatory with PMU version of 2 or more,  but looks like it is optional after all:
> > > 
> > > "
> > > Note that system software must check if the processor supports the IA32_DEBUGCTL.FREEZE_WHILE_SMM
> > > control bit. IA32_DEBUGCTL.FREEZE_WHILE_SMM is supported if IA32_PERF_CAPABIL-
> > > ITIES.FREEZE_WHILE_SMM[Bit 12] is reporting 1. See Section 20.8 for details of detecting the presence of
> > > IA32_PERF_CAPABILITIES MSR."
> > > 
> > > KVM indeed doesn't set the bit 12 of IA32_PERF_CAPABILITIES.
> > > 
> > > However, note that the Linux kernel silently sets this bit without checking the aforementioned capability 
> > > bit and ends up with a #GP exception, which it silently ignores.... (I checked this with a trace...)
> > > 
> > > This led me to believe that this bit should be unconditionally supported,
> > > meaning that KVM should at least fake setting it without triggering a #GP.
> > > 
> > > Since that is not the case, I can revert to the simpler model of exclusively using GUEST_IA32_DEBUGCTL 
> > > while hiding the bit from the guest, however I do vote to keep the guest/host separation.
> > > 
> > > > > +{
> > > > > +	/*
> > > > > +	 * Bits of host's DEBUGCTL that we should preserve while the guest is
> > > > > +	 * running.
> > > > > +	 *
> > > > > +	 * Some of those bits might still be emulated for the guest own use.
> > > > > +	 */
> > > > > +	return DEBUGCTLMSR_FREEZE_IN_SMM;
> > > > > 
> > > > >  u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
> > > > >  {
> > > > >  	return to_vmx(vcpu)->msr_ia32_debugctl;
> > > > > @@ -2202,9 +2213,11 @@ u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu)
> > > > >  static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data)
> > > > >  {
> > > > >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > > > +	u64 host_mask = vmx_get_host_preserved_debugctl(vcpu);
> > > > >  
> > > > >  	vmx->msr_ia32_debugctl = data;
> > > > > -	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> > > > > +	vmcs_write64(GUEST_IA32_DEBUGCTL,
> > > > > +		     (vcpu->arch.host_debugctl & host_mask) | (data & ~host_mask));
> > > > >  }
> > > > >  
> > > > >  bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> > > > > @@ -2232,6 +2245,7 @@ bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated
> > > > >  	return true;
> > > > >  }
> > > > >  
> > > > > +
> > > > Spurious newline.
> > > > 
> > > > >  /*
> > > > >   * Writes msr value into the appropriate "register".
> > > > >   * Returns 0 on success, non-0 otherwise.
> > > > > @@ -7349,6 +7363,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > > > >  {
> > > > >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > > >  	unsigned long cr3, cr4;
> > > > > +	u64 old_debugctl;
> > > > >  
> > > > >  	/* Record the guest's net vcpu time for enforced NMI injections. */
> > > > >  	if (unlikely(!enable_vnmi &&
> > > > > @@ -7379,6 +7394,17 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > > > >  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
> > > > >  	}
> > > > >  
> > > > > +	old_debugctl = vcpu->arch.host_debugctl;
> > > > > +	vcpu->arch.host_debugctl = get_debugctlmsr();
> > > > > +
> > > > > +	/*
> > > > > +	 * In case the host DEBUGCTL had changed since the last time we
> > > > > +	 * read it, update the guest's GUEST_IA32_DEBUGCTL with
> > > > > +	 * the host's bits.
> > > > > +	 */
> > > > > +	if (old_debugctl != vcpu->arch.host_debugctl)
> > > > This can and should be optimized to only do an update if a host-preserved bit
> > > > is toggled.
> > > True, I will do this in the next version.
> > > 
> > > > > +		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);
> > > > I would rather have a helper that explicitly writes the VMCS field, not one that
> > > > sets the guest value *and* writes the VMCS field.
> > > > The usage in init_vmcs() doesn't need to write vmx->msr_ia32_debugctl because the
> > > > vCPU is zero allocated, and this usage doesn't change vmx->msr_ia32_debugctl.
> > > > So the only path that actually needs to modify vmx->msr_ia32_debugctl is
> > > > vmx_set_guest_debugctl().
> > > 
> > > But what about nested entry? nested entry pretty much sets the MSR to a value given by the guest.
> > > 
> > > Also technically the intel_pmu_legacy_freezing_lbrs_on_pmi also changes the guest value by emulating what the real hardware does.
> > > 
> > > Best regards,
> > > 	Maxim Levitsky
> > > 
> > > 
> > > > > +
> > > > >  	/*
> > > > >  	 * We did this in prepare_switch_to_guest, because it needs to
> > > > >  	 * be within srcu_read_lock.
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 844e81ee1d96..05e866ed345d 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -11020,8 +11020,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > > >  		set_debugreg(0, 7);
> > > > >  	}
> > > > >  
> > > > > -	vcpu->arch.host_debugctl = get_debugctlmsr();
> > > > > -
> > > > >  	guest_timing_enter_irqoff();
> > > > >  
> > > > >  	for (;;) {
> > > > > -- 
> > > > > 2.26.3
> > > > > 
> > 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-05-01 20:35     ` mlevitsk
@ 2025-05-07 17:18       ` Sean Christopherson
  2025-05-13  0:34         ` mlevitsk
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-05-07 17:18 UTC (permalink / raw)
  To: mlevitsk
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Thu, May 01, 2025, mlevitsk@redhat.com wrote:
> On Tue, 2025-04-22 at 16:33 -0700, Sean Christopherson wrote:
> > > @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> > >  	if (vmx->nested.nested_run_pending &&
> > >  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
> > >  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
> > > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> > > +		new_debugctl = vmcs12->guest_ia32_debugctl;
> > >  	} else {
> > >  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
> > > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
> > > +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
> > >  	}
> > > +
> > > +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
> > 
> > The consistency check belongs in nested_vmx_check_guest_state(), only needs to
> > check the VM_ENTRY_LOAD_DEBUG_CONTROLS case, and should be posted as a separate
> > patch.
> 
> I can move it there. Can you explain why though you want this? Is it because of the
> order of checks specified in the PRM?

To be consistent with how KVM checks guest state.  The two checks in prepare_vmcs02()
are special cases.  vmx_guest_state_valid() consumes a huge variety of state, and
so replicating all of its logic for vmcs12 isn't worth doing.  The check on the
kvm_set_msr() for guest_ia32_perf_global_ctrl exists purely so that KVM doesn't
simply ignore the return value.

And to a lesser degree, because KVM assumes that guest state has been sanitized
after nested_vmx_check_guest_state() is called.  Violating that risks introducing
bugs, e.g. consuming vmcs12->guest_ia32_debugctl before it's been vetted could
theoretically be problematic.
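
The check itself boils down to something like this in nested_vmx_check_guest_state()
(sketch; assumes vmx_get_supported_debugctl() is exposed to nested.c):

	if (CC((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) &&
	       (vmcs12->guest_ia32_debugctl &
		~vmx_get_supported_debugctl(vcpu, false))))
		return -EINVAL;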

> Currently GUEST_IA32_DEBUGCTL of the host is *written* in prepare_vmcs02. 
> Should I also move this write to nested_vmx_check_guest_state?

No.  nested_vmx_check_guest_state() verifies the incoming vmcs12 state,
prepare_vmcs02() merges the vmcs12 state with KVM's desires and fills vmcs02.

> Or should I write the value blindly in prepare_vmcs02 and then check the value
> of 'vmx->msr_ia32_debugctl' in nested_vmx_check_guest_state and fail if the value
> contains reserved bits? 

I don't follow.  nested_vmx_check_guest_state() is called before prepare_vmcs02().

> > > +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> > > +{
> > > +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
> > > +
> > > +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> > > +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
> > > +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > > +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > > +	}
> > > +
> > > +	if (invalid)
> > > +		return false;
> > > +
> > > +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
> > > +					VM_EXIT_SAVE_DEBUG_CONTROLS))
> > > +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> > > +
> > > +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> > > +	    (data & DEBUGCTLMSR_LBR))
> > > +		intel_pmu_create_guest_lbr_event(vcpu);
> > > +
> > > +	__vmx_set_guest_debugctl(vcpu, data);
> > > +	return true;
> > 
> > Return 0/-errno, not true/false.
> 
> There are plenty of functions in this file and KVM that return boolean.

That doesn't make them "right".  For helpers that are obvious predicates,
absolutely use a boolean return value.  The names for nested_vmx_check_eptp()
and vmx_control_verify() aren't very good, e.g. they should be
nested_vmx_is_valid_eptp() and vmx_is_valid_control(), but the intent is good.

But for flows like modifying guest state, KVM should return 0/-errno.

> e.g: 
> 
> static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
> static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
> static bool nested_evmcs_handle_vmclear(struct kvm_vcpu *vcpu, gpa_t vmptr)
> static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> 						 struct vmcs12 *vmcs12)

These two should return 0/-errno.

 
> static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
> static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)

Probably should return 0/-errno, but nested_get_vmcs12_pages() is a bit of a mess.

> ...
> 
> 
> I personally think that functions that emulate hardware should return boolean
> values or some hardware specific status code (e.g VMX failure code) because
> the real hardware never returns -EINVAL and such.

Real hardware absolutely "returns" granular error codes.  KVM even has informal
mappings between some of them, e.g. -EINVAL == #GP, -EFAULT == #PF, -EOPNOTSUPP == #UD,
BUG() == 3-strike #MC.

And hardware has many more ways to report errors to software. E.g. VMLAUNCH can
#UD, #GP(0), VM-Exit, VMfailInvalid, or VMFailValid with 30+ unique reasons.  #MC
has a crazy number of possible error encodings.  And so on and so forth.

Software visible error codes aside, comparing individual KVM functions to an
overall CPU is wildly misguided.  A more appropriate comparison would be between
a KVM function and the ucode for a single instruction/operation.  I highly, highly
doubt ucode flows are limited to binary yes/no outputs.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-05-01 20:53       ` mlevitsk
  2025-05-07  5:27         ` Mi, Dapeng
@ 2025-05-07 23:03         ` Sean Christopherson
  2025-05-08 13:35           ` Sean Christopherson
  1 sibling, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-05-07 23:03 UTC (permalink / raw)
  To: mlevitsk
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Thu, May 01, 2025, mlevitsk@redhat.com wrote:
> On Thu, 2025-05-01 at 16:41 -0400, mlevitsk@redhat.com wrote:
> > On Tue, 2025-04-22 at 16:41 -0700, Sean Christopherson wrote:
> > > On Tue, Apr 15, 2025, Maxim Levitsky wrote:
> > > > Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest
> > > > GUEST_IA32_DEBUGCTL without the guest seeing this value.
> > > > 
> > > > Note that in the future we might allow the guest to set this bit as well,
> > > > when we implement PMU freezing on the VM's own, virtual SMM entry.
> > > > 
> > > > Since the value of the host DEBUGCTL can in theory change between VM runs,
> > > > check if it has changed, and if yes, reload GUEST_IA32_DEBUGCTL with the
> > > > new value of the host portion of it (currently only the
> > > > DEBUGCTLMSR_FREEZE_IN_SMM bit).
> > > 
> > > No, it can't.  DEBUGCTLMSR_FREEZE_IN_SMM can be toggled via IPI callback, but
> > > IRQs are disabled for the entirety of the inner run loop.  And if I'm somehow
> > > wrong, this change/movement absolutely belongs in a separate patch.
> 
> 
> Hi,
> 
> You are right here - reading MSR_IA32_DEBUGCTLMSR in the inner loop is a
> performance regression.
> 
> Any ideas on how to solve this then?  Since currently it's the common code that
> reads the current value of MSR_IA32_DEBUGCTLMSR, and it doesn't leave any
> indication of whether it changed, I can do one of:
> 
> 1. Store the old value as well, something like 'vcpu->arch.host_debugctl_old'.  Ugly IMHO.
> 
> 2. Add DEBUG_CTL to the set of 'dirty' registers, e.g. add a new bit for kvm_register_mark_dirty().
>    Looks a bit like overkill to me.
> 
> 3. Add a new x86 callback, something like .sync_debugctl().  I vote for this option.
> 
> What do you think/prefer?

I was going to say #3 as well, but I think I have a better idea.

DR6 has a similar problem; the guest's value needs to be loaded into hardware,
but only somewhat rarely, and more importantly, never on a fastpath reentry.

Forced immediate exits also have a similar need: some control logic in common x86
needs to instruct kvm_x86_ops.vcpu_run() to do something.

Unless I've misread the DEBUGCTLMSR situation, in all cases, common x86 only needs
a single flag to tell vendor code to do something.  The payload for that action
is already available.

So rather than add a bunch of kvm_x86_ops hooks that are only called immediately
before kvm_x86_ops.vcpu_run(), expand @req_immediate_exit into a bitmap of flags
to communicate what work needs to be done, without having to resort to a field
in kvm_vcpu_arch that isn't actually persistent.
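
Roughly, what I have in mind (sketch only, the attached patches are the real thing):

--
enum kvm_x86_run_flags {
	KVM_RUN_FORCE_IMMEDIATE_EXIT	= BIT(0),
	KVM_RUN_LOAD_GUEST_DR6		= BIT(1),
};

	/* in vcpu_enter_guest(), just before handing off to vendor code: */
	u64 run_flags = 0;

	if (req_immediate_exit)
		run_flags |= KVM_RUN_FORCE_IMMEDIATE_EXIT;
	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT))
		run_flags |= KVM_RUN_LOAD_GUEST_DR6;

	exit_fastpath = kvm_x86_call(vcpu_run)(vcpu, run_flags);
--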

The attached patches are relatively lightly tested, but the DR6 tests from the
recent bug[*] pass, so hopefully they're correct?

The downside with this approach is that it would be difficult to backport to LTS
kernels, but given how long this has been a problem, I'm not super concerned about
optimizing for backports.

If they look ok, feel free to include them in the next version.  Or I can post
them separately if you want.

> > > > +		__vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl);
> > > 
> > > I would rather have a helper that explicitly writes the VMCS field, not one that
> > > sets the guest value *and* writes the VMCS field.
> > 
> > > 
> > > The usage in init_vmcs() doesn't need to write vmx->msr_ia32_debugctl because the
> > > vCPU is zero allocated, and this usage doesn't change vmx->msr_ia32_debugctl.
> > > So the only path that actually needs to modify vmx->msr_ia32_debugctl is
> > > vmx_set_guest_debugctl().
> > 
> > But what about nested entry?  Nested entry pretty much sets the MSR to a
> > value given by the guest.
> > 
> > Also, technically, intel_pmu_legacy_freezing_lbrs_on_pmi() also changes the
> > guest value by emulating what the real hardware does.

Drat, sorry, my feedback was way too terse.  What I was trying to say is that if
we cache the guest's msr_ia32_debugctl, then I would rather have this:

--
static void vmx_guest_debugctl_write(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	u64 val = vmx->msr_ia32_debugctl |
		  (vcpu->arch.host_debugctl & DEBUGCTLMSR_FREEZE_IN_SMM);

	vmcs_write64(GUEST_IA32_DEBUGCTL, val);
}

int vmx_set_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);

	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
	}

	if (invalid)
		return 1;

	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
					VM_EXIT_SAVE_DEBUG_CONTROLS))
		get_vmcs12(vcpu)->guest_ia32_debugctl = data;

	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
	    (data & DEBUGCTLMSR_LBR))
		intel_pmu_create_guest_lbr_event(vcpu);

	vmx->msr_ia32_debugctl = data;
	vmx_guest_debugctl_write(vcpu);
	return 0;
}
--

So that the path that refreshes vmcs.GUEST_IA32_DEBUGCTL on VM-Entry doesn't have
to feed in vmx->msr_ia32_debugctl, because the only value that is ever written to
hardware is vmx->msr_ia32_debugctl.

However, I'm not entirely convinced that we need to cache the guest value,
because toggling DEBUGCTLMSR_FREEZE_IN_SMM should be extremely rare.  So something
like this?

--
static void vmx_guest_debugctl_write(struct kvm_vcpu *vcpu, u64 val)
{
	val |= vcpu->arch.host_debugctl & DEBUGCTLMSR_FREEZE_IN_SMM;

	vmcs_write64(GUEST_IA32_DEBUGCTL, val);
}

int vmx_set_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
{
	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);

	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
	}

	if (invalid)
		return 1;

	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
					VM_EXIT_SAVE_DEBUG_CONTROLS))
		get_vmcs12(vcpu)->guest_ia32_debugctl = data;

	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
	    (data & DEBUGCTLMSR_LBR))
		intel_pmu_create_guest_lbr_event(vcpu);

	vmx_guest_debugctl_write(vcpu, data);
	return 0;
}
--

And then when DEBUGCTLMSR_FREEZE_IN_SMM changes:

	if (<is DEBUGCTLMSR_FREEZE_IN_SMM toggled>)
		vmx_guest_debugctl_write(vcpu, vmcs_read64(GUEST_IA32_DEBUGCTL) &
					       ~DEBUGCTLMSR_FREEZE_IN_SMM);

And the LBR crud doesn't need to call into the "full" vmx_set_debugctl() (or we
don't even need that helper?).
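
E.g. the legacy FREEZE_LBRS_ON_PMI emulation could simply do (untested sketch,
using the vmx_guest_debugctl_write() from above):

--
	/* in intel_pmu_legacy_freezing_lbrs_on_pmi(), roughly: */
	u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL);

	if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) {
		data &= ~DEBUGCTLMSR_LBR;
		vmx_guest_debugctl_write(vcpu, data);
	}
--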

Side topic, we really should be able to drop @host_initiated, because KVM's ABI
is effectively that CPUID must be set before MSRs, i.e. allowing the host to stuff
unsupported bits isn't necessary.  But that's a future problem.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-05-07 23:03         ` Sean Christopherson
@ 2025-05-08 13:35           ` Sean Christopherson
  2025-05-15  0:19             ` mlevitsk
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-05-08 13:35 UTC (permalink / raw)
  To: mlevitsk
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 2053 bytes --]

On Wed, May 07, 2025, Sean Christopherson wrote:
> On Thu, May 01, 2025, mlevitsk@redhat.com wrote:
> > Any ideas on how to solve this then?  Since currently it's the common code that
> > reads the current value of MSR_IA32_DEBUGCTLMSR, and it doesn't leave any
> > indication of whether it changed, I can do one of:
> > 
> > 1. Store the old value as well, something like 'vcpu->arch.host_debugctl_old'.  Ugly IMHO.
> > 
> > 2. Add DEBUG_CTL to the set of 'dirty' registers, e.g. add a new bit for kvm_register_mark_dirty().
> >    Looks a bit like overkill to me.
> > 
> > 3. Add a new x86 callback, something like .sync_debugctl().  I vote for this option.
> > 
> > What do you think/prefer?
> 
> I was going to say #3 as well, but I think I have a better idea.
> 
> DR6 has a similar problem; the guest's value needs to be loaded into hardware,
> but only somewhat rarely, and more importantly, never on a fastpath reentry.
> 
> Forced immediate exits also have a similar need: some control logic in common x86
> needs to instruct kvm_x86_ops.vcpu_run() to do something.
> 
> Unless I've misread the DEBUGCTLMSR situation, in all cases, common x86 only needs
> a single flag to tell vendor code to do something.  The payload for that action
> is already available.
> 
> So rather than add a bunch of kvm_x86_ops hooks that are only called immediately
> before kvm_x86_ops.vcpu_run(), expand @req_immediate_exit into a bitmap of flags
> to communicate what work needs to be done, without having to resort to a field
> in kvm_vcpu_arch that isn't actually persistent.
> 
> The attached patches are relatively lightly tested, but the DR6 tests from the
> recent bug[*] pass, so hopefully they're correct?
> 
> The downside with this approach is that it would be difficult to backport to LTS
> kernels, but given how long this has been a problem, I'm not super concerned about
> optimizing for backports.
> 
> If they look ok, feel free to include them in the next version.  Or I can post
> them separately if you want.

And of course I forgot to attach the patches...

[-- Attachment #2: 0001-KVM-x86-Convert-vcpu_run-s-immediate-exit-param-into.patch --]
[-- Type: text/x-diff, Size: 6711 bytes --]

From edaa5525a228bc8711c4cefca391b436e6b18803 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 7 May 2025 13:46:03 -0700
Subject: [PATCH 1/2] KVM: x86: Convert vcpu_run()'s immediate exit param into
 a generic bitmap

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  6 +++++-
 arch/x86/kvm/svm/svm.c          |  4 ++--
 arch/x86/kvm/vmx/main.c         |  6 +++---
 arch/x86/kvm/vmx/tdx.c          |  3 ++-
 arch/x86/kvm/vmx/vmx.c          |  3 ++-
 arch/x86/kvm/vmx/x86_ops.h      |  4 ++--
 arch/x86/kvm/x86.c              | 11 ++++++++---
 7 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4c27f213ea55..62137a777f33 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1673,6 +1673,10 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
 	return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
 }
 
+enum kvm_x86_run_flags {
+	KVM_RUN_FORCE_IMMEDIATE_EXIT	= BIT(0),
+};
+
 struct kvm_x86_ops {
 	const char *name;
 
@@ -1754,7 +1758,7 @@ struct kvm_x86_ops {
 
 	int (*vcpu_pre_run)(struct kvm_vcpu *vcpu);
 	enum exit_fastpath_completion (*vcpu_run)(struct kvm_vcpu *vcpu,
-						  bool force_immediate_exit);
+						  u64 run_flags);
 	int (*handle_exit)(struct kvm_vcpu *vcpu,
 		enum exit_fastpath_completion exit_fastpath);
 	int (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5f8c571cbe7c..79b14fe4e3d8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4311,9 +4311,9 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in
 	guest_state_exit_irqoff();
 }
 
-static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
-					  bool force_immediate_exit)
+static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 {
+	bool force_immediate_exit = run_flags & KVM_RUN_FORCE_IMMEDIATE_EXIT;
 	struct vcpu_svm *svm = to_svm(vcpu);
 	bool spec_ctrl_intercepted = msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL);
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..fef3e3803707 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -175,12 +175,12 @@ static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
 	return vmx_vcpu_pre_run(vcpu);
 }
 
-static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 {
 	if (is_td_vcpu(vcpu))
-		return tdx_vcpu_run(vcpu, force_immediate_exit);
+		return tdx_vcpu_run(vcpu, run_flags);
 
-	return vmx_vcpu_run(vcpu, force_immediate_exit);
+	return vmx_vcpu_run(vcpu, run_flags);
 }
 
 static int vt_handle_exit(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..7dbfad28debc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1020,8 +1020,9 @@ static void tdx_load_host_xsave_state(struct kvm_vcpu *vcpu)
 				DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI | \
 				DEBUGCTLMSR_FREEZE_IN_SMM)
 
-fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 {
+	bool force_immediate_exit = run_flags & KVM_RUN_FORCE_IMMEDIATE_EXIT;
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	struct vcpu_vt *vt = to_vt(vcpu);
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9ff00ae9f05a..e66f5ffa8716 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7317,8 +7317,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	guest_state_exit_irqoff();
 }
 
-fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 {
+	bool force_immediate_exit = run_flags & KVM_RUN_FORCE_IMMEDIATE_EXIT;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long cr3, cr4;
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b4596f651232..0b4f5c5558d0 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -21,7 +21,7 @@ void vmx_vm_destroy(struct kvm *kvm);
 int vmx_vcpu_precreate(struct kvm *kvm);
 int vmx_vcpu_create(struct kvm_vcpu *vcpu);
 int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
-fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags);
 void vmx_vcpu_free(struct kvm_vcpu *vcpu);
 void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
@@ -133,7 +133,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
-fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags);
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 75c0a934556d..d2ad83621595 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10777,6 +10777,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		dm_request_for_irq_injection(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu);
 	fastpath_t exit_fastpath;
+	u64 run_flags;
 
 	bool req_immediate_exit = false;
 
@@ -11021,8 +11022,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		goto cancel_injection;
 	}
 
-	if (req_immediate_exit)
+	run_flags = 0;
+	if (req_immediate_exit) {
+		run_flags |= KVM_RUN_FORCE_IMMEDIATE_EXIT;
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
+	}
 
 	fpregs_assert_state_consistent();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
@@ -11059,8 +11063,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		WARN_ON_ONCE((kvm_vcpu_apicv_activated(vcpu) != kvm_vcpu_apicv_active(vcpu)) &&
 			     (kvm_get_apic_mode(vcpu) != LAPIC_MODE_DISABLED));
 
-		exit_fastpath = kvm_x86_call(vcpu_run)(vcpu,
-						       req_immediate_exit);
+		exit_fastpath = kvm_x86_call(vcpu_run)(vcpu, run_flags);
 		if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
 			break;
 
@@ -11072,6 +11075,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			break;
 		}
 
+		run_flags = 0;
+
 		/* Note, VM-Exits that go down the "slow" path are accounted below. */
 		++vcpu->stat.exits;
 	}

base-commit: 94da2b969670d100730b5537f20523e49e989920
-- 
2.49.0.1015.ga840276032-goog


[-- Attachment #3: 0002-KVM-x86-Drop-kvm_x86_ops.set_dr6-in-favor-of-a-new-K.patch --]
[-- Type: text/x-diff, Size: 5676 bytes --]

From fdb106317c1a743f79014157c13728ea021ad0b1 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 7 May 2025 14:00:39 -0700
Subject: [PATCH 2/2] KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new
 KVM_RUN flag

Instruct vendor code to load the guest's DR6 into hardware via a new
KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to
load vcpu->arch.dr6 into hardware when DR6 can be read/written directly
by the guest.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 -
 arch/x86/include/asm/kvm_host.h    |  2 +-
 arch/x86/kvm/svm/svm.c             | 10 ++++++----
 arch/x86/kvm/vmx/main.c            |  9 ---------
 arch/x86/kvm/vmx/vmx.c             |  9 +++------
 arch/x86/kvm/x86.c                 |  2 +-
 6 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8d50e3e0a19b..9e0c37ea267e 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -49,7 +49,6 @@ KVM_X86_OP(set_idt)
 KVM_X86_OP(get_gdt)
 KVM_X86_OP(set_gdt)
 KVM_X86_OP(sync_dirty_debug_regs)
-KVM_X86_OP(set_dr6)
 KVM_X86_OP(set_dr7)
 KVM_X86_OP(cache_reg)
 KVM_X86_OP(get_rflags)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 62137a777f33..18be7835fb73 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1675,6 +1675,7 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
 
 enum kvm_x86_run_flags {
 	KVM_RUN_FORCE_IMMEDIATE_EXIT	= BIT(0),
+	KVM_RUN_LOAD_GUEST_DR6		= BIT(1),
 };
 
 struct kvm_x86_ops {
@@ -1727,7 +1728,6 @@ struct kvm_x86_ops {
 	void (*get_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu);
-	void (*set_dr6)(struct kvm_vcpu *vcpu, unsigned long value);
 	void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value);
 	void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
 	unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 79b14fe4e3d8..ffa10c448efa 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4360,10 +4360,13 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	svm_hv_update_vp_id(svm->vmcb, vcpu);
 
 	/*
-	 * Run with all-zero DR6 unless needed, so that we can get the exact cause
-	 * of a #DB.
+	 * Run with all-zero DR6 unless the guest can write DR6 freely, so that
+	 * KVM can get the exact cause of a #DB.  Note, loading guest DR6 from
+	 * KVM's snapshot is only necessary when DR accesses won't exit.
 	 */
-	if (likely(!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)))
+	if (unlikely(run_flags & KVM_RUN_LOAD_GUEST_DR6))
+		svm_set_dr6(vcpu, vcpu->arch.dr6);
+	else if (likely(!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)))
 		svm_set_dr6(vcpu, DR6_ACTIVE_LOW);
 
 	clgi();
@@ -5171,7 +5174,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.set_idt = svm_set_idt,
 	.get_gdt = svm_get_gdt,
 	.set_gdt = svm_set_gdt,
-	.set_dr6 = svm_set_dr6,
 	.set_dr7 = svm_set_dr7,
 	.sync_dirty_debug_regs = svm_sync_dirty_debug_regs,
 	.cache_reg = svm_cache_reg,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index fef3e3803707..c85cbce6d2f6 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -489,14 +489,6 @@ static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 	vmx_set_gdt(vcpu, dt);
 }
 
-static void vt_set_dr6(struct kvm_vcpu *vcpu, unsigned long val)
-{
-	if (is_td_vcpu(vcpu))
-		return;
-
-	vmx_set_dr6(vcpu, val);
-}
-
 static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	if (is_td_vcpu(vcpu))
@@ -943,7 +935,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_idt = vt_op(set_idt),
 	.get_gdt = vt_op(get_gdt),
 	.set_gdt = vt_op(set_gdt),
-	.set_dr6 = vt_op(set_dr6),
 	.set_dr7 = vt_op(set_dr7),
 	.sync_dirty_debug_regs = vt_op(sync_dirty_debug_regs),
 	.cache_reg = vt_op(cache_reg),
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e66f5ffa8716..49bf58ca9ffd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5604,12 +5604,6 @@ void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 	set_debugreg(DR6_RESERVED, 6);
 }
 
-void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned long val)
-{
-	lockdep_assert_irqs_disabled();
-	set_debugreg(vcpu->arch.dr6, 6);
-}
-
 void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	vmcs_writel(GUEST_DR7, val);
@@ -7364,6 +7358,9 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
 	vcpu->arch.regs_dirty = 0;
 
+	if (run_flags & KVM_RUN_LOAD_GUEST_DR6)
+		set_debugreg(vcpu->arch.dr6, 6);
+
 	/*
 	 * Refresh vmcs.HOST_CR3 if necessary.  This must be done immediately
 	 * prior to VM-Enter, as the kernel may load a new ASID (PCID) any time
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2ad83621595..d8ae62f49bd7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11044,7 +11044,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(vcpu->arch.eff_db[3], 3);
 		/* When KVM_DEBUGREG_WONT_EXIT, dr6 is accessible in guest. */
 		if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT))
-			kvm_x86_call(set_dr6)(vcpu, vcpu->arch.dr6);
+			run_flags |= KVM_RUN_LOAD_GUEST_DR6;
 	} else if (unlikely(hw_breakpoint_active())) {
 		set_debugreg(0, 7);
 	}
-- 
2.49.0.1015.ga840276032-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-05-07 17:18       ` Sean Christopherson
@ 2025-05-13  0:34         ` mlevitsk
  2025-05-14 14:28           ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: mlevitsk @ 2025-05-13  0:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Wed, 2025-05-07 at 10:18 -0700, Sean Christopherson wrote:
> On Thu, May 01, 2025, mlevitsk@redhat.com wrote:
> > On Tue, 2025-04-22 at 16:33 -0700, Sean Christopherson wrote:
> > > > @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> > > >  	if (vmx->nested.nested_run_pending &&
> > > >  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
> > > >  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
> > > > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> > > > +		new_debugctl = vmcs12->guest_ia32_debugctl;
> > > >  	} else {
> > > >  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
> > > > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
> > > > +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
> > > >  	}
> > > > +
> > > > +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
> > > 
> > > The consistency check belongs in nested_vmx_check_guest_state(), only needs to
> > > check the VM_ENTRY_LOAD_DEBUG_CONTROLS case, and should be posted as a separate
> > > patch.
> > 
> > I can move it there.  Can you explain why you want this, though?  Is it because of the
> > order of checks specified in the PRM?
> 
> To be consistent with how KVM checks guest state.  The two checks in prepare_vmcs02()
> are special cases.  vmx_guest_state_valid() consumes a huge variety of state, and
> so replicating all of its logic for vmcs12 isn't worth doing.  The check on the
> kvm_set_msr() for guest_ia32_perf_global_ctrl exists purely so that KVM doesn't
> simply ignore the return value.
> 
> And to a lesser degree, because KVM assumes that guest state has been sanitized
> after nested_vmx_check_guest_state() is called.  Violating that risks introducing
> bugs, e.g. consuming vmcs12->guest_ia32_debugctl before it's been vetted could
> theoretically be problematic.
> 
> > Currently GUEST_IA32_DEBUGCTL of the host is *written* in prepare_vmcs02. 
> > Should I also move this write to nested_vmx_check_guest_state?
> 
> No.  nested_vmx_check_guest_state() verifies the incoming vmcs12 state,
> prepare_vmcs02() merges the vmcs12 state with KVM's desires and fills vmcs02.
> 
> > Or should I write the value blindly in prepare_vmcs02 and then check the value
> > of 'vmx->msr_ia32_debugctl' in nested_vmx_check_guest_state and fail if the value
> > contains reserved bits? 
> 
> I don't follow.  nested_vmx_check_guest_state() is called before prepare_vmcs02().

My mistake - for some reason I thought that nested_vmx_check_guest_state() is called
from prepare_vmcs02().  Your explanation makes sense now.


> 
> > > > +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> > > > +{
> > > > +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
> > > > +
> > > > +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> > > > +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
> > > > +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > > > +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > > > +	}
> > > > +
> > > > +	if (invalid)
> > > > +		return false;
> > > > +
> > > > +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
> > > > +					VM_EXIT_SAVE_DEBUG_CONTROLS))
> > > > +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> > > > +
> > > > +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> > > > +	    (data & DEBUGCTLMSR_LBR))
> > > > +		intel_pmu_create_guest_lbr_event(vcpu);
> > > > +
> > > > +	__vmx_set_guest_debugctl(vcpu, data);
> > > > +	return true;
> > > 
> > > Return 0/-errno, not true/false.
> > 
> > There are plenty of functions in this file and KVM that return boolean.
> 
> That doesn't make them "right".  For helpers that are obvious predicates, then
> absolutely use a boolean return value.  The names for nested_vmx_check_eptp()
> and vmx_control_verify() aren't very good, e.g. they should be
> nested_vmx_is_valid_eptp() and vmx_is_valid_control(), but the intent is good.
> 
> But for flows like modifying guest state, KVM should return 0/-errno.
> 
> > e.g: 
> > 
> > static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
> > static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
> > static bool nested_evmcs_handle_vmclear(struct kvm_vcpu *vcpu, gpa_t vmptr)
> > static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > 						 struct vmcs12 *vmcs12)
> 
> These two should return 0/-errno.
> 
>  
> > static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
> > static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> 
> Probably should return 0/-errno, but nested_get_vmcs12_pages() is a bit of a mess.

I am not going to argue with you about this, let it be.

> 
> > ...
> > 
> > 
> > I personally think that functions that emulate hardware should return boolean
> > values or some hardware specific status code (e.g VMX failure code) because
> > the real hardware never returns -EINVAL and such.
> 
> Real hardware absolutely "returns" granular error codes.  KVM even has informal
> mappings between some of them, e.g. -EINVAL == #GP, -EFAULT == #PF, -EOPNOTSUPP == #UD,
> BUG() == 3-strike #MC.
> 
> And hardware has many more ways to report errors to software. E.g. VMLAUNCH can
> #UD, #GP(0), VM-Exit, VMfailInvalid, or VMFailValid with 30+ unique reasons.  #MC
> has a crazy number of possible error encodings.  And so on and so forth.
> 
> Software visible error codes aside, comparing individual KVM functions to an
> overall CPU is wildly misguided.  A more appropriate comparison would be between
> a KVM function and the ucode for a single instruction/operation.  I highly, highly
> doubt ucode flows are limited to binary yes/no outputs.

I don't think you understood my point - I just pointed out that real hardware will never return
things like -EINVAL.

I never claimed that real hardware doesn't return error codes - of course it does;
VMX alone can return something like 77 different error codes.

So my point was that functions that emulate hardware should return either a boolean, when
the hardware only accepts/rejects the action, or hardware-specific error codes, because I
think it's a bit confusing to map hardware error codes onto kernel error codes.

In the case of an MSR write, the hardware response is more or less boolean - the hardware
either accepts the write or raises #GP.


Yes, I understand that hardware can in theory also #UD, silently ignore the write, etc.,
so I am not going to argue about this, let it be.

AFAIK the KVM convention for MSR writes is that 1 means #GP, 0 means success, and a
negative value exits to userspace as a KVM internal error.  Not very developer friendly
IMHO; there is room for improvement here.

And I see that we now also have KVM_MSR_RET_UNSUPPORTED and KVM_MSR_RET_FILTERED.
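
Just to spell out the convention as I understand it (hypothetical handler, the
names are made up):

--
static int example_set_msr(struct kvm_vcpu *vcpu, u64 data)
{
	if (data & EXAMPLE_RESERVED_BITS)	/* hypothetical reserved-bit mask */
		return 1;			/* emulated as #GP in the guest */

	if (example_internal_failure(vcpu))	/* hypothetical KVM-internal failure */
		return -ENOMEM;			/* surfaces to userspace as an internal error */

	return 0;				/* write accepted */
}
--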

Thanks for the review,
Best regards,
	Maxim Levitsky

> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions
  2025-05-13  0:34         ` mlevitsk
@ 2025-05-14 14:28           ` Sean Christopherson
  0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-14 14:28 UTC (permalink / raw)
  To: mlevitsk
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Mon, May 12, 2025, mlevitsk@redhat.com wrote:
> On Wed, 2025-05-07 at 10:18 -0700, Sean Christopherson wrote:
> AFAIK the KVM convention for msr writes is that 1 is GP, 0 success, and
> negative value exits as a KVM internal error to userspace. Not very developer
> friendly IMHO, there is a room for improvement here.

Yeah, it's ugly.  You're definitely not the first person to complain about KVM's
error code shenanigans.  Unfortunately, disentangling everything and doing so in
a way that is maintainable in the long term would be quite tricky, and absurdly
invasive. :-/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode
  2025-05-08 13:35           ` Sean Christopherson
@ 2025-05-15  0:19             ` mlevitsk
  0 siblings, 0 replies; 21+ messages in thread
From: mlevitsk @ 2025-05-15  0:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Thomas Gleixner, Borislav Petkov, Paolo Bonzini, x86,
	Dave Hansen, Ingo Molnar, linux-kernel, H. Peter Anvin

On Thu, 2025-05-08 at 06:35 -0700, Sean Christopherson wrote:
> On Wed, May 07, 2025, Sean Christopherson wrote:
> > On Thu, May 01, 2025, mlevitsk@redhat.com wrote:
> > > Any ideas on how to solve this then?  Since currently it's the common code that
> > > reads the current value of MSR_IA32_DEBUGCTLMSR, and it doesn't leave any
> > > indication of whether it changed, I can do one of:
> > > 
> > > 1. Store the old value as well, something like 'vcpu->arch.host_debugctl_old'.  Ugly IMHO.
> > > 
> > > 2. Add DEBUG_CTL to the set of 'dirty' registers, e.g. add a new bit for kvm_register_mark_dirty().
> > >    Looks a bit like overkill to me.
> > > 
> > > 3. Add a new x86 callback, something like .sync_debugctl().  I vote for this option.
> > > 
> > > What do you think/prefer?
> > 
> > I was going to say #3 as well, but I think I have a better idea.
> > 
> > DR6 has a similar problem; the guest's value needs to be loaded into hardware,
> > but only somewhat rarely, and more importantly, never on a fastpath reentry.
> > 
> > Forced immediate exits also have a similar need: some control logic in common x86
> > needs to instruct kvm_x86_ops.vcpu_run() to do something.
> > 
> > Unless I've misread the DEBUGCTLMSR situation, in all cases, common x86 only needs
> > a single flag to tell vendor code to do something.  The payload for that action
> > is already available.
> > 
> > So rather than add a bunch of kvm_x86_ops hooks that are only called immediately
> > before kvm_x86_ops.vcpu_run(), expand @req_immediate_exit into a bitmap of flags
> > to communicate what work needs to be done, without having to resort to a field
> > in kvm_vcpu_arch that isn't actually persistent.
> > 
> > The attached patches are relatively lightly tested, but the DR6 tests from the
> > recent bug[*] pass, so hopefully they're correct?
> > 
> > The downside with this approach is that it would be difficult to backport to LTS
> > kernels, but given how long this has been a problem, I'm not super concerned about
> > optimizing for backports.
> > 
> > If they look ok, feel free to include them in the next version.  Or I can post
> > them separately if you want.
> 
> And of course I forgot to attach the patches...

There is one problem with this approach though: the common x86 code will still have to decide
whether to set the KVM_RUN_LOAD_DEBUGCTL flag.

Checking whether the DEBUGCTLMSR_FREEZE_IN_SMM bit of 'vcpu->arch.host_debugctl' has changed is
VMX-specific, because AMD doesn't have this bit, and in the future it might even use a different
bit at that position for a different purpose.

I can instead set KVM_RUN_LOAD_DEBUGCTL when any bit in DEBUGCTL changes, which should still be
rare; SVM code can then ignore KVM_RUN_LOAD_DEBUGCTL, while VMX code will reload the VMCS field.
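
Something along these lines in the common code (sketch, KVM_RUN_LOAD_DEBUGCTL being
the flag name I'd add):

--
	/* common x86, before calling vcpu_run(): */
	u64 debugctl = get_debugctlmsr();

	if (unlikely(debugctl != vcpu->arch.host_debugctl)) {
		vcpu->arch.host_debugctl = debugctl;
		run_flags |= KVM_RUN_LOAD_DEBUGCTL;
	}
--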

Is this OK?

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-05-15  0:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-16  0:25 [PATCH 0/3] KVM: x86: allow DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM passthrough Maxim Levitsky
2025-04-16  0:25 ` [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions Maxim Levitsky
2025-04-22 23:33   ` Sean Christopherson
2025-05-01 20:35     ` mlevitsk
2025-05-07 17:18       ` Sean Christopherson
2025-05-13  0:34         ` mlevitsk
2025-05-14 14:28           ` Sean Christopherson
2025-04-23  9:51   ` Mi, Dapeng
2025-05-01 20:34     ` mlevitsk
2025-05-07  5:17       ` Mi, Dapeng
2025-04-16  0:25 ` [PATCH 2/3] x86: KVM: VMX: cache guest written value of MSR_IA32_DEBUGCTL Maxim Levitsky
2025-04-16  0:25 ` [PATCH 3/3] x86: KVM: VMX: preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while in the guest mode Maxim Levitsky
2025-04-22 23:41   ` Sean Christopherson
2025-05-01 20:41     ` mlevitsk
2025-05-01 20:53       ` mlevitsk
2025-05-07  5:27         ` Mi, Dapeng
2025-05-07 14:31           ` mlevitsk
2025-05-07 23:03         ` Sean Christopherson
2025-05-08 13:35           ` Sean Christopherson
2025-05-15  0:19             ` mlevitsk
2025-04-23 10:10   ` Mi, Dapeng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).