* [PATCH v3 00/27] Enable FRED with KVM VMX
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
This patch set enables the Intel flexible return and event delivery
(FRED) architecture with KVM VMX to allow guests to utilize FRED.
The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:
1) Improve overall performance and response time by replacing event
delivery through the interrupt descriptor table (IDT event
delivery) and event return by the IRET instruction with lower
latency transitions.
2) Improve software robustness by ensuring that event delivery
establishes the full supervisor context and that event return
establishes the full user context.
The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is also used to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.
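As a rough illustration (a sketch only, loosely modeled on the host FRED
entry code in arch/x86/entry/entry_fred.c and simplified; the dispatch
details may differ), a FRED kernel dispatches on the event type that
hardware records in the pushed stack frame instead of walking IDT gates:

    /* Sketch of FRED ring-0 event dispatch; not part of this series. */
    __visible noinstr void fred_entry_from_user(struct pt_regs *regs)
    {
            unsigned long error_code = regs->orig_ax;

            /* The hardware-pushed frame identifies the event; no IDT walk. */
            switch (regs->fred_ss.type) {
            case EVENT_TYPE_EXTINT:
                    return fred_extint(regs);
            case EVENT_TYPE_NMI:
                    return fred_exc_nmi(regs);
            case EVENT_TYPE_HWEXC:
                    return fred_hwexc(regs, error_code);
            }

            fred_bad_type(regs, error_code);
            /* The asm stub then returns to ring 3 with a single ERETU. */
    }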
The Intel VMX architecture is extended to run FRED guests, and the major
changes are:
1) New VMCS fields for FRED context management, which includes two new
event data VMCS fields, eight new guest FRED context VMCS fields and
eight new host FRED context VMCS fields.
2) VMX nested-exception support for proper virtualization of stack
levels introduced with the FRED architecture.
Search for the latest FRED spec in most search engines with this search
pattern:
site:intel.com FRED (flexible return and event delivery) specification
The first 20 patches add FRED support to VMX, and the remaining 7
patches add FRED support to nested VMX.
Following is the link to v2 of this patch set:
https://lore.kernel.org/kvm/20240207172646.3981-1-xin3.li@intel.com/
Sean Christopherson (3):
KVM: x86: Use a dedicated flow for queueing re-injected exceptions
KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
Xin Li (21):
KVM: VMX: Add support for the secondary VM exit controls
KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
KVM: VMX: Disable FRED if FRED consistency checks fail
KVM: VMX: Initialize VMCS FRED fields
KVM: x86: Use KVM-governed feature framework to track "FRED enabled"
KVM: VMX: Set FRED MSR interception
KVM: VMX: Save/restore guest FRED RSP0
KVM: VMX: Add support for FRED context save/restore
KVM: x86: Add a helper to detect if FRED is enabled for a vCPU
KVM: VMX: Virtualize FRED event_data
KVM: VMX: Virtualize FRED nested exception tracking
KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED
KVM: VMX: Dump FRED context in dump_vmcs()
KVM: x86: Allow FRED/LKGS to be advertised to guests
KVM: x86: Allow WRMSRNS to be advertised to guests
KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup
KVM: nVMX: Add support for the secondary VM exit controls
KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
KVM: nVMX: Add FRED VMCS fields
KVM: nVMX: Add VMCS FRED states checking
KVM: nVMX: Allow VMX FRED controls
Xin Li (Intel) (3):
x86/cea: Export per CPU variable cea_exception_stacks
KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
KVM: nVMX: Add a prerequisite to existence of VMCS fields
Documentation/virt/kvm/x86/nested-vmx.rst | 19 ++
arch/x86/include/asm/kvm_host.h | 9 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 32 ++-
arch/x86/kvm/cpuid.c | 4 +-
arch/x86/kvm/governed_features.h | 1 +
arch/x86/kvm/kvm_cache_regs.h | 15 ++
arch/x86/kvm/svm/svm.c | 15 +-
arch/x86/kvm/vmx/capabilities.h | 17 +-
arch/x86/kvm/vmx/nested.c | 291 ++++++++++++++++----
arch/x86/kvm/vmx/nested.h | 8 +
arch/x86/kvm/vmx/nested_vmcs_fields.h | 25 ++
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmcs12.c | 19 ++
arch/x86/kvm/vmx/vmcs12.h | 38 +++
arch/x86/kvm/vmx/vmcs_shadow_fields.h | 37 ++-
arch/x86/kvm/vmx/vmx.c | 308 +++++++++++++++++++---
arch/x86/kvm/vmx/vmx.h | 15 +-
arch/x86/kvm/x86.c | 140 ++++++----
arch/x86/kvm/x86.h | 8 +-
arch/x86/mm/cpu_entry_area.c | 1 +
21 files changed, 846 insertions(+), 158 deletions(-)
create mode 100644 arch/x86/kvm/vmx/nested_vmcs_fields.h
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
--
2.46.2
* [PATCH v3 01/27] KVM: x86: Use a dedicated flow for queueing re-injected exceptions
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Sean Christopherson <seanjc@google.com>
Open code the filling of vcpu->arch.exception in kvm_requeue_exception()
instead of bouncing through kvm_multiple_exception(), as re-injection
doesn't actually share that much code with "normal" injection, e.g. the
VM-Exit interception check, payload delivery, and nested exception code
are all bypassed, as those flows only apply during initial injection.
When FRED comes along, the special casing will only get worse, as FRED
explicitly tracks nested exceptions and essentially delivers the payload
on the stack frame, i.e. re-injection will need more inputs, and normal
injection will have yet more code that needs to be bypassed when KVM is
re-injecting an exception.
No functional change intended.
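For reference, the conceptual split after this patch looks roughly like
the sketch below (illustrative only; the example_* helpers are
hypothetical):

    /* First delivery: all injection-time logic still applies. */
    static void example_queue(struct kvm_vcpu *vcpu, unsigned int vector)
    {
            if (example_l1_intercepts(vcpu, vector))    /* hypothetical */
                    example_queue_vmexit(vcpu, vector); /* hypothetical */
            else
                    kvm_queue_exception(vcpu, vector);  /* pending = true */
    }

    /* Re-injection: the event already passed those checks before the
     * VM-Exit that interrupted its delivery, so inject it verbatim. */
    static void example_requeue(struct kvm_vcpu *vcpu, unsigned int vector,
                                bool has_error_code, u32 error_code)
    {
            /* injected = true, no interception or payload processing */
            kvm_requeue_exception(vcpu, vector, has_error_code, error_code);
    }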
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/include/asm/kvm_host.h | 4 +-
arch/x86/kvm/svm/svm.c | 15 +++---
arch/x86/kvm/vmx/vmx.c | 16 +++---
arch/x86/kvm/x86.c | 89 ++++++++++++++++-----------------
4 files changed, 63 insertions(+), 61 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..43b08d12cb32 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2112,8 +2112,8 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu);
void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
-void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
-void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
+void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
+ bool has_error_code, u32 error_code);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
struct x86_exception *fault);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9df3e1e5ae81..d9e2568bcd54 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4112,20 +4112,23 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
vcpu->arch.nmi_injected = true;
svm->nmi_l1_to_l2 = nmi_l1_to_l2;
break;
- case SVM_EXITINTINFO_TYPE_EXEPT:
+ case SVM_EXITINTINFO_TYPE_EXEPT: {
+ u32 error_code = 0;
+
/*
* Never re-inject a #VC exception.
*/
if (vector == X86_TRAP_VC)
break;
- if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
- u32 err = svm->vmcb->control.exit_int_info_err;
- kvm_requeue_exception_e(vcpu, vector, err);
+ if (exitintinfo & SVM_EXITINTINFO_VALID_ERR)
+ error_code = svm->vmcb->control.exit_int_info_err;
- } else
- kvm_requeue_exception(vcpu, vector);
+ kvm_requeue_exception(vcpu, vector,
+ exitintinfo & SVM_EXITINTINFO_VALID_ERR,
+ error_code);
break;
+ }
case SVM_EXITINTINFO_TYPE_INTR:
kvm_queue_interrupt(vcpu, vector, false);
break;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1a4438358c5e..6a93f5edbc0d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7136,13 +7136,17 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
case INTR_TYPE_SOFT_EXCEPTION:
vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
fallthrough;
- case INTR_TYPE_HARD_EXCEPTION:
- if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
- u32 err = vmcs_read32(error_code_field);
- kvm_requeue_exception_e(vcpu, vector, err);
- } else
- kvm_requeue_exception(vcpu, vector);
+ case INTR_TYPE_HARD_EXCEPTION: {
+ u32 error_code = 0;
+
+ if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
+ error_code = vmcs_read32(error_code_field);
+
+ kvm_requeue_exception(vcpu, vector,
+ idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
+ error_code);
break;
+ }
case INTR_TYPE_SOFT_INTR:
vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
fallthrough;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..e8de9f4734a6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -833,9 +833,9 @@ static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vecto
ex->payload = payload;
}
-static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
- unsigned nr, bool has_error, u32 error_code,
- bool has_payload, unsigned long payload, bool reinject)
+static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
+ bool has_error, u32 error_code,
+ bool has_payload, unsigned long payload)
{
u32 prev_nr;
int class1, class2;
@@ -843,13 +843,10 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
kvm_make_request(KVM_REQ_EVENT, vcpu);
/*
- * If the exception is destined for L2 and isn't being reinjected,
- * morph it to a VM-Exit if L1 wants to intercept the exception. A
- * previously injected exception is not checked because it was checked
- * when it was original queued, and re-checking is incorrect if _L1_
- * injected the exception, in which case it's exempt from interception.
+ * If the exception is destined for L2, morph it to a VM-Exit if L1
+ * wants to intercept the exception.
*/
- if (!reinject && is_guest_mode(vcpu) &&
+ if (is_guest_mode(vcpu) &&
kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
has_payload, payload);
@@ -858,28 +855,9 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
queue:
- if (reinject) {
- /*
- * On VM-Entry, an exception can be pending if and only
- * if event injection was blocked by nested_run_pending.
- * In that case, however, vcpu_enter_guest() requests an
- * immediate exit, and the guest shouldn't proceed far
- * enough to need reinjection.
- */
- WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
- vcpu->arch.exception.injected = true;
- if (WARN_ON_ONCE(has_payload)) {
- /*
- * A reinjected event has already
- * delivered its payload.
- */
- has_payload = false;
- payload = 0;
- }
- } else {
- vcpu->arch.exception.pending = true;
- vcpu->arch.exception.injected = false;
- }
+ vcpu->arch.exception.pending = true;
+ vcpu->arch.exception.injected = false;
+
vcpu->arch.exception.has_error_code = has_error;
vcpu->arch.exception.vector = nr;
vcpu->arch.exception.error_code = error_code;
@@ -920,29 +898,52 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
{
- kvm_multiple_exception(vcpu, nr, false, 0, false, 0, false);
+ kvm_multiple_exception(vcpu, nr, false, 0, false, 0);
}
EXPORT_SYMBOL_GPL(kvm_queue_exception);
-void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr)
-{
- kvm_multiple_exception(vcpu, nr, false, 0, false, 0, true);
-}
-EXPORT_SYMBOL_GPL(kvm_requeue_exception);
void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr,
unsigned long payload)
{
- kvm_multiple_exception(vcpu, nr, false, 0, true, payload, false);
+ kvm_multiple_exception(vcpu, nr, false, 0, true, payload);
}
EXPORT_SYMBOL_GPL(kvm_queue_exception_p);
static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
u32 error_code, unsigned long payload)
{
- kvm_multiple_exception(vcpu, nr, true, error_code,
- true, payload, false);
+ kvm_multiple_exception(vcpu, nr, true, error_code, true, payload);
+}
+
+void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
+ bool has_error_code, u32 error_code)
+{
+
+ /*
+ * On VM-Entry, an exception can be pending if and only if event
+ * injection was blocked by nested_run_pending. In that case, however,
+ * vcpu_enter_guest() requests an immediate exit, and the guest
+ * shouldn't proceed far enough to need reinjection.
+ */
+ WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
+
+ /*
+ * Do not check for interception when injecting an event for L2, as the
+ * exception was checked for intercept when it was original queued, and
+ * re-checking is incorrect if _L1_ injected the exception, in which
+ * case it's exempt from interception.
+ */
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+ vcpu->arch.exception.injected = true;
+ vcpu->arch.exception.has_error_code = has_error_code;
+ vcpu->arch.exception.vector = nr;
+ vcpu->arch.exception.error_code = error_code;
+ vcpu->arch.exception.has_payload = false;
+ vcpu->arch.exception.payload = 0;
}
+EXPORT_SYMBOL_GPL(kvm_requeue_exception);
int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err)
{
@@ -1013,16 +1014,10 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu)
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
{
- kvm_multiple_exception(vcpu, nr, true, error_code, false, 0, false);
+ kvm_multiple_exception(vcpu, nr, true, error_code, false, 0);
}
EXPORT_SYMBOL_GPL(kvm_queue_exception_e);
-void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
-{
- kvm_multiple_exception(vcpu, nr, true, error_code, false, 0, true);
-}
-EXPORT_SYMBOL_GPL(kvm_requeue_exception_e);
-
/*
* Checks if cpl <= required_cpl; if true, return true. Otherwise queue
* a #GP and return false.
--
2.46.2
* [PATCH v3 02/27] KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Sean Christopherson <seanjc@google.com>
Don't update the guest's XFD_ERR MSR if CR0.TS is set; per the SDM,
XFD_ERR is not modified if CR0.TS=1:
  Device-not-available exceptions that are not due to XFD - those
  resulting from setting CR0.TS to 1 - do not modify the IA32_XFD_ERR
  MSR.
Although the ordering of the two checks is not explicitly stated in the
SDM, conceptually it makes sense that the CR0.TS check is done prior to
the XFD check, e.g. CR0.TS=1 blocks all SIMD state, whereas XFD blocks
only XTILE state.
Opportunistically update the comment to call out that XFD_ERR is updated
before the VM-Exit check occurs. Nothing in the SDM explicitly calls out
this behavior, but logically it must be the behavior, otherwise reading
XFD_ERR in handle_nm_fault_irqoff() would return stale data, i.e. the
to-be-delivered XFD_ERR value would need to be saved in EXIT_QUALIFICATION,
a la DR6 for #DB and CR2 for #PF, so that software could capture the guest
value.
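The architectural rules being modeled can be condensed as follows (pure
illustration, not kernel code):

    /* Does a #NM write IA32_XFD_ERR?  CR0.TS takes priority over XFD. */
    static bool nm_writes_xfd_err(bool cr0_ts, u64 xfd, u64 xfeatures_used)
    {
            if (cr0_ts)
                    return false;   /* CR0.TS #NM: XFD_ERR untouched */

            /*
             * XFD-induced #NM: XFD_ERR is written before the VM-Exit
             * interception check, so the irqoff #NM handler can (and
             * must) read the guest value straight from the MSR.
             */
            return !!(xfd & xfeatures_used);
    }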
Fixes: ec5be88ab29f ("kvm: x86: Intercept #NM for saving IA32_XFD_ERR")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/vmx.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6a93f5edbc0d..3f6257d88ded 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6976,16 +6976,16 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
* MSR value is not clobbered by the host activity before the guest
* has chance to consume it.
*
- * Do not blindly read xfd_err here, since this exception might
- * be caused by L1 interception on a platform which doesn't
- * support xfd at all.
+ * Update the guest's XFD_ERR if and only if XFD is enabled, as the #NM
+ * interception may have been caused by L1 interception. Per the SDM,
+ * XFD_ERR is not modified if CR0.TS=1.
*
- * Do it conditionally upon guest_fpu::xfd. xfd_err matters
- * only when xfd contains a non-zero value.
- *
- * Queuing exception is done in vmx_handle_exit. See comment there.
+ * Note, XFD_ERR is updated _before_ the #NM interception check, i.e.
+ * unlike CR2 and DR6, the value is not a payload that is attached to
+ * the #NM exception.
*/
- if (vcpu->arch.guest_fpu.fpstate->xfd)
+ if (vcpu->arch.guest_fpu.fpstate->xfd &&
+ !kvm_is_cr0_bit_set(vcpu, X86_CR0_TS))
rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}
--
2.46.2
* [PATCH v3 03/27] KVM: VMX: Add support for the secondary VM exit controls
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Always load the secondary VM exit controls to prepare for FRED enabling.
Extend the VM exit/entry consistency check framework to accommodate the
newly added secondary VM exit controls.
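For example, the FRED triplet added in the next patch encodes the
following invariant (a sketch of the intended semantics):

    entry_control    = VM_ENTRY_LOAD_IA32_FRED;
    exit_control     = VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
    exit_2nd_control = SECONDARY_VM_EXIT_SAVE_IA32_FRED |
                       SECONDARY_VM_EXIT_LOAD_IA32_FRED;

    /*
     * The triplet is consistent iff the entry control and both secondary
     * exit controls are usable together; if any piece is missing, all of
     * them are cleared, so KVM never loads guest FRED state on VM entry
     * without also saving it and loading host state on VM exit.
     */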
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
---
Change since v2:
* Do FRED controls consistency checks in the VM exit/entry consistency
check framework (Sean Christopherson).
Change since v1:
* Always load the secondary VM exit controls (Sean Christopherson).
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 3 ++
arch/x86/kvm/vmx/capabilities.h | 9 +++++-
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmx.c | 51 ++++++++++++++++++++++++++------
arch/x86/kvm/vmx/vmx.h | 7 ++++-
6 files changed, 61 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..95b6e2749256 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1178,6 +1178,7 @@
#define MSR_IA32_VMX_TRUE_ENTRY_CTLS 0x00000490
#define MSR_IA32_VMX_VMFUNC 0x00000491
#define MSR_IA32_VMX_PROCBASED_CTLS3 0x00000492
+#define MSR_IA32_VMX_EXIT_CTLS2 0x00000493
/* Resctrl MSRs: */
/* - Intel: */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f7fd4369b821..57a37ea06a17 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -106,6 +106,7 @@
#define VM_EXIT_CLEAR_BNDCFGS 0x00800000
#define VM_EXIT_PT_CONCEAL_PIP 0x01000000
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
+#define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS 0x80000000
#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff
@@ -258,6 +259,8 @@ enum vmcs_field {
TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
PID_POINTER_TABLE = 0x00002042,
PID_POINTER_TABLE_HIGH = 0x00002043,
+ SECONDARY_VM_EXIT_CONTROLS = 0x00002044,
+ SECONDARY_VM_EXIT_CONTROLS_HIGH = 0x00002045,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x00002401,
VMCS_LINK_POINTER = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index cb6588238f46..e8f3ad0f79ee 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -59,8 +59,9 @@ struct vmcs_config {
u32 cpu_based_exec_ctrl;
u32 cpu_based_2nd_exec_ctrl;
u64 cpu_based_3rd_exec_ctrl;
- u32 vmexit_ctrl;
u32 vmentry_ctrl;
+ u32 vmexit_ctrl;
+ u64 secondary_vmexit_ctrl;
u64 misc;
struct nested_vmx_msrs nested;
};
@@ -136,6 +137,12 @@ static inline bool cpu_has_tertiary_exec_ctrls(void)
CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
}
+static inline bool cpu_has_secondary_vmexit_ctrls(void)
+{
+ return vmcs_config.vmexit_ctrl &
+ VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
+}
+
static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
{
return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index b25625314658..ae152a9d1963 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -47,6 +47,7 @@ struct vmcs_host_state {
struct vmcs_controls_shadow {
u32 vm_entry;
u32 vm_exit;
+ u64 secondary_vm_exit;
u32 pin;
u32 exec;
u32 secondary_exec;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3f6257d88ded..ec548c75c3ef 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2606,6 +2606,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
u32 _cpu_based_2nd_exec_control = 0;
u64 _cpu_based_3rd_exec_control = 0;
u32 _vmexit_control = 0;
+ u64 _secondary_vmexit_control = 0;
u32 _vmentry_control = 0;
u64 basic_msr;
u64 misc_msr;
@@ -2619,7 +2620,8 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
struct {
u32 entry_control;
u32 exit_control;
- } const vmcs_entry_exit_pairs[] = {
+ u64 exit_2nd_control;
+ } const vmcs_entry_exit_triplets[] = {
{ VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL },
{ VM_ENTRY_LOAD_IA32_PAT, VM_EXIT_LOAD_IA32_PAT },
{ VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
@@ -2713,21 +2715,43 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
&_vmentry_control))
return -EIO;
- for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit_pairs); i++) {
- u32 n_ctrl = vmcs_entry_exit_pairs[i].entry_control;
- u32 x_ctrl = vmcs_entry_exit_pairs[i].exit_control;
-
- if (!(_vmentry_control & n_ctrl) == !(_vmexit_control & x_ctrl))
+ if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+ _secondary_vmexit_control =
+ adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
+ MSR_IA32_VMX_EXIT_CTLS2);
+
+ for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit_triplets); i++) {
+ u32 n_ctrl = vmcs_entry_exit_triplets[i].entry_control;
+ u32 x_ctrl = vmcs_entry_exit_triplets[i].exit_control;
+ u64 x_ctrl_2 = vmcs_entry_exit_triplets[i].exit_2nd_control;
+ bool has_n = n_ctrl && ((_vmentry_control & n_ctrl) == n_ctrl);
+ bool has_x = x_ctrl && ((_vmexit_control & x_ctrl) == x_ctrl);
+ bool has_x_2 = x_ctrl_2 && ((_secondary_vmexit_control & x_ctrl_2) == x_ctrl_2);
+
+ if (x_ctrl_2) {
+ /* Only the activate-secondary-controls exit bit is set */
+ if ((_vmexit_control & x_ctrl) == VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
+ if (has_n == has_x_2)
+ continue;
+ } else {
+ /* Otherwise the feature must be absent from all three controls */
+ if (!has_n && !has_x && !has_x_2)
+ continue;
+ }
+ } else if (has_n == has_x) {
continue;
+ }
- pr_warn_once("Inconsistent VM-Entry/VM-Exit pair, entry = %x, exit = %x\n",
- _vmentry_control & n_ctrl, _vmexit_control & x_ctrl);
+ pr_warn_once("Inconsistent VM-Entry/VM-Exit triplet, entry = %x, exit = %x, secondary_exit = %llx\n",
+ _vmentry_control & n_ctrl, _vmexit_control & x_ctrl,
+ _secondary_vmexit_control & x_ctrl_2);
if (error_on_inconsistent_vmcs_config)
return -EIO;
_vmentry_control &= ~n_ctrl;
_vmexit_control &= ~x_ctrl;
+ _secondary_vmexit_control &= ~x_ctrl_2;
}
rdmsrl(MSR_IA32_VMX_BASIC, basic_msr);
@@ -2757,8 +2781,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
- vmcs_conf->vmexit_ctrl = _vmexit_control;
vmcs_conf->vmentry_ctrl = _vmentry_control;
+ vmcs_conf->vmexit_ctrl = _vmexit_control;
+ vmcs_conf->secondary_vmexit_ctrl = _secondary_vmexit_control;
vmcs_conf->misc = misc_msr;
#if IS_ENABLED(CONFIG_HYPERV)
@@ -4449,6 +4474,11 @@ static u32 vmx_vmexit_ctrl(void)
~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
}
+static u64 vmx_secondary_vmexit_ctrl(void)
+{
+ return vmcs_config.secondary_vmexit_ctrl;
+}
+
void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4799,6 +4829,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vm_exit_controls_set(vmx, vmx_vmexit_ctrl());
+ if (cpu_has_secondary_vmexit_ctrls())
+ secondary_vm_exit_controls_set(vmx, vmx_secondary_vmexit_ctrl());
+
/* 22.2.1, 20.8.1 */
vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 2325f773a20b..cf3a6c116634 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -507,7 +507,11 @@ static inline u8 vmx_get_rvi(void)
VM_EXIT_LOAD_IA32_EFER | \
VM_EXIT_CLEAR_BNDCFGS | \
VM_EXIT_PT_CONCEAL_PIP | \
- VM_EXIT_CLEAR_IA32_RTIT_CTL)
+ VM_EXIT_CLEAR_IA32_RTIT_CTL | \
+ VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+
+#define KVM_REQUIRED_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
(PIN_BASED_EXT_INTR_MASK | \
@@ -612,6 +616,7 @@ static __always_inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##b
}
BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(secondary_vm_exit, SECONDARY_VM_EXIT_CONTROLS, 64)
BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
--
2.46.2
* [PATCH v3 04/27] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Setup FRED VM entry/exit controls in the global vmcs_config for proper
FRED VMCS fields management, i.e., load guest FRED state upon VM entry,
and save guest/load host FRED state during VM exit.
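The three controls map onto the transitions as follows (a summary of
the definitions added below):

    VM_ENTRY_LOAD_IA32_FRED           /* VM entry: load guest FRED state */
    SECONDARY_VM_EXIT_SAVE_IA32_FRED  /* VM exit:  save guest FRED state */
    SECONDARY_VM_EXIT_LOAD_IA32_FRED  /* VM exit:  load host FRED state  */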
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
---
Change since v2:
* Add FRED VM entry/exit controls consistency checks to the existing
consistency check framework (Sean Christopherson).
* Just do the FRED state load/store on entry/exit, even when it may be
unnecessary (Sean Christopherson).
---
arch/x86/include/asm/vmx.h | 4 ++++
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/vmx/vmx.h | 7 +++++--
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 57a37ea06a17..551f62892e1a 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -108,6 +108,9 @@
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
#define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS 0x80000000
+#define SECONDARY_VM_EXIT_SAVE_IA32_FRED BIT_ULL(0)
+#define SECONDARY_VM_EXIT_LOAD_IA32_FRED BIT_ULL(1)
+
#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff
#define VM_ENTRY_LOAD_DEBUG_CONTROLS 0x00000004
@@ -120,6 +123,7 @@
#define VM_ENTRY_LOAD_BNDCFGS 0x00010000
#define VM_ENTRY_PT_CONCEAL_PIP 0x00020000
#define VM_ENTRY_LOAD_IA32_RTIT_CTL 0x00040000
+#define VM_ENTRY_LOAD_IA32_FRED 0x00800000
#define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x000011ff
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ec548c75c3ef..efd2ad397ad2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2627,6 +2627,8 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
{ VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
{ VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
{ VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
+ { VM_ENTRY_LOAD_IA32_FRED, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS,
+ SECONDARY_VM_EXIT_SAVE_IA32_FRED | SECONDARY_VM_EXIT_LOAD_IA32_FRED },
};
memset(vmcs_conf, 0, sizeof(*vmcs_conf));
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index cf3a6c116634..e0d76d2460ef 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -485,7 +485,8 @@ static inline u8 vmx_get_rvi(void)
VM_ENTRY_LOAD_IA32_EFER | \
VM_ENTRY_LOAD_BNDCFGS | \
VM_ENTRY_PT_CONCEAL_PIP | \
- VM_ENTRY_LOAD_IA32_RTIT_CTL)
+ VM_ENTRY_LOAD_IA32_RTIT_CTL | \
+ VM_ENTRY_LOAD_IA32_FRED)
#define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
(VM_EXIT_SAVE_DEBUG_CONTROLS | \
@@ -511,7 +512,9 @@ static inline u8 vmx_get_rvi(void)
VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
#define KVM_REQUIRED_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
-#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS \
+ (SECONDARY_VM_EXIT_SAVE_IA32_FRED | \
+ SECONDARY_VM_EXIT_LOAD_IA32_FRED)
#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
(PIN_BASED_EXT_INTR_MASK | \
--
2.46.2
* [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Do not virtualize FRED if the FRED consistency checks fail, either on
broken hardware, or when running KVM on top of another hypervisor
before the underlying hypervisor implements nested FRED correctly.
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/capabilities.h | 7 +++++++
arch/x86/kvm/vmx/vmx.c | 3 +++
2 files changed, 10 insertions(+)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index e8f3ad0f79ee..2962a3bb9747 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -400,6 +400,13 @@ static inline bool vmx_pebs_supported(void)
return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
}
+static inline bool cpu_has_vmx_fred(void)
+{
+ /* No need to check FRED VM exit controls: entry/exit consistency checks guarantee them. */
+ return boot_cpu_has(X86_FEATURE_FRED) &&
+ (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED);
+}
+
static inline bool cpu_has_notify_vmexit(void)
{
return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index efd2ad397ad2..9b4c30db911f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8001,6 +8001,9 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
}
+ if (!cpu_has_vmx_fred())
+ kvm_cpu_cap_clear(X86_FEATURE_FRED);
+
if (!enable_pmu)
kvm_cpu_cap_clear(X86_FEATURE_PDCM);
kvm_caps.supported_perf_cap = vmx_get_perf_capabilities();
--
2.46.2
* [PATCH v3 06/27] x86/cea: Export per CPU variable cea_exception_stacks
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
The per CPU variable cea_exception_stacks contains the per CPU stacks
for NMI, #DB and #DF. KVM references it to set the host FRED RSP[123]
VMCS fields each time a vCPU is loaded onto a CPU, thus it needs to be
exported.
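A sketch of the intended consumer, which appears later in this series;
__this_cpu_ist_top_va() expands to an access to cea_exception_stacks,
which is why the export is needed when KVM is built as a module:

    /* In vmx_vcpu_load_vmcs(), set the per-CPU host FRED stack levels: */
    vmcs_write64(HOST_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
    vmcs_write64(HOST_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
    vmcs_write64(HOST_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));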
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/mm/cpu_entry_area.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 575f863f3c75..b8af71b67d9a 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -17,6 +17,7 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
#ifdef CONFIG_X86_64
static DEFINE_PER_CPU_PAGE_ALIGNED(struct exception_stacks, exception_stacks);
DEFINE_PER_CPU(struct cea_exception_stacks*, cea_exception_stacks);
+EXPORT_PER_CPU_SYMBOL(cea_exception_stacks);
static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, _cea_offset);
--
2.46.2
* [PATCH v3 07/27] KVM: VMX: Initialize VMCS FRED fields
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Initialize the host VMCS FRED fields with the host FRED MSR values and
the guest VMCS FRED fields to 0.
FRED CPU state is managed via 9 new FRED MSRs, as well as a few
existing CPU registers and MSRs, e.g., CR4.FRED. To support FRED
context management, new VMCS fields corresponding to most of the FRED
CPU state MSRs are added to both the host-state and guest-state areas
of VMCS.
Specifically, no VMCS fields are added for the FRED RSP0 and SSP0 MSRs,
because these two MSRs are used only during ring 3 event delivery, thus
KVM can run safely even with the guest FRED RSP0 and SSP0 values still
loaded. Save and restore of FRED RSP0 and SSP0 are therefore deferred.
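The host-state fields thus split into two groups (a summary of the
hunks below):

    /* Identical on all CPUs, written once per VMCS: */
    HOST_IA32_FRED_CONFIG, HOST_IA32_FRED_STKLVLS

    /* Per-CPU exception stacks (#DB, NMI, #DF), rewritten in
     * vmx_vcpu_load_vmcs() when a vCPU is loaded onto a new CPU: */
    HOST_IA32_FRED_RSP1, HOST_IA32_FRED_RSP2, HOST_IA32_FRED_RSP3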
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change since v2:
* Use structure kvm_host_values to keep host fred config & stack levels
(Sean Christopherson).
Changes since v1:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() to decouple
KVM's capability to virtualize a feature and host's enabling of a
feature (Chao Gao).
* Move guest FRED states init into __vmx_vcpu_reset() (Chao Gao).
---
arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 34 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.h | 3 +++
3 files changed, 53 insertions(+)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 551f62892e1a..5184e03945dd 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -289,12 +289,28 @@ enum vmcs_field {
GUEST_BNDCFGS_HIGH = 0x00002813,
GUEST_IA32_RTIT_CTL = 0x00002814,
GUEST_IA32_RTIT_CTL_HIGH = 0x00002815,
+ GUEST_IA32_FRED_CONFIG = 0x0000281a,
+ GUEST_IA32_FRED_RSP1 = 0x0000281c,
+ GUEST_IA32_FRED_RSP2 = 0x0000281e,
+ GUEST_IA32_FRED_RSP3 = 0x00002820,
+ GUEST_IA32_FRED_STKLVLS = 0x00002822,
+ GUEST_IA32_FRED_SSP1 = 0x00002824,
+ GUEST_IA32_FRED_SSP2 = 0x00002826,
+ GUEST_IA32_FRED_SSP3 = 0x00002828,
HOST_IA32_PAT = 0x00002c00,
HOST_IA32_PAT_HIGH = 0x00002c01,
HOST_IA32_EFER = 0x00002c02,
HOST_IA32_EFER_HIGH = 0x00002c03,
HOST_IA32_PERF_GLOBAL_CTRL = 0x00002c04,
HOST_IA32_PERF_GLOBAL_CTRL_HIGH = 0x00002c05,
+ HOST_IA32_FRED_CONFIG = 0x00002c08,
+ HOST_IA32_FRED_RSP1 = 0x00002c0a,
+ HOST_IA32_FRED_RSP2 = 0x00002c0c,
+ HOST_IA32_FRED_RSP3 = 0x00002c0e,
+ HOST_IA32_FRED_STKLVLS = 0x00002c10,
+ HOST_IA32_FRED_SSP1 = 0x00002c12,
+ HOST_IA32_FRED_SSP2 = 0x00002c14,
+ HOST_IA32_FRED_SSP3 = 0x00002c16,
PIN_BASED_VM_EXEC_CONTROL = 0x00004000,
CPU_BASED_VM_EXEC_CONTROL = 0x00004002,
EXCEPTION_BITMAP = 0x00004004,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9b4c30db911f..fee0df93e07c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1503,6 +1503,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
(unsigned long)(cpu_entry_stack(cpu) + 1));
}
+ /* Per-CPU FRED MSRs */
+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+#ifdef CONFIG_X86_64
+ vmcs_write64(HOST_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
+ vmcs_write64(HOST_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
+ vmcs_write64(HOST_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
+#endif
+ vmcs_write64(HOST_IA32_FRED_SSP1, 0);
+ vmcs_write64(HOST_IA32_FRED_SSP2, 0);
+ vmcs_write64(HOST_IA32_FRED_SSP3, 0);
+ }
+
vmx->loaded_vmcs->cpu = cpu;
}
}
@@ -4366,6 +4378,12 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
*/
vmcs_write16(HOST_DS_SELECTOR, 0);
vmcs_write16(HOST_ES_SELECTOR, 0);
+
+ /* FRED CONFIG and STKLVLS are the same on all CPUs. */
+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+ vmcs_write64(HOST_IA32_FRED_CONFIG, kvm_host.fred_config);
+ vmcs_write64(HOST_IA32_FRED_STKLVLS, kvm_host.fred_stklvls);
+ }
#else
vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS); /* 22.2.4 */
@@ -4876,6 +4894,17 @@ static void init_vmcs(struct vcpu_vmx *vmx)
}
vmx_setup_uret_msrs(vmx);
+
+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+ vmcs_write64(GUEST_IA32_FRED_CONFIG, 0);
+ vmcs_write64(GUEST_IA32_FRED_RSP1, 0);
+ vmcs_write64(GUEST_IA32_FRED_RSP2, 0);
+ vmcs_write64(GUEST_IA32_FRED_RSP3, 0);
+ vmcs_write64(GUEST_IA32_FRED_STKLVLS, 0);
+ vmcs_write64(GUEST_IA32_FRED_SSP1, 0);
+ vmcs_write64(GUEST_IA32_FRED_SSP2, 0);
+ vmcs_write64(GUEST_IA32_FRED_SSP3, 0);
+ }
}
static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
@@ -8614,6 +8643,11 @@ __init int vmx_hardware_setup(void)
kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler);
+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+ rdmsrl(MSR_IA32_FRED_CONFIG, kvm_host.fred_config);
+ rdmsrl(MSR_IA32_FRED_STKLVLS, kvm_host.fred_stklvls);
+ }
+
return r;
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a84c48ef5278..578fea05ff18 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -45,6 +45,9 @@ struct kvm_host_values {
u64 xcr0;
u64 xss;
u64 arch_capabilities;
+
+ u64 fred_config;
+ u64 fred_stklvls;
};
void kvm_spurious_fault(void);
--
2.46.2
* [PATCH v3 08/27] KVM: x86: Use KVM-governed feature framework to track "FRED enabled"
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Track "FRED enabled" via a governed feature flag to avoid the guest
CPUID lookups at runtime.
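The practical effect, as a minimal sketch: the bit is latched once when
userspace sets the guest CPUID, and hot paths then do an O(1) bitmap
test instead of scanning the guest CPUID entries:

    /* Latched once, from vmx_vcpu_after_set_cpuid(): */
    kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_FRED);

    /* Checked cheaply at runtime (example_fred_active is hypothetical): */
    static bool example_fred_active(struct kvm_vcpu *vcpu)
    {
            return guest_can_use(vcpu, X86_FEATURE_FRED); /* bitmap test */
    }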
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/governed_features.h | 1 +
arch/x86/kvm/vmx/vmx.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
index ad463b1ed4e4..507ca73e52e9 100644
--- a/arch/x86/kvm/governed_features.h
+++ b/arch/x86/kvm/governed_features.h
@@ -17,6 +17,7 @@ KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
KVM_GOVERNED_X86_FEATURE(VGIF)
KVM_GOVERNED_X86_FEATURE(VNMI)
KVM_GOVERNED_X86_FEATURE(LAM)
+KVM_GOVERNED_X86_FEATURE(FRED)
#undef KVM_GOVERNED_X86_FEATURE
#undef KVM_GOVERNED_FEATURE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fee0df93e07c..9acc9661fdb2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7893,6 +7893,7 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_FRED);
vmx_setup_uret_msrs(vmx);
--
2.46.2
* [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
No need to use MAX_POSSIBLE_PASSTHROUGH_MSRS in the definition of the
array vmx_possible_passthrough_msrs; as the macro name indicates, it is
the _possible_ maximum number of passthrough MSRs, not the actual count.
Use ARRAY_SIZE instead of MAX_POSSIBLE_PASSTHROUGH_MSRS where the
actual size of the array is needed, and add a BUILD_BUG_ON to make sure
the actual array size does not exceed the possible maximum number of
passthrough MSRs.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/vmx.c | 8 +++++---
arch/x86/kvm/vmx/vmx.h | 2 +-
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9acc9661fdb2..28cf89c97bda 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -167,7 +167,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
* List of MSRs that can be directly passed to the guest.
* In addition to these x2apic, PT and LBR MSRs are handled specially.
*/
-static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
+static u32 vmx_possible_passthrough_msrs[] = {
MSR_IA32_SPEC_CTRL,
MSR_IA32_PRED_CMD,
MSR_IA32_FLUSH_CMD,
@@ -4182,6 +4182,8 @@ void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
if (!cpu_has_vmx_msr_bitmap())
return;
+ BUILD_BUG_ON(ARRAY_SIZE(vmx_possible_passthrough_msrs) > MAX_POSSIBLE_PASSTHROUGH_MSRS);
+
/*
* Redo intercept permissions for MSRs that KVM is passing through to
* the guest. Disabling interception will check the new MSR filter and
@@ -7626,8 +7628,8 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
}
/* The MSR bitmap starts with all ones */
- bitmap_fill(vmx->shadow_msr_intercept.read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
- bitmap_fill(vmx->shadow_msr_intercept.write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
+ bitmap_fill(vmx->shadow_msr_intercept.read, ARRAY_SIZE(vmx_possible_passthrough_msrs));
+ bitmap_fill(vmx->shadow_msr_intercept.write, ARRAY_SIZE(vmx_possible_passthrough_msrs));
vmx_disable_intercept_for_msr(vcpu, MSR_IA32_TSC, MSR_TYPE_R);
#ifdef CONFIG_X86_64
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e0d76d2460ef..e7409f8f28b1 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -356,7 +356,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;
/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 16
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 64
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
--
2.46.2
* [PATCH v3 10/27] KVM: VMX: Set FRED MSR interception
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Add the FRED MSRs to the VMX passthrough MSR list and set up FRED MSR
interception.
8 FRED MSRs, i.e., MSR_IA32_FRED_RSP[123], MSR_IA32_FRED_STKLVLS,
MSR_IA32_FRED_SSP[123] and MSR_IA32_FRED_CONFIG, are all safe to pass
through, because each of them has a pair of corresponding host and
guest VMCS fields.
Both MSR_IA32_FRED_RSP0 and MSR_IA32_FRED_SSP0 are dedicated to user
level event delivery only, IOW they are NOT used by any kernel event
delivery nor by the execution of ERETS. Thus KVM can run safely with
the guest values in these 2 MSRs. As a result, save and restore of
their guest values are postponed until vCPU context switch, and their
host values are restored upon returning to userspace.
Save/restore of MSR_IA32_FRED_RSP0 is done in the next patch.
Note, as MSR_IA32_FRED_SSP0 is an alias of MSR_IA32_PL0_SSP, its save
and restore is done through the CET supervisor context management.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/vmx.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 28cf89c97bda..c10c955722a3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -176,6 +176,16 @@ static u32 vmx_possible_passthrough_msrs[] = {
MSR_FS_BASE,
MSR_GS_BASE,
MSR_KERNEL_GS_BASE,
+ MSR_IA32_FRED_RSP0,
+ MSR_IA32_FRED_RSP1,
+ MSR_IA32_FRED_RSP2,
+ MSR_IA32_FRED_RSP3,
+ MSR_IA32_FRED_STKLVLS,
+ MSR_IA32_FRED_SSP1,
+ MSR_IA32_FRED_SSP2,
+ MSR_IA32_FRED_SSP3,
+ MSR_IA32_FRED_CONFIG,
+ MSR_IA32_FRED_SSP0, /* Should be added through CET */
MSR_IA32_XFD,
MSR_IA32_XFD_ERR,
#endif
@@ -7880,6 +7890,28 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}
+static void vmx_set_intercept_for_fred_msr(struct kvm_vcpu *vcpu)
+{
+ bool flag = !guest_can_use(vcpu, X86_FEATURE_FRED);
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP1, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP2, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP3, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_STKLVLS, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP1, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP2, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP3, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_CONFIG, MSR_TYPE_RW, flag);
+
+ /*
+ * flag = !(CET.SUPERVISOR_SHADOW_STACK || FRED)
+ *
+ * A possible optimization is to intercept SSPs when FRED && !CET.SUPERVISOR_SHADOW_STACK.
+ */
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP0, MSR_TYPE_RW, flag);
+}
+
void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7957,6 +7989,8 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
+
+ vmx_set_intercept_for_fred_msr(vcpu);
}
static __init u64 vmx_get_perf_capabilities(void)
--
2.46.2
* [PATCH v3 11/27] KVM: VMX: Save/restore guest FRED RSP0
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Save guest FRED RSP0 in vmx_prepare_switch_to_host() and restore it in
vmx_prepare_switch_to_guest(). Because MSR_IA32_FRED_RSP0 is passed
through to the guest, the guest value is volatile/unknown while guest
state is loaded.
Note, host FRED RSP0 is restored in arch_exit_to_user_mode_prepare(),
regardless of whether it is modified in KVM.
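The interaction with the host side works roughly as follows (a sketch
based on the host FRED helpers fred_sync_rsp0() and fred_update_rsp0();
details may differ):

    /* vmx_prepare_switch_to_host(), this patch: record the guest value
     * and update the per-CPU shadow so the host knows the MSR is stale. */
    vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
    fred_sync_rsp0(vmx->msr_guest_fred_rsp0);

    /* Host exit-to-userspace path: rewrite the MSR only if the shadow
     * no longer matches the expected host value. */
    static __always_inline void fred_update_rsp0(void)
    {
            unsigned long rsp0 = (unsigned long)task_stack_page(current) +
                                 THREAD_SIZE;

            if (__this_cpu_read(fred_rsp0) != rsp0) {
                    wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
                    __this_cpu_write(fred_rsp0, rsp0);
            }
    }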
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Changes since v2:
* KVM only needs to save/restore guest FRED RSP0 now as host FRED RSP0
is restored in arch_exit_to_user_mode_prepare() (Sean Christopherson).
Changes since v1:
* Don't use guest_cpuid_has() in vmx_prepare_switch_to_{host,guest}(),
which are called from IRQ-disabled context (Chao Gao).
* Reset msr_guest_fred_rsp0 in __vmx_vcpu_reset() (Chao Gao).
---
arch/x86/kvm/vmx/vmx.c | 8 ++++++++
arch/x86/kvm/vmx/vmx.h | 1 +
2 files changed, 9 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c10c955722a3..c638492ebd59 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1348,6 +1348,9 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
}
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+
+ if (cpu_feature_enabled(X86_FEATURE_FRED) && guest_can_use(vcpu, X86_FEATURE_FRED))
+ wrmsrns(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
#else
savesegment(fs, fs_sel);
savesegment(gs, gs_sel);
@@ -1392,6 +1395,11 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
invalidate_tss_limit();
#ifdef CONFIG_X86_64
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
+
+ if (cpu_feature_enabled(X86_FEATURE_FRED) && guest_can_use(&vmx->vcpu, X86_FEATURE_FRED)) {
+ vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
+ fred_sync_rsp0(vmx->msr_guest_fred_rsp0);
+ }
#endif
load_fixmap_gdt(raw_smp_processor_id());
vmx->guest_state_loaded = false;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e7409f8f28b1..9ba960472c5f 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -277,6 +277,7 @@ struct vcpu_vmx {
#ifdef CONFIG_X86_64
u64 msr_host_kernel_gs_base;
u64 msr_guest_kernel_gs_base;
+ u64 msr_guest_fred_rsp0;
#endif
u64 spec_ctrl;
--
2.46.2
* [PATCH v3 12/27] KVM: VMX: Add support for FRED context save/restore
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Handle FRED MSR access requests, allowing the FRED context to be read
and written from both the host and the guest.
During VM save/restore and live migration, FRED context needs to be
saved/restored, which requires FRED MSRs to be accessed from userspace,
e.g., Qemu.
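For instance, a VMM can read one FRED MSR with the standard KVM_GET_MSRS
vCPU ioctl (an illustrative userspace snippet; vcpu_fd is assumed to be
an open vCPU file descriptor):

    struct {
            struct kvm_msrs hdr;
            struct kvm_msr_entry entries[1];
    } msrs = {
            .hdr.nmsrs        = 1,
            .entries[0].index = MSR_IA32_FRED_CONFIG,
    };

    if (ioctl(vcpu_fd, KVM_GET_MSRS, &msrs) != 1)
            err(1, "KVM_GET_MSRS");
    /* msrs.entries[0].data now holds the guest's FRED configuration. */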
Note, handling of MSR_IA32_FRED_SSP0, i.e., MSR_IA32_PL0_SSP, is not
added yet; that is done in the KVM CET patch set.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Changes since v2:
* Add a helper to convert FRED MSR index to VMCS field encoding to
make the code more compact (Chao Gao).
* Get rid of the "host_initiated" check because userspace has to set
CPUID before MSRs (Chao Gao & Sean Christopherson).
* Address a few cleanup comments (Sean Christopherson).
Changes since v1:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
* Fail host requested FRED MSRs access if KVM cannot virtualize FRED
(Chao Gao).
* Handle the case FRED MSRs are valid but KVM cannot virtualize FRED
(Chao Gao).
* Add sanity checks when writing to FRED MSRs.
---
arch/x86/kvm/vmx/vmx.c | 48 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 25 ++++++++++++++++++++++
2 files changed, 73 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c638492ebd59..65ab26b13d24 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1424,6 +1424,24 @@ static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
preempt_enable();
vmx->msr_guest_kernel_gs_base = data;
}
+
+static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx)
+{
+ preempt_disable();
+ if (vmx->guest_state_loaded)
+ vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
+ preempt_enable();
+ return vmx->msr_guest_fred_rsp0;
+}
+
+static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
+{
+ preempt_disable();
+ if (vmx->guest_state_loaded)
+ wrmsrns(MSR_IA32_FRED_RSP0, data);
+ preempt_enable();
+ vmx->msr_guest_fred_rsp0 = data;
+}
#endif
static void grow_ple_window(struct kvm_vcpu *vcpu)
@@ -2036,6 +2054,24 @@ int vmx_get_feature_msr(u32 msr, u64 *data)
}
}
+#ifdef CONFIG_X86_64
+static u32 fred_msr_vmcs_fields[] = {
+ GUEST_IA32_FRED_RSP1,
+ GUEST_IA32_FRED_RSP2,
+ GUEST_IA32_FRED_RSP3,
+ GUEST_IA32_FRED_STKLVLS,
+ GUEST_IA32_FRED_SSP1,
+ GUEST_IA32_FRED_SSP2,
+ GUEST_IA32_FRED_SSP3,
+ GUEST_IA32_FRED_CONFIG,
+};
+
+static u32 fred_msr_to_vmcs(u32 msr)
+{
+ return fred_msr_vmcs_fields[msr - MSR_IA32_FRED_RSP1];
+}
+#endif
+
/*
* Reads an msr value (of 'msr_info->index') into 'msr_info->data'.
* Returns 0 on success, non-0 otherwise.
@@ -2058,6 +2094,12 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_KERNEL_GS_BASE:
msr_info->data = vmx_read_guest_kernel_gs_base(vmx);
break;
+ case MSR_IA32_FRED_RSP0:
+ msr_info->data = vmx_read_guest_fred_rsp0(vmx);
+ break;
+ case MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG:
+ msr_info->data = vmcs_read64(fred_msr_to_vmcs(msr_info->index));
+ break;
#endif
case MSR_EFER:
return kvm_get_msr_common(vcpu, msr_info);
@@ -2265,6 +2307,12 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmx_update_exception_bitmap(vcpu);
}
break;
+ case MSR_IA32_FRED_RSP0:
+ vmx_write_guest_fred_rsp0(vmx, data);
+ break;
+ case MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG:
+ vmcs_write64(fred_msr_to_vmcs(msr_index), data);
+ break;
#endif
case MSR_IA32_SYSENTER_CS:
if (is_guest_mode(vcpu))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e8de9f4734a6..b31ebafbe0bc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -320,6 +320,9 @@ static const u32 msrs_to_save_base[] = {
MSR_STAR,
#ifdef CONFIG_X86_64
MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
+ MSR_IA32_FRED_RSP0, MSR_IA32_FRED_RSP1, MSR_IA32_FRED_RSP2,
+ MSR_IA32_FRED_RSP3, MSR_IA32_FRED_STKLVLS, MSR_IA32_FRED_SSP1,
+ MSR_IA32_FRED_SSP2, MSR_IA32_FRED_SSP3, MSR_IA32_FRED_CONFIG,
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEAT_CTL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
@@ -1891,6 +1894,20 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
data = (u32)data;
break;
+ case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+ if (!guest_can_use(vcpu, X86_FEATURE_FRED))
+ return 1;
+
+ if (index != MSR_IA32_FRED_STKLVLS && is_noncanonical_address(data, vcpu))
+ return 1;
+ if ((index >= MSR_IA32_FRED_RSP0 && index <= MSR_IA32_FRED_RSP3) &&
+ (data & GENMASK_ULL(5, 0)))
+ return 1;
+ if ((index >= MSR_IA32_FRED_SSP1 && index <= MSR_IA32_FRED_SSP3) &&
+ (data & GENMASK_ULL(2, 0)))
+ return 1;
+
+ break;
}
msr.data = data;
@@ -1935,6 +1952,10 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
!guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
return 1;
break;
+ case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+ if (!guest_can_use(vcpu, X86_FEATURE_FRED))
+ return 1;
+ break;
}
msr.index = index;
@@ -7460,6 +7481,10 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
return;
break;
+ case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+ if (!kvm_cpu_cap_has(X86_FEATURE_FRED))
+ return;
+ break;
default:
break;
}
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 13/27] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (11 preceding siblings ...)
2024-10-01 5:00 ` [PATCH v3 12/27] KVM: VMX: Add support for FRED context save/restore Xin Li (Intel)
@ 2024-10-01 5:00 ` Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 14/27] KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM Xin Li (Intel)
` (15 subsequent siblings)
28 siblings, 0 replies; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Add is_fred_enabled() to detect if FRED is enabled on a vCPU.
Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: removed the "kvm_" prefix from the function name ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/kvm_cache_regs.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index b1eb46e26b2e..386c79f5dcb8 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -187,6 +187,21 @@ static __always_inline bool kvm_is_cr4_bit_set(struct kvm_vcpu *vcpu,
return !!kvm_read_cr4_bits(vcpu, cr4_bit);
}
+/*
+ * It's enough to check just CR4.FRED (X86_CR4_FRED) to tell if
+ * a vCPU is running with FRED enabled, because:
+ * 1) CR4.FRED can be set to 1 only _after_ IA32_EFER.LMA = 1.
+ * 2) To leave IA-32e mode, CR4.FRED must be cleared first.
+ */
+static inline bool is_fred_enabled(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_X86_64
+ return kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED);
+#else
+ return false;
+#endif
+}
+
static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
{
if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 14/27] KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (12 preceding siblings ...)
2024-10-01 5:00 ` [PATCH v3 13/27] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU Xin Li (Intel)
@ 2024-10-01 5:00 ` Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 15/27] KVM: VMX: Virtualize FRED event_data Xin Li (Intel)
` (14 subsequent siblings)
28 siblings, 0 replies; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Sean Christopherson <seanjc@google.com>
Pass XFD_ERR via KVM's exception payload mechanism when injecting an #NM
after interception so that XFD_ERR can be propagated to FRED's event_data
field without needing a dedicated field (which would need to be migrated).
For non-FRED vCPUs, this is a glorified NOP as
kvm_deliver_exception_payload() will simply do nothing (which is desirable
and correct).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/vmx.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 65ab26b13d24..686006fe6d45 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5336,6 +5336,12 @@ bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu)
(kvm_get_rflags(vcpu) & X86_EFLAGS_AC);
}
+static bool is_xfd_nm_fault(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.guest_fpu.fpstate->xfd &&
+ !kvm_is_cr0_bit_set(vcpu, X86_CR0_TS);
+}
+
static int handle_exception_nmi(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5362,7 +5368,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
* point.
*/
if (is_nm_fault(intr_info)) {
- kvm_queue_exception(vcpu, NM_VECTOR);
+ kvm_queue_exception_p(vcpu, NM_VECTOR,
+ is_xfd_nm_fault(vcpu) ? vcpu->arch.guest_fpu.xfd_err : 0);
return 1;
}
@@ -7110,14 +7117,13 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
*
* Update the guest's XFD_ERR if and only if XFD is enabled, as the #NM
* interception may have been caused by L1 interception. Per the SDM,
- * XFD_ERR is not modified if CR0.TS=1.
+ * XFD_ERR is not modified for non-XFD #NM, i.e. if CR0.TS=1.
*
* Note, XFD_ERR is updated _before_ the #NM interception check, i.e.
* unlike CR2 and DR6, the value is not a payload that is attached to
* the #NM exception.
*/
- if (vcpu->arch.guest_fpu.fpstate->xfd &&
- !kvm_is_cr0_bit_set(vcpu, X86_CR0_TS))
+ if (is_xfd_nm_fault(vcpu))
rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 15/27] KVM: VMX: Virtualize FRED event_data
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (13 preceding siblings ...)
2024-10-01 5:00 ` [PATCH v3 14/27] KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM Xin Li (Intel)
@ 2024-10-01 5:00 ` Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
` (13 subsequent siblings)
28 siblings, 0 replies; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Set injected-event data when injecting a #PF, #DB, or #NM caused
by extended feature disable using FRED event delivery, and save
original-event data so it can later be reused as injected-event data.
Unlike IDT event delivery, which uses extra CPU registers to carry
part of an event context, e.g., %cr2 for #PF, FRED saves a complete
event context in its stack frame, e.g., FRED saves the faulting linear
address of a #PF into the event data field of its stack frame.
Thus a new VMX control field called injected-event data is added
to provide the event data that will be pushed into a FRED stack
frame for VM entries that inject an event using FRED event delivery.
In addition, a new VM exit information field called original-event
data is added to store the event data that would have been saved into
a FRED stack frame for VM exits that occur during FRED event delivery.
After such a VM exit is handled and the original event is to be
delivered again, the data in the original-event data VMCS field needs
to be copied into the injected-event data VMCS field so that the
original event is re-injected with the correct event data.
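As a rough sketch only (mirroring what kvm_deliver_exception_payload()
does in the diff below), the event data for the affected vectors is:

/* Sketch: per-vector FRED event data, per this patch. */
static u64 fred_event_data_sketch(struct kvm_vcpu *vcpu, u8 vector)
{
	switch (vector) {
	case PF_VECTOR:	/* faulting linear address, i.e. the CR2 value */
		return vcpu->arch.cr2;
	case DB_VECTOR:	/* DR6 image, with bit 12 cleared to follow the
			 * pending-debug-exceptions polarity */
		return vcpu->arch.dr6 & ~BIT(12);
	case NM_VECTOR:	/* XFD error code for an XFD-induced #NM, else 0 */
		return vcpu->arch.guest_fpu.xfd_err;
	default:
		return 0;
	}
}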
Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: reworked event data injection for nested ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Changes since v2:
* Rework event data injection for nested (Chao Gao & Sean Christopherson).
Changes since v1:
* Document event data should be equal to CR2/DR6/IA32_XFD_ERR instead
of using WARN_ON() (Chao Gao).
* Zero event data if a #NM was not caused by extended feature disable
(Chao Gao).
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/include/asm/vmx.h | 4 ++++
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 22 ++++++++++++++++++----
arch/x86/kvm/x86.c | 16 +++++++++++++++-
5 files changed, 40 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 43b08d12cb32..b9b82aaea9a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -736,6 +736,7 @@ struct kvm_queued_exception {
u32 error_code;
unsigned long payload;
bool has_payload;
+ u64 event_data;
};
struct kvm_vcpu_arch {
@@ -2113,7 +2114,7 @@ void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
- bool has_error_code, u32 error_code);
+ bool has_error_code, u32 error_code, u64 event_data);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
struct x86_exception *fault);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 5184e03945dd..3696e763c231 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -265,8 +265,12 @@ enum vmcs_field {
PID_POINTER_TABLE_HIGH = 0x00002043,
SECONDARY_VM_EXIT_CONTROLS = 0x00002044,
SECONDARY_VM_EXIT_CONTROLS_HIGH = 0x00002045,
+ INJECTED_EVENT_DATA = 0x00002052,
+ INJECTED_EVENT_DATA_HIGH = 0x00002053,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x00002401,
+ ORIGINAL_EVENT_DATA = 0x00002404,
+ ORIGINAL_EVENT_DATA_HIGH = 0x00002405,
VMCS_LINK_POINTER = 0x00002800,
VMCS_LINK_POINTER_HIGH = 0x00002801,
GUEST_IA32_DEBUGCTL = 0x00002802,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d9e2568bcd54..7fa8f842f116 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4126,7 +4126,7 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
kvm_requeue_exception(vcpu, vector,
exitintinfo & SVM_EXITINTINFO_VALID_ERR,
- error_code);
+ error_code, 0);
break;
}
case SVM_EXITINTINFO_TYPE_INTR:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 686006fe6d45..d81144bd648f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1915,6 +1915,9 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
+ if (is_fred_enabled(vcpu))
+ vmcs_write64(INJECTED_EVENT_DATA, ex->event_data);
+
vmx_clear_hlt(vcpu);
}
@@ -7241,7 +7244,8 @@ static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
u32 idt_vectoring_info,
int instr_len_field,
- int error_code_field)
+ int error_code_field,
+ int event_data_field)
{
u8 vector;
int type;
@@ -7276,13 +7280,17 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
fallthrough;
case INTR_TYPE_HARD_EXCEPTION: {
u32 error_code = 0;
+ u64 event_data = 0;
if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
error_code = vmcs_read32(error_code_field);
+ if (is_fred_enabled(vcpu))
+ event_data = vmcs_read64(event_data_field);
kvm_requeue_exception(vcpu, vector,
idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
- error_code);
+ error_code,
+ event_data);
break;
}
case INTR_TYPE_SOFT_INTR:
@@ -7300,7 +7308,8 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
{
__vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
VM_EXIT_INSTRUCTION_LEN,
- IDT_VECTORING_ERROR_CODE);
+ IDT_VECTORING_ERROR_CODE,
+ ORIGINAL_EVENT_DATA);
}
void vmx_cancel_injection(struct kvm_vcpu *vcpu)
@@ -7308,7 +7317,8 @@ void vmx_cancel_injection(struct kvm_vcpu *vcpu)
__vmx_complete_interrupts(vcpu,
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
VM_ENTRY_INSTRUCTION_LEN,
- VM_ENTRY_EXCEPTION_ERROR_CODE);
+ VM_ENTRY_EXCEPTION_ERROR_CODE,
+ INJECTED_EVENT_DATA);
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
}
@@ -7439,6 +7449,10 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
vmx_disable_fb_clear(vmx);
+ /*
+ * Note, even though FRED delivers the faulting linear address via the
+ * event data field on the stack, CR2 is still updated.
+ */
if (vcpu->arch.cr2 != native_read_cr2())
native_write_cr2(vcpu->arch.cr2);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b31ebafbe0bc..7a55c1eb5297 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -810,9 +810,22 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
* breakpoint), it is reserved and must be zero in DR6.
*/
vcpu->arch.dr6 &= ~BIT(12);
+
+ /*
+ * FRED #DB event data matches DR6, but follows the polarity of
+ * VMX's pending debug exceptions, not DR6.
+ */
+ ex->event_data = ex->payload & ~BIT(12);
+ break;
+ case NM_VECTOR:
+ ex->event_data = ex->payload;
break;
case PF_VECTOR:
vcpu->arch.cr2 = ex->payload;
+ ex->event_data = ex->payload;
+ break;
+ default:
+ ex->event_data = 0;
break;
}
@@ -920,7 +933,7 @@ static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
}
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
- bool has_error_code, u32 error_code)
+ bool has_error_code, u32 error_code, u64 event_data)
{
/*
@@ -945,6 +958,7 @@ void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.error_code = error_code;
vcpu->arch.exception.has_payload = false;
vcpu->arch.exception.payload = 0;
+ vcpu->arch.exception.event_data = event_data;
}
EXPORT_SYMBOL_GPL(kvm_requeue_exception);
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (14 preceding siblings ...)
2024-10-01 5:00 ` [PATCH v3 15/27] KVM: VMX: Virtualize FRED event_data Xin Li (Intel)
@ 2024-10-01 5:00 ` Xin Li (Intel)
2024-10-24 6:24 ` Chao Gao
2024-10-01 5:01 ` [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED Xin Li (Intel)
` (12 subsequent siblings)
28 siblings, 1 reply; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:00 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Set the VMX nested-exception bit in the VM-entry interruption information
VMCS field when injecting a nested exception using FRED event delivery
to ensure:
1) The nested exception is injected on the correct stack level.
2) The nested bit defined in the FRED stack frame is set.
The event stack level used by FRED event delivery depends on whether the
event was a nested exception encountered during delivery of another event,
because a nested exception is "regarded" as happening in ring 0. E.g.,
when #PF is configured to use stack level 1 in the IA32_FRED_STKLVLS MSR:
- nested #PF will be delivered on stack level 1 when encountered in
ring 3.
- normal #PF will be delivered on stack level 0 when encountered in
ring 3.
The VMX nested-exception support ensures the correct event stack level is
chosen when a VM entry injects a nested exception.
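To make the stack-level selection concrete, a sketch assuming the FRED
spec's IA32_FRED_STKLVLS layout (two bits per exception vector):

/*
 * Sketch, assuming bits [2v+1:2v] of IA32_FRED_STKLVLS hold the
 * configured stack level (0-3) for exception vector v, so #PF
 * (vector 14) uses bits 29:28.
 */
static unsigned int fred_stklvl_sketch(u64 stklvls, unsigned int vector)
{
	return (stklvls >> (vector * 2)) & 0x3;
}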
Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: reworked kvm_requeue_exception() to simplify the code changes ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change since v2:
* Rework kvm_requeue_exception() to simplify the code changes (Sean
Christopherson).
Change since v1:
* Set the nested flag when there is an original interrupt (Chao Gao).
---
arch/x86/include/asm/kvm_host.h | 4 +++-
arch/x86/include/asm/vmx.h | 5 ++++-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 6 +++++-
arch/x86/kvm/x86.c | 14 +++++++++++++-
arch/x86/kvm/x86.h | 1 +
6 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b9b82aaea9a3..3830084b569b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -736,6 +736,7 @@ struct kvm_queued_exception {
u32 error_code;
unsigned long payload;
bool has_payload;
+ bool nested;
u64 event_data;
};
@@ -2114,7 +2115,8 @@ void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
- bool has_error_code, u32 error_code, u64 event_data);
+ bool has_error_code, u32 error_code, bool nested,
+ u64 event_data);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
struct x86_exception *fault);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 3696e763c231..06c52fee5dcd 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -137,6 +137,7 @@
#define VMX_BASIC_DUAL_MONITOR_TREATMENT BIT_ULL(49)
#define VMX_BASIC_INOUT BIT_ULL(54)
#define VMX_BASIC_TRUE_CTLS BIT_ULL(55)
+#define VMX_BASIC_NESTED_EXCEPTION BIT_ULL(58)
static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
{
@@ -416,13 +417,15 @@ enum vmcs_field {
#define INTR_INFO_INTR_TYPE_MASK 0x700 /* 10:8 */
#define INTR_INFO_DELIVER_CODE_MASK 0x800 /* 11 */
#define INTR_INFO_UNBLOCK_NMI 0x1000 /* 12 */
+#define INTR_INFO_NESTED_EXCEPTION_MASK 0x2000 /* 13 */
#define INTR_INFO_VALID_MASK 0x80000000 /* 31 */
-#define INTR_INFO_RESVD_BITS_MASK 0x7ffff000
+#define INTR_INFO_RESVD_BITS_MASK 0x7fffd000
#define VECTORING_INFO_VECTOR_MASK INTR_INFO_VECTOR_MASK
#define VECTORING_INFO_TYPE_MASK INTR_INFO_INTR_TYPE_MASK
#define VECTORING_INFO_DELIVER_CODE_MASK INTR_INFO_DELIVER_CODE_MASK
#define VECTORING_INFO_VALID_MASK INTR_INFO_VALID_MASK
+#define VECTORING_INFO_NESTED_EXCEPTION_MASK INTR_INFO_NESTED_EXCEPTION_MASK
#define INTR_TYPE_EXT_INTR (EVENT_TYPE_EXTINT << 8) /* external interrupt */
#define INTR_TYPE_RESERVED (EVENT_TYPE_RESERVED << 8) /* reserved */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7fa8f842f116..e479e0208efe 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4126,7 +4126,7 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
kvm_requeue_exception(vcpu, vector,
exitintinfo & SVM_EXITINTINFO_VALID_ERR,
- error_code, 0);
+ error_code, false, 0);
break;
}
case SVM_EXITINTINFO_TYPE_INTR:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d81144bd648f..03f42b218554 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1910,8 +1910,11 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
vmx->vcpu.arch.event_exit_inst_len);
intr_info |= INTR_TYPE_SOFT_EXCEPTION;
- } else
+ } else {
intr_info |= INTR_TYPE_HARD_EXCEPTION;
+ if (ex->nested)
+ intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
+ }
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
@@ -7290,6 +7293,7 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
kvm_requeue_exception(vcpu, vector,
idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
error_code,
+ idt_vectoring_info & VECTORING_INFO_NESTED_EXCEPTION_MASK,
event_data);
break;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7a55c1eb5297..8546629166e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -874,6 +874,11 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.pending = true;
vcpu->arch.exception.injected = false;
+ vcpu->arch.exception.nested = vcpu->arch.exception.nested ||
+ (is_fred_enabled(vcpu) &&
+ (vcpu->arch.nmi_injected ||
+ vcpu->arch.interrupt.injected));
+
vcpu->arch.exception.has_error_code = has_error;
vcpu->arch.exception.vector = nr;
vcpu->arch.exception.error_code = error_code;
@@ -903,8 +908,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.injected = false;
vcpu->arch.exception.pending = false;
+ /* #DF is NOT a nested event, per its definition. */
+ vcpu->arch.exception.nested = false;
+
kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
} else {
+ vcpu->arch.exception.nested = is_fred_enabled(vcpu);
+
/* replace previous exception with a new one in a hope
that instruction re-execution will regenerate lost
exception */
@@ -933,7 +943,8 @@ static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
}
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
- bool has_error_code, u32 error_code, u64 event_data)
+ bool has_error_code, u32 error_code, bool nested,
+ u64 event_data)
{
/*
@@ -958,6 +969,7 @@ void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.error_code = error_code;
vcpu->arch.exception.has_payload = false;
vcpu->arch.exception.payload = 0;
+ vcpu->arch.exception.nested = nested;
vcpu->arch.exception.event_data = event_data;
}
EXPORT_SYMBOL_GPL(kvm_requeue_exception);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 578fea05ff18..992e73ee2ec5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -134,6 +134,7 @@ static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
{
vcpu->arch.exception.pending = false;
vcpu->arch.exception.injected = false;
+ vcpu->arch.exception.nested = false;
vcpu->arch.exception_vmexit.pending = false;
}
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (15 preceding siblings ...)
2024-10-01 5:00 ` [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2024-10-24 7:18 ` Chao Gao
2024-10-01 5:01 ` [PATCH v3 18/27] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
` (11 subsequent siblings)
28 siblings, 1 reply; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
The CR4.FRED bit, i.e., CR4[32], is no longer a reserved bit when the
guest can use FRED, i.e., when:
1) All of FRED KVM support is in place.
2) The guest enumerates FRED.
Otherwise it remains a reserved bit.
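For reference, a sketch of the CR4.FRED bit definition this relies on
(assumed to mirror the base FRED series; the bit only exists in 64-bit
mode):

#define X86_CR4_FRED_BIT	32 /* enable FRED kernel entry */
#ifdef __x86_64__
#define X86_CR4_FRED		_BITUL(X86_CR4_FRED_BIT)
#else
#define X86_CR4_FRED		0
#endif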
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Changes since v2:
* Don't allow CR4.FRED=1 before all of FRED KVM support is in place
(Sean Christopherson).
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/vmx/vmx.c | 4 ++++
arch/x86/kvm/x86.h | 2 ++
3 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3830084b569b..87f9f0b6cf3c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -136,7 +136,7 @@
| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
| X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
| X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
- | X86_CR4_LAM_SUP))
+ | X86_CR4_LAM_SUP | X86_CR4_FRED))
#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 03f42b218554..bfdd10773136 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8009,6 +8009,10 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_FRED);
+ /* Don't allow CR4.FRED=1 before all of FRED KVM support is in place. */
+ if (!guest_can_use(vcpu, X86_FEATURE_FRED))
+ vcpu->arch.cr4_guest_rsvd_bits |= X86_CR4_FRED;
+
vmx_setup_uret_msrs(vmx);
if (cpu_has_secondary_exec_ctrls())
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 992e73ee2ec5..0ed91512b757 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -561,6 +561,8 @@ enum kvm_msr_access {
__reserved_bits |= X86_CR4_PCIDE; \
if (!__cpu_has(__c, X86_FEATURE_LAM)) \
__reserved_bits |= X86_CR4_LAM_SUP; \
+ if (!__cpu_has(__c, X86_FEATURE_FRED)) \
+ __reserved_bits |= X86_CR4_FRED; \
__reserved_bits; \
})
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 18/27] KVM: VMX: Dump FRED context in dump_vmcs()
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (16 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2024-10-24 7:23 ` Chao Gao
2024-10-01 5:01 ` [PATCH v3 19/27] KVM: x86: Allow FRED/LKGS to be advertised to guests Xin Li (Intel)
` (10 subsequent siblings)
28 siblings, 1 reply; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Add FRED-related VMCS fields to dump_vmcs() to dump the FRED context.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change since v2:
* Use (vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED) instead of is_fred_enabled()
(Chao Gao).
Change since v1:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
* Dump guest FRED states only if guest has FRED enabled (Nikolay Borisov).
---
arch/x86/kvm/vmx/vmx.c | 40 +++++++++++++++++++++++++++++++++-------
1 file changed, 33 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bfdd10773136..ef807194ccbd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6399,7 +6399,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 vmentry_ctl, vmexit_ctl;
u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
- u64 tertiary_exec_control;
+ u64 tertiary_exec_control, secondary_vmexit_ctl;
unsigned long cr4;
int efer_slot;
@@ -6410,6 +6410,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmentry_ctl = vmcs_read32(VM_ENTRY_CONTROLS);
vmexit_ctl = vmcs_read32(VM_EXIT_CONTROLS);
+ secondary_vmexit_ctl = cpu_has_secondary_vmexit_ctrls() ?
+ vmcs_read64(SECONDARY_VM_EXIT_CONTROLS) : 0;
cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
cr4 = vmcs_readl(GUEST_CR4);
@@ -6456,6 +6458,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmx_dump_sel("LDTR:", GUEST_LDTR_SELECTOR);
vmx_dump_dtsel("IDTR:", GUEST_IDTR_LIMIT);
vmx_dump_sel("TR: ", GUEST_TR_SELECTOR);
+ if (vmentry_ctl & VM_ENTRY_LOAD_IA32_FRED)
+ pr_err("FRED guest: config=0x%016llx, stack_levels=0x%016llx\n"
+ "RSP0=0x%016llx, RSP1=0x%016llx\n"
+ "RSP2=0x%016llx, RSP3=0x%016llx\n",
+ vmcs_read64(GUEST_IA32_FRED_CONFIG),
+ vmcs_read64(GUEST_IA32_FRED_STKLVLS),
+ __rdmsr(MSR_IA32_FRED_RSP0),
+ vmcs_read64(GUEST_IA32_FRED_RSP1),
+ vmcs_read64(GUEST_IA32_FRED_RSP2),
+ vmcs_read64(GUEST_IA32_FRED_RSP3));
efer_slot = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest, MSR_EFER);
if (vmentry_ctl & VM_ENTRY_LOAD_IA32_EFER)
pr_err("EFER= 0x%016llx\n", vmcs_read64(GUEST_IA32_EFER));
@@ -6503,6 +6515,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmcs_readl(HOST_TR_BASE));
pr_err("GDTBase=%016lx IDTBase=%016lx\n",
vmcs_readl(HOST_GDTR_BASE), vmcs_readl(HOST_IDTR_BASE));
+ if (vmexit_ctl & SECONDARY_VM_EXIT_LOAD_IA32_FRED)
+ pr_err("FRED host: config=0x%016llx, stack_levels=0x%016llx\n"
+ "RSP0=0x%016lx, RSP1=0x%016llx\n"
+ "RSP2=0x%016llx, RSP3=0x%016llx\n",
+ vmcs_read64(HOST_IA32_FRED_CONFIG),
+ vmcs_read64(HOST_IA32_FRED_STKLVLS),
+ (unsigned long)task_stack_page(current) + THREAD_SIZE,
+ vmcs_read64(HOST_IA32_FRED_RSP1),
+ vmcs_read64(HOST_IA32_FRED_RSP2),
+ vmcs_read64(HOST_IA32_FRED_RSP3));
pr_err("CR0=%016lx CR3=%016lx CR4=%016lx\n",
vmcs_readl(HOST_CR0), vmcs_readl(HOST_CR3),
vmcs_readl(HOST_CR4));
@@ -6524,25 +6546,29 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
pr_err("*** Control State ***\n");
pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
- pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
- pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
+ pr_err("PinBased=0x%08x EntryControls=0x%08x\n",
+ pin_based_exec_ctrl, vmentry_ctl);
+ pr_err("ExitControls=0x%08x SecondaryExitControls=0x%016llx\n",
+ vmexit_ctl, secondary_vmexit_ctl);
pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
vmcs_read32(EXCEPTION_BITMAP),
vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH));
- pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x\n",
+ pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x event_data=%016llx\n",
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE),
- vmcs_read32(VM_ENTRY_INSTRUCTION_LEN));
+ vmcs_read32(VM_ENTRY_INSTRUCTION_LEN),
+ kvm_cpu_cap_has(X86_FEATURE_FRED) ? vmcs_read64(INJECTED_EVENT_DATA) : 0);
pr_err("VMExit: intr_info=%08x errcode=%08x ilen=%08x\n",
vmcs_read32(VM_EXIT_INTR_INFO),
vmcs_read32(VM_EXIT_INTR_ERROR_CODE),
vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
pr_err(" reason=%08x qualification=%016lx\n",
vmcs_read32(VM_EXIT_REASON), vmcs_readl(EXIT_QUALIFICATION));
- pr_err("IDTVectoring: info=%08x errcode=%08x\n",
+ pr_err("IDTVectoring: info=%08x errcode=%08x event_data=%016llx\n",
vmcs_read32(IDT_VECTORING_INFO_FIELD),
- vmcs_read32(IDT_VECTORING_ERROR_CODE));
+ vmcs_read32(IDT_VECTORING_ERROR_CODE),
+ kvm_cpu_cap_has(X86_FEATURE_FRED) ? vmcs_read64(ORIGINAL_EVENT_DATA) : 0);
pr_err("TSC Offset = 0x%016llx\n", vmcs_read64(TSC_OFFSET));
if (secondary_exec_control & SECONDARY_EXEC_TSC_SCALING)
pr_err("TSC Multiplier = 0x%016llx\n",
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 19/27] KVM: x86: Allow FRED/LKGS to be advertised to guests
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (17 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 18/27] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2024-10-01 5:01 ` [PATCH v3 20/27] KVM: x86: Allow WRMSRNS " Xin Li (Intel)
` (9 subsequent siblings)
28 siblings, 0 replies; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Allow FRED/LKGS to be advertised to guests after the changes required to
enable FRED in a KVM guest are in place.
LKGS is introduced with FRED to completely eliminate the need to execute
SWAPGS explicitly, because
1) FRED transitions ensure that an operating system can always operate
with its own GS base address.
2) LKGS behaves like the MOV to GS instruction except that it loads
the base address into the IA32_KERNEL_GS_BASE MSR instead of the
GS segment’s descriptor cache, which is exactly what the Linux kernel
does to load a user-level GS base. Thus there is no need to SWAPGS
away from the kernel GS base, and executing SWAPGS causes #UD when
FRED transitions are enabled.
A FRED CPU must enumerate LKGS. When LKGS is not available, FRED must
not be enabled.
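A sketch of the LKGS usage this enables (illustrative wrapper, not from
this patch; requires an assembler that knows the mnemonic):

/*
 * LKGS loads a selector like MOV to GS, but the resulting base is
 * written to IA32_KERNEL_GS_BASE instead of the GS descriptor cache,
 * so the kernel never has to SWAPGS around a user GS base load.
 */
static inline void lkgs_sketch(u16 sel)
{
	asm volatile("lkgs %0" : : "Rm" (sel));
}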
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/cpuid.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 41786b834b16..c55d150ece8d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -699,7 +699,7 @@ void kvm_set_cpu_caps(void)
kvm_cpu_cap_mask(CPUID_7_1_EAX,
F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
- F(FZRM) | F(FSRS) | F(FSRC) |
+ F(FZRM) | F(FSRS) | F(FSRC) | F(FRED) | F(LKGS) |
F(AMX_FP16) | F(AVX_IFMA) | F(LAM)
);
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 20/27] KVM: x86: Allow WRMSRNS to be advertised to guests
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (18 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 19/27] KVM: x86: Allow FRED/LKGS to be advertised to guests Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2025-02-25 15:41 ` Sean Christopherson
2024-10-01 5:01 ` [PATCH v3 21/27] KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup Xin Li (Intel)
` (8 subsequent siblings)
28 siblings, 1 reply; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Allow WRMSRNS to be advertised to guests.
WRMSRNS behaves exactly like WRMSR, with the only difference being
that it is not a serializing instruction by default. It improves
performance when used in a hot path, e.g., setting FRED RSP0.
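For illustration, WRMSRNS uses the same implicit register interface as
WRMSR (ECX = MSR index, EDX:EAX = value); a sketch of a wrapper,
assuming assembler support for the mnemonic:

static inline void wrmsrns_sketch(u32 msr, u64 val)
{
	/* Non-serializing MSR write; same operands as WRMSR. */
	asm volatile("wrmsrns"
		     : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
		     : "memory");
}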
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/cpuid.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c55d150ece8d..63a78ebf9482 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -700,7 +700,7 @@ void kvm_set_cpu_caps(void)
kvm_cpu_cap_mask(CPUID_7_1_EAX,
F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
F(FZRM) | F(FSRS) | F(FSRC) | F(FRED) | F(LKGS) |
- F(AMX_FP16) | F(AVX_IFMA) | F(LAM)
+ F(WRMSRNS) | F(AMX_FP16) | F(AVX_IFMA) | F(LAM)
);
kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 21/27] KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (19 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 20/27] KVM: x86: Allow WRMSRNS " Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2024-10-24 7:49 ` Chao Gao
2024-10-01 5:01 ` [PATCH v3 22/27] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
` (7 subsequent siblings)
28 siblings, 1 reply; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Set VMX CPU capabilities before initializing nested support instead of
after, as nested setup needs to check VMX CPU capabilities to set up
the VMX basic MSR.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/vmx.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ef807194ccbd..522ee27a4655 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8774,6 +8774,12 @@ __init int vmx_hardware_setup(void)
setup_default_sgx_lepubkeyhash();
+ /*
+ * VMX CPU capabilities are required to setup the VMX basic MSR for
+ * nested, so this must be done before nested_vmx_setup_ctls_msrs().
+ */
+ vmx_set_cpu_caps();
+
if (nested) {
nested_vmx_setup_ctls_msrs(&vmcs_config, vmx_capability.ept);
@@ -8782,8 +8788,6 @@ __init int vmx_hardware_setup(void)
return r;
}
- vmx_set_cpu_caps();
-
r = alloc_kvm_area();
if (r && nested)
nested_vmx_hardware_unsetup();
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 22/27] KVM: nVMX: Add support for the secondary VM exit controls
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (20 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 21/27] KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2024-10-01 5:01 ` [PATCH v3 23/27] KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros Xin Li (Intel)
` (6 subsequent siblings)
28 siblings, 0 replies; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Enable the secondary VM exit controls to prepare for nested FRED.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change since v2:
* Read secondary VM exit controls from vmcs_conf instead of the hardware
MSR MSR_IA32_VMX_EXIT_CTLS2 to avoid advertising features to L1 that KVM
itself doesn't support, e.g. because the expected entry+exit pairs aren't
supported. (Sean Christopherson)
---
Documentation/virt/kvm/x86/nested-vmx.rst | 1 +
arch/x86/kvm/vmx/capabilities.h | 1 +
arch/x86/kvm/vmx/nested.c | 21 ++++++++++++++++++++-
arch/x86/kvm/vmx/vmcs12.c | 1 +
arch/x86/kvm/vmx/vmcs12.h | 2 ++
arch/x86/kvm/x86.h | 2 +-
6 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index ac2095d41f02..e64ef231f310 100644
--- a/Documentation/virt/kvm/x86/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -217,6 +217,7 @@ struct shadow_vmcs is ever changed.
u16 host_fs_selector;
u16 host_gs_selector;
u16 host_tr_selector;
+ u64 secondary_vm_exit_controls;
};
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 2962a3bb9747..c96e6cb18c9a 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -38,6 +38,7 @@ struct nested_vmx_msrs {
u32 pinbased_ctls_high;
u32 exit_ctls_low;
u32 exit_ctls_high;
+ u64 secondary_exit_ctls;
u32 entry_ctls_low;
u32 entry_ctls_high;
u32 misc_low;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a8e7bc04d9bf..42e43eb7561f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1454,6 +1454,7 @@ int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
case MSR_IA32_VMX_PINBASED_CTLS:
case MSR_IA32_VMX_PROCBASED_CTLS:
case MSR_IA32_VMX_EXIT_CTLS:
+ case MSR_IA32_VMX_EXIT_CTLS2:
case MSR_IA32_VMX_ENTRY_CTLS:
/*
* The "non-true" VMX capability MSRs are generated from the
@@ -1532,6 +1533,9 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
if (msr_index == MSR_IA32_VMX_EXIT_CTLS)
*pdata |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
break;
+ case MSR_IA32_VMX_EXIT_CTLS2:
+ *pdata = msrs->secondary_exit_ctls;
+ break;
case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
case MSR_IA32_VMX_ENTRY_CTLS:
*pdata = vmx_control_msr(
@@ -2471,6 +2475,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
exec_control &= ~VM_EXIT_LOAD_IA32_EFER;
vm_exit_controls_set(vmx, exec_control);
+ if (exec_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
+ exec_control = __secondary_vm_exit_controls_get(vmcs01);
+ secondary_vm_exit_controls_set(vmx, exec_control);
+ }
+
/*
* Interrupt/Exception Fields
*/
@@ -6956,7 +6965,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
VM_EXIT_HOST_ADDR_SPACE_SIZE |
#endif
VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
- VM_EXIT_CLEAR_BNDCFGS;
+ VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
msrs->exit_ctls_high |=
VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6965,6 +6974,16 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
/* We support free control of debug control saving. */
msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
+
+ if (msrs->exit_ctls_high & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
+ msrs->secondary_exit_ctls = vmcs_conf->secondary_vmexit_ctrl;
+ /*
+ * As the secondary VM exit control is always loaded, do not
+ * advertise any feature in it to nVMX until its nVMX support
+ * is ready.
+ */
+ msrs->secondary_exit_ctls &= 0;
+ }
}
static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..98457d7b2b23 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -73,6 +73,7 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match),
FIELD(CR3_TARGET_COUNT, cr3_target_count),
FIELD(VM_EXIT_CONTROLS, vm_exit_controls),
+ FIELD(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls),
FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count),
FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count),
FIELD(VM_ENTRY_CONTROLS, vm_entry_controls),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 56fd150a6f24..1fe3ed9108aa 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -185,6 +185,7 @@ struct __packed vmcs12 {
u16 host_gs_selector;
u16 host_tr_selector;
u16 guest_pml_index;
+ u64 secondary_vm_exit_controls;
};
/*
@@ -360,6 +361,7 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_gs_selector, 992);
CHECK_OFFSET(host_tr_selector, 994);
CHECK_OFFSET(guest_pml_index, 996);
+ CHECK_OFFSET(secondary_vm_exit_controls, 998);
}
extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 0ed91512b757..890b7a6554d5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -66,7 +66,7 @@ void kvm_spurious_fault(void);
* associated feature that KVM supports for nested virtualization.
*/
#define KVM_FIRST_EMULATED_VMX_MSR MSR_IA32_VMX_BASIC
-#define KVM_LAST_EMULATED_VMX_MSR MSR_IA32_VMX_VMFUNC
+#define KVM_LAST_EMULATED_VMX_MSR MSR_IA32_VMX_EXIT_CTLS2
#define KVM_DEFAULT_PLE_GAP 128
#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 23/27] KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (21 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 22/27] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2024-10-01 5:01 ` [PATCH v3 24/27] KVM: nVMX: Add a prerequisite to existence of VMCS fields Xin Li (Intel)
` (5 subsequent siblings)
28 siblings, 0 replies; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Add a prerequisite for accessing the VMCS fields referenced in the
SHADOW_FIELD_R[OW] macros, because a VMCS field may not exist on some CPUs.
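Concretely, each entry now carries its own existence check as a third
macro argument, so e.g. the GUEST_PML_INDEX entry expands inside
init_vmcs_shadow_fields() roughly to:

switch (field) {
case GUEST_PML_INDEX:
	if (!cpu_has_vmx_pml())
		continue;	/* field doesn't exist on bare metal, skip */
	break;
default:
	break;
}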
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change since v2:
* Add __SHADOW_FIELD_R[OW] for better readability and maintainability (Sean).
---
arch/x86/kvm/vmx/nested.c | 79 +++++++++++++++++++--------
arch/x86/kvm/vmx/vmcs_shadow_fields.h | 33 ++++++++---
2 files changed, 79 insertions(+), 33 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 42e43eb7561f..7f3ac558ace5 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -54,14 +54,14 @@ struct shadow_vmcs_field {
u16 offset;
};
static struct shadow_vmcs_field shadow_read_only_fields[] = {
-#define SHADOW_FIELD_RO(x, y) { x, offsetof(struct vmcs12, y) },
+#define __SHADOW_FIELD_RO(x, y, c) { x, offsetof(struct vmcs12, y) },
#include "vmcs_shadow_fields.h"
};
static int max_shadow_read_only_fields =
ARRAY_SIZE(shadow_read_only_fields);
static struct shadow_vmcs_field shadow_read_write_fields[] = {
-#define SHADOW_FIELD_RW(x, y) { x, offsetof(struct vmcs12, y) },
+#define __SHADOW_FIELD_RW(x, y, c) { x, offsetof(struct vmcs12, y) },
#include "vmcs_shadow_fields.h"
};
static int max_shadow_read_write_fields =
@@ -84,6 +84,17 @@ static void init_vmcs_shadow_fields(void)
pr_err("Missing field from shadow_read_only_field %x\n",
field + 1);
+ switch (field) {
+#define __SHADOW_FIELD_RO(x, y, c) \
+ case x: \
+ if (!(c)) \
+ continue; \
+ break;
+#include "vmcs_shadow_fields.h"
+ default:
+ break;
+ }
+
clear_bit(field, vmx_vmread_bitmap);
if (field & 1)
#ifdef CONFIG_X86_64
@@ -109,24 +120,13 @@ static void init_vmcs_shadow_fields(void)
field <= GUEST_TR_AR_BYTES,
"Update vmcs12_write_any() to drop reserved bits from AR_BYTES");
- /*
- * PML and the preemption timer can be emulated, but the
- * processor cannot vmwrite to fields that don't exist
- * on bare metal.
- */
switch (field) {
- case GUEST_PML_INDEX:
- if (!cpu_has_vmx_pml())
- continue;
- break;
- case VMX_PREEMPTION_TIMER_VALUE:
- if (!cpu_has_vmx_preemption_timer())
- continue;
- break;
- case GUEST_INTR_STATUS:
- if (!cpu_has_vmx_apicv())
- continue;
+#define __SHADOW_FIELD_RW(x, y, c) \
+ case x: \
+ if (!(c)) \
+ continue; \
break;
+#include "vmcs_shadow_fields.h"
default:
break;
}
@@ -1586,8 +1586,8 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
/*
* Copy the writable VMCS shadow fields back to the VMCS12, in case they have
* been modified by the L1 guest. Note, "writable" in this context means
- * "writable by the guest", i.e. tagged SHADOW_FIELD_RW; the set of
- * fields tagged SHADOW_FIELD_RO may or may not align with the "read-only"
+ * "writable by the guest", i.e. tagged __SHADOW_FIELD_RW; the set of
+ * fields tagged __SHADOW_FIELD_RO may or may not align with the "read-only"
* VM-exit information fields (which are actually writable if the vCPU is
* configured to support "VMWRITE to any supported field in the VMCS").
*/
@@ -1608,6 +1608,18 @@ static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
for (i = 0; i < max_shadow_read_write_fields; i++) {
field = shadow_read_write_fields[i];
+
+ switch (field.encoding) {
+#define __SHADOW_FIELD_RW(x, y, c) \
+ case x: \
+ if (!(c)) \
+ continue; \
+ break;
+#include "vmcs_shadow_fields.h"
+ default:
+ break;
+ }
+
val = __vmcs_readl(field.encoding);
vmcs12_write_any(vmcs12, field.encoding, field.offset, val);
}
@@ -1642,6 +1654,23 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx)
for (q = 0; q < ARRAY_SIZE(fields); q++) {
for (i = 0; i < max_fields[q]; i++) {
field = fields[q][i];
+
+ switch (field.encoding) {
+#define __SHADOW_FIELD_RO(x, y, c) \
+ case x: \
+ if (!(c)) \
+ continue; \
+ break;
+#define __SHADOW_FIELD_RW(x, y, c) \
+ case x: \
+ if (!(c)) \
+ continue; \
+ break;
+#include "vmcs_shadow_fields.h"
+ default:
+ break;
+ }
+
val = vmcs12_read_any(vmcs12, field.encoding,
field.offset);
__vmcs_writel(field.encoding, val);
@@ -5590,9 +5619,10 @@ static int handle_vmread(struct kvm_vcpu *vcpu)
static bool is_shadow_field_rw(unsigned long field)
{
switch (field) {
-#define SHADOW_FIELD_RW(x, y) case x:
+#define __SHADOW_FIELD_RW(x, y, c) \
+ case x: \
+ return c;
#include "vmcs_shadow_fields.h"
- return true;
default:
break;
}
@@ -5602,9 +5632,10 @@ static bool is_shadow_field_rw(unsigned long field)
static bool is_shadow_field_ro(unsigned long field)
{
switch (field) {
-#define SHADOW_FIELD_RO(x, y) case x:
+#define __SHADOW_FIELD_RO(x, y, c) \
+ case x: \
+ return c;
#include "vmcs_shadow_fields.h"
- return true;
default:
break;
}
diff --git a/arch/x86/kvm/vmx/vmcs_shadow_fields.h b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
index cad128d1657b..53b64dce1309 100644
--- a/arch/x86/kvm/vmx/vmcs_shadow_fields.h
+++ b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
@@ -1,14 +1,17 @@
-#if !defined(SHADOW_FIELD_RO) && !defined(SHADOW_FIELD_RW)
+#if !defined(__SHADOW_FIELD_RO) && !defined(__SHADOW_FIELD_RW)
BUILD_BUG_ON(1)
#endif
-#ifndef SHADOW_FIELD_RO
-#define SHADOW_FIELD_RO(x, y)
+#ifndef __SHADOW_FIELD_RO
+#define __SHADOW_FIELD_RO(x, y, c)
#endif
-#ifndef SHADOW_FIELD_RW
-#define SHADOW_FIELD_RW(x, y)
+#ifndef __SHADOW_FIELD_RW
+#define __SHADOW_FIELD_RW(x, y, c)
#endif
+#define SHADOW_FIELD_RO(x, y) __SHADOW_FIELD_RO(x, y, true)
+#define SHADOW_FIELD_RW(x, y) __SHADOW_FIELD_RW(x, y, true)
+
/*
* We do NOT shadow fields that are modified when L0
* traps and emulates any vmx instruction (e.g. VMPTRLD,
@@ -32,8 +35,12 @@ BUILD_BUG_ON(1)
*/
/* 16-bits */
-SHADOW_FIELD_RW(GUEST_INTR_STATUS, guest_intr_status)
-SHADOW_FIELD_RW(GUEST_PML_INDEX, guest_pml_index)
+__SHADOW_FIELD_RW(GUEST_INTR_STATUS, guest_intr_status, cpu_has_vmx_apicv())
+/*
+ * PML can be emulated, but the processor cannot vmwrite to the VMCS field
+ * GUEST_PML_INDEX that doesn't exist on bare metal.
+ */
+__SHADOW_FIELD_RW(GUEST_PML_INDEX, guest_pml_index, cpu_has_vmx_pml())
SHADOW_FIELD_RW(HOST_FS_SELECTOR, host_fs_selector)
SHADOW_FIELD_RW(HOST_GS_SELECTOR, host_gs_selector)
@@ -41,9 +48,9 @@ SHADOW_FIELD_RW(HOST_GS_SELECTOR, host_gs_selector)
SHADOW_FIELD_RO(VM_EXIT_REASON, vm_exit_reason)
SHADOW_FIELD_RO(VM_EXIT_INTR_INFO, vm_exit_intr_info)
SHADOW_FIELD_RO(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len)
+SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code)
SHADOW_FIELD_RO(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field)
SHADOW_FIELD_RO(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code)
-SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code)
SHADOW_FIELD_RO(GUEST_CS_AR_BYTES, guest_cs_ar_bytes)
SHADOW_FIELD_RO(GUEST_SS_AR_BYTES, guest_ss_ar_bytes)
SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control)
@@ -54,7 +61,12 @@ SHADOW_FIELD_RW(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field)
SHADOW_FIELD_RW(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len)
SHADOW_FIELD_RW(TPR_THRESHOLD, tpr_threshold)
SHADOW_FIELD_RW(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info)
-SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE, vmx_preemption_timer_value)
+/*
+ * The preemption timer can be emulated, but the processor cannot vmwrite to
+ * the VMCS field VMX_PREEMPTION_TIMER_VALUE that doesn't exist on bare metal.
+ */
+__SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE, vmx_preemption_timer_value,
+ cpu_has_vmx_preemption_timer())
/* Natural width */
SHADOW_FIELD_RO(EXIT_QUALIFICATION, exit_qualification)
@@ -77,3 +89,6 @@ SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address)
#undef SHADOW_FIELD_RO
#undef SHADOW_FIELD_RW
+
+#undef __SHADOW_FIELD_RO
+#undef __SHADOW_FIELD_RW
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 24/27] KVM: nVMX: Add a prerequisite to existence of VMCS fields
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (22 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 23/27] KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2025-02-25 16:22 ` Sean Christopherson
2024-10-01 5:01 ` [PATCH v3 25/27] KVM: nVMX: Add FRED " Xin Li (Intel)
` (4 subsequent siblings)
28 siblings, 1 reply; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
Add a prerequisite check on the existence of VMCS fields, as some of
them exist only on processors that support certain CPU features.
This is required to fix the KVM unit test VMX_VMCS_ENUM.MAX_INDEX.
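The new header is intentionally empty in this patch (the FRED entries
arrive later in the series); a hypothetical entry would look like:

/* Hypothetical example, names not from this patch: reject VMREAD/VMWRITE
 * of the guest FRED fields when the vCPU cannot use FRED. */
HAS_VMCS_FIELD_RANGE(GUEST_IA32_FRED_CONFIG, GUEST_IA32_FRED_SSP3,
		     guest_can_use(vcpu, X86_FEATURE_FRED))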
Originally-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/nested.c | 19 +++++++++++++++++--
arch/x86/kvm/vmx/nested_vmcs_fields.h | 13 +++++++++++++
2 files changed, 30 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kvm/vmx/nested_vmcs_fields.h
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 7f3ac558ace5..4529fd635385 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -49,6 +49,21 @@ static unsigned long *vmx_bitmap[VMX_BITMAP_NR];
#define vmx_vmread_bitmap (vmx_bitmap[VMX_VMREAD_BITMAP])
#define vmx_vmwrite_bitmap (vmx_bitmap[VMX_VMWRITE_BITMAP])
+static bool nested_cpu_has_vmcs_field(struct kvm_vcpu *vcpu, u16 vmcs_field_encoding)
+{
+ switch (vmcs_field_encoding) {
+#define HAS_VMCS_FIELD(x, c) \
+ case x: \
+ return c;
+#define HAS_VMCS_FIELD_RANGE(x, y, c) \
+ case x...y: \
+ return c;
+#include "nested_vmcs_fields.h"
+ default:
+ return true;
+ }
+}
+
struct shadow_vmcs_field {
u16 encoding;
u16 offset;
@@ -5565,7 +5580,7 @@ static int handle_vmread(struct kvm_vcpu *vcpu)
return nested_vmx_failInvalid(vcpu);
offset = get_vmcs12_field_offset(field);
- if (offset < 0)
+ if (offset < 0 || !nested_cpu_has_vmcs_field(vcpu, field))
return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
if (!is_guest_mode(vcpu) && is_vmcs12_ext_field(field))
@@ -5691,7 +5706,7 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
field = kvm_register_read(vcpu, (((instr_info) >> 28) & 0xf));
offset = get_vmcs12_field_offset(field);
- if (offset < 0)
+ if (offset < 0 || !nested_cpu_has_vmcs_field(vcpu, field))
return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
/*
diff --git a/arch/x86/kvm/vmx/nested_vmcs_fields.h b/arch/x86/kvm/vmx/nested_vmcs_fields.h
new file mode 100644
index 000000000000..fcd6c32dce31
--- /dev/null
+++ b/arch/x86/kvm/vmx/nested_vmcs_fields.h
@@ -0,0 +1,13 @@
+#if !defined(HAS_VMCS_FIELD) && !defined(HAS_VMCS_FIELD_RANGE)
+BUILD_BUG_ON(1)
+#endif
+
+#ifndef HAS_VMCS_FIELD
+#define HAS_VMCS_FIELD(x, c)
+#endif
+#ifndef HAS_VMCS_FIELD_RANGE
+#define HAS_VMCS_FIELD_RANGE(x, y, c)
+#endif
+
+#undef HAS_VMCS_FIELD
+#undef HAS_VMCS_FIELD_RANGE
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 25/27] KVM: nVMX: Add FRED VMCS fields
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (23 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 24/27] KVM: nVMX: Add a prerequisite to existence of VMCS fields Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2024-10-24 7:42 ` Chao Gao
2024-10-01 5:01 ` [PATCH v3 26/27] KVM: nVMX: Add VMCS FRED states checking Xin Li (Intel)
` (3 subsequent siblings)
28 siblings, 1 reply; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Add FRED VMCS fields to nested VMX context management.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change since v2:
* Add and use nested_cpu_has_fred(vmcs12) because vmcs02 should be set
from vmcs12 if and only if the field is enabled in L1's VMX config
(Sean Christopherson).
* Fix coding style (Sean Christopherson).
Change since v1:
* Remove hyperv TLFS related changes (Jeremi Piotrowski).
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
---
Documentation/virt/kvm/x86/nested-vmx.rst | 18 +++++
arch/x86/kvm/vmx/nested.c | 88 ++++++++++++++++++-----
arch/x86/kvm/vmx/nested.h | 8 +++
arch/x86/kvm/vmx/nested_vmcs_fields.h | 12 ++++
arch/x86/kvm/vmx/vmcs12.c | 18 +++++
arch/x86/kvm/vmx/vmcs12.h | 36 ++++++++++
arch/x86/kvm/vmx/vmcs_shadow_fields.h | 4 ++
7 files changed, 168 insertions(+), 16 deletions(-)
diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index e64ef231f310..87fa9f3877ab 100644
--- a/Documentation/virt/kvm/x86/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -218,6 +218,24 @@ struct shadow_vmcs is ever changed.
u16 host_gs_selector;
u16 host_tr_selector;
u64 secondary_vm_exit_controls;
+ u64 guest_ia32_fred_config;
+ u64 guest_ia32_fred_rsp1;
+ u64 guest_ia32_fred_rsp2;
+ u64 guest_ia32_fred_rsp3;
+ u64 guest_ia32_fred_stklvls;
+ u64 guest_ia32_fred_ssp1;
+ u64 guest_ia32_fred_ssp2;
+ u64 guest_ia32_fred_ssp3;
+ u64 host_ia32_fred_config;
+ u64 host_ia32_fred_rsp1;
+ u64 host_ia32_fred_rsp2;
+ u64 host_ia32_fred_rsp3;
+ u64 host_ia32_fred_stklvls;
+ u64 host_ia32_fred_ssp1;
+ u64 host_ia32_fred_ssp2;
+ u64 host_ia32_fred_ssp3;
+ u64 injected_event_data;
+ u64 original_event_data;
};
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 4529fd635385..45a5ffa51e60 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -719,6 +719,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_FRED_RSP0, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_FRED_SSP0, MSR_TYPE_RW);
#endif
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
@@ -1268,9 +1274,11 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
{
const u64 feature_bits = VMX_BASIC_DUAL_MONITOR_TREATMENT |
VMX_BASIC_INOUT |
- VMX_BASIC_TRUE_CTLS;
+ VMX_BASIC_TRUE_CTLS |
+ VMX_BASIC_NESTED_EXCEPTION;
- const u64 reserved_bits = GENMASK_ULL(63, 56) |
+ const u64 reserved_bits = GENMASK_ULL(63, 59) |
+ GENMASK_ULL(57, 56) |
GENMASK_ULL(47, 45) |
BIT_ULL(31);
@@ -2536,6 +2544,8 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
vmcs12->vm_entry_instruction_len);
vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
vmcs12->guest_interruptibility_info);
+ if (nested_cpu_has_fred(vmcs12))
+ vmcs_write64(INJECTED_EVENT_DATA, vmcs12->injected_event_data);
vmx->loaded_vmcs->nmi_known_unmasked =
!(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_NMI);
} else {
@@ -2588,6 +2598,17 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
vmx_segment_cache_clear(vmx);
+
+ if (nested_cpu_has_fred(vmcs12)) {
+ vmcs_write64(GUEST_IA32_FRED_CONFIG, vmcs12->guest_ia32_fred_config);
+ vmcs_write64(GUEST_IA32_FRED_RSP1, vmcs12->guest_ia32_fred_rsp1);
+ vmcs_write64(GUEST_IA32_FRED_RSP2, vmcs12->guest_ia32_fred_rsp2);
+ vmcs_write64(GUEST_IA32_FRED_RSP3, vmcs12->guest_ia32_fred_rsp3);
+ vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmcs12->guest_ia32_fred_stklvls);
+ vmcs_write64(GUEST_IA32_FRED_SSP1, vmcs12->guest_ia32_fred_ssp1);
+ vmcs_write64(GUEST_IA32_FRED_SSP2, vmcs12->guest_ia32_fred_ssp2);
+ vmcs_write64(GUEST_IA32_FRED_SSP3, vmcs12->guest_ia32_fred_ssp3);
+ }
}
if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
@@ -3881,6 +3902,8 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
u32 idt_vectoring;
unsigned int nr;
+ vmcs12->original_event_data = 0;
+
/*
* Per the SDM, VM-Exits due to double and triple faults are never
* considered to occur during event delivery, even if the double/triple
@@ -3919,6 +3942,11 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
vcpu->arch.exception.error_code;
}
+ if (vcpu->arch.exception.nested)
+ idt_vectoring |= INTR_INFO_NESTED_EXCEPTION_MASK;
+
+ vmcs12->original_event_data = vcpu->arch.exception.event_data;
+
vmcs12->idt_vectoring_info_field = idt_vectoring;
} else if (vcpu->arch.nmi_injected) {
vmcs12->idt_vectoring_info_field =
@@ -4009,19 +4037,10 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu)
struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
- unsigned long exit_qual;
+ unsigned long exit_qual = 0;
- if (ex->has_payload) {
- exit_qual = ex->payload;
- } else if (ex->vector == PF_VECTOR) {
- exit_qual = vcpu->arch.cr2;
- } else if (ex->vector == DB_VECTOR) {
- exit_qual = vcpu->arch.dr6;
- exit_qual &= ~DR6_BT;
- exit_qual ^= DR6_ACTIVE_LOW;
- } else {
- exit_qual = 0;
- }
+ if (ex->vector != NM_VECTOR)
+ exit_qual = ex->event_data;
/*
* Unlike AMD's Paged Real Mode, which reports an error code on #PF
@@ -4042,10 +4061,13 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu)
intr_info |= INTR_INFO_DELIVER_CODE_MASK;
}
- if (kvm_exception_is_soft(ex->vector))
+ if (kvm_exception_is_soft(ex->vector)) {
intr_info |= INTR_TYPE_SOFT_EXCEPTION;
- else
+ } else {
intr_info |= INTR_TYPE_HARD_EXCEPTION;
+ if (ex->nested)
+ intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
+ }
if (!(vmcs12->idt_vectoring_info_field & VECTORING_INFO_VALID_MASK) &&
vmx_get_nmi_mask(vcpu))
@@ -4468,6 +4490,14 @@ static bool is_vmcs12_ext_field(unsigned long field)
case GUEST_TR_BASE:
case GUEST_GDTR_BASE:
case GUEST_IDTR_BASE:
+ case GUEST_IA32_FRED_CONFIG:
+ case GUEST_IA32_FRED_RSP1:
+ case GUEST_IA32_FRED_RSP2:
+ case GUEST_IA32_FRED_RSP3:
+ case GUEST_IA32_FRED_STKLVLS:
+ case GUEST_IA32_FRED_SSP1:
+ case GUEST_IA32_FRED_SSP2:
+ case GUEST_IA32_FRED_SSP3:
case GUEST_PENDING_DBG_EXCEPTIONS:
case GUEST_BNDCFGS:
return true;
@@ -4517,6 +4547,18 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+ if (nested_cpu_has_fred(vmcs12)) {
+ vmcs12->guest_ia32_fred_config = vmcs_read64(GUEST_IA32_FRED_CONFIG);
+ vmcs12->guest_ia32_fred_rsp1 = vmcs_read64(GUEST_IA32_FRED_RSP1);
+ vmcs12->guest_ia32_fred_rsp2 = vmcs_read64(GUEST_IA32_FRED_RSP2);
+ vmcs12->guest_ia32_fred_rsp3 = vmcs_read64(GUEST_IA32_FRED_RSP3);
+ vmcs12->guest_ia32_fred_stklvls = vmcs_read64(GUEST_IA32_FRED_STKLVLS);
+ vmcs12->guest_ia32_fred_ssp1 = vmcs_read64(GUEST_IA32_FRED_SSP1);
+ vmcs12->guest_ia32_fred_ssp2 = vmcs_read64(GUEST_IA32_FRED_SSP2);
+ vmcs12->guest_ia32_fred_ssp3 = vmcs_read64(GUEST_IA32_FRED_SSP3);
+ }
+
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
@@ -4741,6 +4783,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
vmcs_write32(GUEST_IDTR_LIMIT, 0xFFFF);
vmcs_write32(GUEST_GDTR_LIMIT, 0xFFFF);
+ if (nested_cpu_has_fred(vmcs12)) {
+ vmcs_write64(GUEST_IA32_FRED_CONFIG, vmcs12->host_ia32_fred_config);
+ vmcs_write64(GUEST_IA32_FRED_RSP1, vmcs12->host_ia32_fred_rsp1);
+ vmcs_write64(GUEST_IA32_FRED_RSP2, vmcs12->host_ia32_fred_rsp2);
+ vmcs_write64(GUEST_IA32_FRED_RSP3, vmcs12->host_ia32_fred_rsp3);
+ vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmcs12->host_ia32_fred_stklvls);
+ vmcs_write64(GUEST_IA32_FRED_SSP1, vmcs12->host_ia32_fred_ssp1);
+ vmcs_write64(GUEST_IA32_FRED_SSP2, vmcs12->host_ia32_fred_ssp2);
+ vmcs_write64(GUEST_IA32_FRED_SSP3, vmcs12->host_ia32_fred_ssp3);
+ }
+
/* If not VM_EXIT_CLEAR_BNDCFGS, the L2 value propagates to L1. */
if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
vmcs_write64(GUEST_BNDCFGS, 0);
@@ -7197,6 +7250,9 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
msrs->basic |= VMX_BASIC_TRUE_CTLS;
if (cpu_has_vmx_basic_inout())
msrs->basic |= VMX_BASIC_INOUT;
+
+ if (cpu_has_vmx_fred())
+ msrs->basic |= VMX_BASIC_NESTED_EXCEPTION;
}
static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index 2c296b6abb8c..5272f617fcef 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -251,6 +251,14 @@ static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
}
+static inline bool nested_cpu_has_fred(struct vmcs12 *vmcs12)
+{
+ return vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED &&
+ vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
+ vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
+ vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED;
+}
+
/*
* if fixed0[i] == 1: val[i] must be 1
* if fixed1[i] == 0: val[i] must be 0
diff --git a/arch/x86/kvm/vmx/nested_vmcs_fields.h b/arch/x86/kvm/vmx/nested_vmcs_fields.h
index fcd6c32dce31..dea22279d008 100644
--- a/arch/x86/kvm/vmx/nested_vmcs_fields.h
+++ b/arch/x86/kvm/vmx/nested_vmcs_fields.h
@@ -9,5 +9,17 @@ BUILD_BUG_ON(1)
#define HAS_VMCS_FIELD_RANGE(x, y, c)
#endif
+HAS_VMCS_FIELD(SECONDARY_VM_EXIT_CONTROLS, guest_can_use(vcpu, X86_FEATURE_FRED))
+HAS_VMCS_FIELD(SECONDARY_VM_EXIT_CONTROLS_HIGH, guest_can_use(vcpu, X86_FEATURE_FRED))
+
+HAS_VMCS_FIELD_RANGE(GUEST_IA32_FRED_CONFIG, GUEST_IA32_FRED_SSP3, guest_can_use(vcpu, X86_FEATURE_FRED))
+HAS_VMCS_FIELD_RANGE(HOST_IA32_FRED_CONFIG, HOST_IA32_FRED_SSP3, guest_can_use(vcpu, X86_FEATURE_FRED))
+
+HAS_VMCS_FIELD(INJECTED_EVENT_DATA, guest_can_use(vcpu, X86_FEATURE_FRED))
+HAS_VMCS_FIELD(INJECTED_EVENT_DATA_HIGH, guest_can_use(vcpu, X86_FEATURE_FRED))
+
+HAS_VMCS_FIELD(ORIGINAL_EVENT_DATA, guest_can_use(vcpu, X86_FEATURE_FRED))
+HAS_VMCS_FIELD(ORIGINAL_EVENT_DATA_HIGH, guest_can_use(vcpu, X86_FEATURE_FRED))
+
#undef HAS_VMCS_FIELD
#undef HAS_VMCS_FIELD_RANGE
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 98457d7b2b23..59f17fdfad11 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -80,6 +80,7 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count),
FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field),
FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE, vm_entry_exception_error_code),
+ FIELD(INJECTED_EVENT_DATA, injected_event_data),
FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len),
FIELD(TPR_THRESHOLD, tpr_threshold),
FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control),
@@ -89,6 +90,7 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code),
FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field),
FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code),
+ FIELD(ORIGINAL_EVENT_DATA, original_event_data),
FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len),
FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info),
FIELD(GUEST_ES_LIMIT, guest_es_limit),
@@ -152,5 +154,21 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
FIELD(HOST_RSP, host_rsp),
FIELD(HOST_RIP, host_rip),
+ FIELD(GUEST_IA32_FRED_CONFIG, guest_ia32_fred_config),
+ FIELD(GUEST_IA32_FRED_RSP1, guest_ia32_fred_rsp1),
+ FIELD(GUEST_IA32_FRED_RSP2, guest_ia32_fred_rsp2),
+ FIELD(GUEST_IA32_FRED_RSP3, guest_ia32_fred_rsp3),
+ FIELD(GUEST_IA32_FRED_STKLVLS, guest_ia32_fred_stklvls),
+ FIELD(GUEST_IA32_FRED_SSP1, guest_ia32_fred_ssp1),
+ FIELD(GUEST_IA32_FRED_SSP2, guest_ia32_fred_ssp2),
+ FIELD(GUEST_IA32_FRED_SSP3, guest_ia32_fred_ssp3),
+ FIELD(HOST_IA32_FRED_CONFIG, host_ia32_fred_config),
+ FIELD(HOST_IA32_FRED_RSP1, host_ia32_fred_rsp1),
+ FIELD(HOST_IA32_FRED_RSP2, host_ia32_fred_rsp2),
+ FIELD(HOST_IA32_FRED_RSP3, host_ia32_fred_rsp3),
+ FIELD(HOST_IA32_FRED_STKLVLS, host_ia32_fred_stklvls),
+ FIELD(HOST_IA32_FRED_SSP1, host_ia32_fred_ssp1),
+ FIELD(HOST_IA32_FRED_SSP2, host_ia32_fred_ssp2),
+ FIELD(HOST_IA32_FRED_SSP3, host_ia32_fred_ssp3),
};
const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 1fe3ed9108aa..f2a33d7007c9 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -186,6 +186,24 @@ struct __packed vmcs12 {
u16 host_tr_selector;
u16 guest_pml_index;
u64 secondary_vm_exit_controls;
+ u64 guest_ia32_fred_config;
+ u64 guest_ia32_fred_rsp1;
+ u64 guest_ia32_fred_rsp2;
+ u64 guest_ia32_fred_rsp3;
+ u64 guest_ia32_fred_stklvls;
+ u64 guest_ia32_fred_ssp1;
+ u64 guest_ia32_fred_ssp2;
+ u64 guest_ia32_fred_ssp3;
+ u64 host_ia32_fred_config;
+ u64 host_ia32_fred_rsp1;
+ u64 host_ia32_fred_rsp2;
+ u64 host_ia32_fred_rsp3;
+ u64 host_ia32_fred_stklvls;
+ u64 host_ia32_fred_ssp1;
+ u64 host_ia32_fred_ssp2;
+ u64 host_ia32_fred_ssp3;
+ u64 injected_event_data;
+ u64 original_event_data;
};
/*
@@ -362,6 +380,24 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_tr_selector, 994);
CHECK_OFFSET(guest_pml_index, 996);
CHECK_OFFSET(secondary_vm_exit_controls, 998);
+ CHECK_OFFSET(guest_ia32_fred_config, 1006);
+ CHECK_OFFSET(guest_ia32_fred_rsp1, 1014);
+ CHECK_OFFSET(guest_ia32_fred_rsp2, 1022);
+ CHECK_OFFSET(guest_ia32_fred_rsp3, 1030);
+ CHECK_OFFSET(guest_ia32_fred_stklvls, 1038);
+ CHECK_OFFSET(guest_ia32_fred_ssp1, 1046);
+ CHECK_OFFSET(guest_ia32_fred_ssp2, 1054);
+ CHECK_OFFSET(guest_ia32_fred_ssp3, 1062);
+ CHECK_OFFSET(host_ia32_fred_config, 1070);
+ CHECK_OFFSET(host_ia32_fred_rsp1, 1078);
+ CHECK_OFFSET(host_ia32_fred_rsp2, 1086);
+ CHECK_OFFSET(host_ia32_fred_rsp3, 1094);
+ CHECK_OFFSET(host_ia32_fred_stklvls, 1102);
+ CHECK_OFFSET(host_ia32_fred_ssp1, 1110);
+ CHECK_OFFSET(host_ia32_fred_ssp2, 1118);
+ CHECK_OFFSET(host_ia32_fred_ssp3, 1126);
+ CHECK_OFFSET(injected_event_data, 1134);
+ CHECK_OFFSET(original_event_data, 1142);
}
extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/vmx/vmcs_shadow_fields.h b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
index 53b64dce1309..607945ada35f 100644
--- a/arch/x86/kvm/vmx/vmcs_shadow_fields.h
+++ b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
@@ -86,6 +86,10 @@ SHADOW_FIELD_RW(HOST_GS_BASE, host_gs_base)
/* 64-bit */
SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, guest_physical_address)
SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address)
+__SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA, original_event_data, cpu_has_vmx_fred())
+__SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA_HIGH, original_event_data, cpu_has_vmx_fred())
+__SHADOW_FIELD_RW(INJECTED_EVENT_DATA, injected_event_data, cpu_has_vmx_fred())
+__SHADOW_FIELD_RW(INJECTED_EVENT_DATA_HIGH, injected_event_data, cpu_has_vmx_fred())
#undef SHADOW_FIELD_RO
#undef SHADOW_FIELD_RW
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 26/27] KVM: nVMX: Add VMCS FRED states checking
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (24 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 25/27] KVM: nVMX: Add FRED " Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2024-10-01 5:01 ` [PATCH v3 27/27] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
` (2 subsequent siblings)
28 siblings, 0 replies; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Like real hardware, nested VMX performs consistency checks on various
VMCS fields, including both controls and guest/host states. Add the
FRED-related VMCS field checks with the addition of nested FRED.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/nested.c | 80 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 79 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 45a5ffa51e60..1fbdeea32c98 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2975,6 +2975,8 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ bool fred_enabled = (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) &&
+ (vmcs12->guest_cr4 & X86_CR4_FRED);
if (CC(!vmx_control_verify(vmcs12->vm_entry_controls,
vmx->nested.msrs.entry_ctls_low,
@@ -2993,6 +2995,7 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
u32 intr_type = intr_info & INTR_INFO_INTR_TYPE_MASK;
bool has_error_code = intr_info & INTR_INFO_DELIVER_CODE_MASK;
bool should_have_error_code;
+ bool has_nested_exception = vmx->nested.msrs.basic & VMX_BASIC_NESTED_EXCEPTION;
bool urg = nested_cpu_has2(vmcs12,
SECONDARY_EXEC_UNRESTRICTED_GUEST);
bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
@@ -3006,7 +3009,9 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
/* VM-entry interruption-info field: vector */
if (CC(intr_type == INTR_TYPE_NMI_INTR && vector != NMI_VECTOR) ||
CC(intr_type == INTR_TYPE_HARD_EXCEPTION && vector > 31) ||
- CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
+ CC(intr_type == INTR_TYPE_OTHER_EVENT &&
+ ((!fred_enabled && vector > 0) ||
+ (fred_enabled && vector > 2))))
return -EINVAL;
/* VM-entry interruption-info field: deliver error code */
@@ -3025,6 +3030,15 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
if (CC(intr_info & INTR_INFO_RESVD_BITS_MASK))
return -EINVAL;
+ /*
+ * When the CPU enumerates VMX nested-exception support, bit 13
+ * (set to indicate a nested exception) of the intr info field
+ * may have value 1. Otherwise bit 13 is reserved.
+ */
+ if (CC(!has_nested_exception &&
+ (intr_info & INTR_INFO_NESTED_EXCEPTION_MASK)))
+ return -EINVAL;
+
/* VM-entry instruction length */
switch (intr_type) {
case INTR_TYPE_SOFT_EXCEPTION:
@@ -3034,6 +3048,12 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
CC(vmcs12->vm_entry_instruction_len == 0 &&
CC(!nested_cpu_has_zero_length_injection(vcpu))))
return -EINVAL;
+ break;
+ case INTR_TYPE_OTHER_EVENT:
+ if (fred_enabled && (vector == 1 || vector == 2))
+ if (CC(vmcs12->vm_entry_instruction_len > 15))
+ return -EINVAL;
+ break;
}
}
@@ -3096,9 +3116,30 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
if (ia32e) {
if (CC(!(vmcs12->host_cr4 & X86_CR4_PAE)))
return -EINVAL;
+ if (vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
+ vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED) {
+ /* Bit 11, bits 5:4, and bit 2 of the IA32_FRED_CONFIG must be zero */
+ if (CC(vmcs12->host_ia32_fred_config &
+ (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2))) ||
+ CC(vmcs12->host_ia32_fred_rsp1 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->host_ia32_fred_rsp2 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->host_ia32_fred_rsp3 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->host_ia32_fred_ssp1 & GENMASK_ULL(2, 0)) ||
+ CC(vmcs12->host_ia32_fred_ssp2 & GENMASK_ULL(2, 0)) ||
+ CC(vmcs12->host_ia32_fred_ssp3 & GENMASK_ULL(2, 0)) ||
+ CC(is_noncanonical_address(vmcs12->host_ia32_fred_config & PAGE_MASK, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->host_ia32_fred_rsp1, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->host_ia32_fred_rsp2, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->host_ia32_fred_rsp3, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->host_ia32_fred_ssp1, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->host_ia32_fred_ssp2, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->host_ia32_fred_ssp3, vcpu)))
+ return -EINVAL;
+ }
} else {
if (CC(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) ||
CC(vmcs12->host_cr4 & X86_CR4_PCIDE) ||
+ CC(vmcs12->host_cr4 & X86_CR4_FRED) ||
CC((vmcs12->host_rip) >> 32))
return -EINVAL;
}
@@ -3242,6 +3283,43 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
CC((vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD))))
return -EINVAL;
+ if (ia32e) {
+ if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED) {
+ /* Bit 11, bits 5:4, and bit 2 of the IA32_FRED_CONFIG must be zero */
+ if (CC(vmcs12->guest_ia32_fred_config &
+ (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2))) ||
+ CC(vmcs12->guest_ia32_fred_rsp1 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->guest_ia32_fred_rsp2 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->guest_ia32_fred_rsp3 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->guest_ia32_fred_ssp1 & GENMASK_ULL(2, 0)) ||
+ CC(vmcs12->guest_ia32_fred_ssp2 & GENMASK_ULL(2, 0)) ||
+ CC(vmcs12->guest_ia32_fred_ssp3 & GENMASK_ULL(2, 0)) ||
+ CC(is_noncanonical_address(vmcs12->guest_ia32_fred_config & PAGE_MASK, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->guest_ia32_fred_rsp1, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->guest_ia32_fred_rsp2, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->guest_ia32_fred_rsp3, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->guest_ia32_fred_ssp1, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->guest_ia32_fred_ssp2, vcpu)) ||
+ CC(is_noncanonical_address(vmcs12->guest_ia32_fred_ssp3, vcpu)))
+ return -EINVAL;
+ }
+ if (vmcs12->guest_cr4 & X86_CR4_FRED) {
+ unsigned int ss_dpl = VMX_AR_DPL(vmcs12->guest_ss_ar_bytes);
+ if (CC(ss_dpl == 1 || ss_dpl == 2))
+ return -EINVAL;
+ if (ss_dpl == 0 &&
+ CC(!(vmcs12->guest_cs_ar_bytes & VMX_AR_L_MASK)))
+ return -EINVAL;
+ if (ss_dpl == 3 &&
+ (CC(vmcs12->guest_rflags & X86_EFLAGS_IOPL) ||
+ CC(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_STI)))
+ return -EINVAL;
+ }
+ } else {
+ if (CC(vmcs12->guest_cr4 & X86_CR4_FRED))
+ return -EINVAL;
+ }
+
if (nested_check_guest_non_reg_state(vmcs12))
return -EINVAL;
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH v3 27/27] KVM: nVMX: Allow VMX FRED controls
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (25 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 26/27] KVM: nVMX: Add VMCS FRED states checking Xin Li (Intel)
@ 2024-10-01 5:01 ` Xin Li (Intel)
2025-02-19 0:26 ` [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li
2025-02-28 17:06 ` Sean Christopherson
28 siblings, 0 replies; 81+ messages in thread
From: Xin Li (Intel) @ 2024-10-01 5:01 UTC (permalink / raw)
To: kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3, xin
From: Xin Li <xin3.li@intel.com>
Allow the nVMX FRED controls now that nested FRED support is in place.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/nested.c | 6 ++++--
arch/x86/kvm/vmx/vmx.c | 1 +
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1fbdeea32c98..b1b4483afcda 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -7159,7 +7159,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
* advertise any feature in it to nVMX until its nVMX support
* is ready.
*/
- msrs->secondary_exit_ctls &= 0;
+ msrs->secondary_exit_ctls &= SECONDARY_VM_EXIT_SAVE_IA32_FRED |
+ SECONDARY_VM_EXIT_LOAD_IA32_FRED;
}
}
@@ -7174,7 +7175,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
#ifdef CONFIG_X86_64
VM_ENTRY_IA32E_MODE |
#endif
- VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+ VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+ VM_ENTRY_LOAD_IA32_FRED;
msrs->entry_ctls_high |=
(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 522ee27a4655..ba6a7c6b6727 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7923,6 +7923,7 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
cr4_fixed1_update(X86_CR4_LAM_SUP, eax, feature_bit(LAM));
+ cr4_fixed1_update(X86_CR4_FRED, eax, feature_bit(FRED));
#undef cr4_fixed1_update
}
--
2.46.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH v3 06/27] x86/cea: Export per CPU variable cea_exception_stacks
2024-10-01 5:00 ` [PATCH v3 06/27] x86/cea: Export per CPU variable cea_exception_stacks Xin Li (Intel)
@ 2024-10-01 16:12 ` Dave Hansen
2024-10-01 17:51 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Dave Hansen @ 2024-10-01 16:12 UTC (permalink / raw)
To: Xin Li (Intel), kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3
On 9/30/24 22:00, Xin Li (Intel) wrote:
> The per CPU variable cea_exception_stacks contains per CPU stacks for
> NMI, #DB and #DF, which is referenced in KVM to set host FRED RSP[123]
> each time a vCPU is loaded onto a CPU, thus it needs to be exported.
Nit: It's not obvious how 'cea_exception_stacks' gets used in this
series. It's never referenced explicitly.
I did figure it out by looking for 'RSP[123]' references, but a much
better changelog would be something like:
The per CPU array 'cea_exception_stacks' points to per CPU
stacks for NMI, #DB and #DF. It is normally referenced via the
#define: __this_cpu_ist_top_va().
FRED introduced new fields in the host-state area of the VMCS
for stack levels 1->3 (HOST_IA32_FRED_RSP[123]). KVM must
populate these each time a vCPU is loaded onto a CPU.
See how that explicitly gives the reader greppable strings for
"__this_cpu_ist_top_va" and "HOST_IA32_FRED_RSP"? That makes it much
easier to figure out what is going on.
I was also momentarily confused about why these loads need to be done on
_every_ vCPU load. I think it's because the host state can change as
the vCPU moves around to different physical CPUs and
__this_cpu_ist_top_va() can and will change. But it's a detail that I
think deserves to be explained in the changelog. There is also this
note in vmx_vcpu_load_vmcs():
> /*
> * Linux uses per-cpu TSS and GDT, so set these when switching
> * processors. See 22.2.4.
> */
which makes me think that it might not be bad to pull *all* of the
per-cpu VMCS field population code out into a helper since the reasoning
of why these need to be repopulated is identical.
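Something like this completely untested sketch (the helper name is
made up):

	/*
	 * Host state that varies with the physical CPU a vCPU runs on:
	 * TSS, GDT and, with FRED, the per-CPU exception stacks backing
	 * FRED stack levels 1-3.
	 */
	static void vmx_set_percpu_host_state(int cpu)
	{
		vmcs_writel(HOST_TR_BASE,
			    (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);
		vmcs_writel(HOST_GDTR_BASE, (unsigned long)get_current_gdt_ro());

		if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
			vmcs_write64(HOST_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
			vmcs_write64(HOST_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
			vmcs_write64(HOST_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
		}
	}

That would keep all of the "depends on which CPU we're on" VMCS writes
in one greppable place.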
Also, what's the purpose of clearing GUEST_IA32_FRED_RSP[123] at
init_vmcs() time? I would have thought that those values wouldn't
matter until the VMCS gets loaded at vmx_vcpu_load_vmcs() when they are
overwritten anyway. Or, I could be just totally misunderstanding how
KVM consumes the VMCS. :)
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 06/27] x86/cea: Export per CPU variable cea_exception_stacks
2024-10-01 16:12 ` Dave Hansen
@ 2024-10-01 17:51 ` Xin Li
2024-10-01 18:18 ` Dave Hansen
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-10-01 17:51 UTC (permalink / raw)
To: Dave Hansen, kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3
On 10/1/2024 9:12 AM, Dave Hansen wrote:
> On 9/30/24 22:00, Xin Li (Intel) wrote:
>> The per CPU variable cea_exception_stacks contains per CPU stacks for
>> NMI, #DB and #DF, which is referenced in KVM to set host FRED RSP[123]
>> each time a vCPU is loaded onto a CPU, thus it needs to be exported.
>
> Nit: It's not obvious how 'cea_exception_stacks' get used in this
> series. It's never referenced explicitly.
>
> I did figure it out by looking for 'RSP[123]' references, but a much
> better changelog would be something like:
>
> The per CPU array 'cea_exception_stacks' points to per CPU
> stacks for NMI, #DB and #DF. It is normally referenced via the
> #define: __this_cpu_ist_top_va().
>
> FRED introduced new fields in the host-state area of the VMCS
> for stack levels 1->3 (HOST_IA32_FRED_RSP[123]). KVM must
> populate these each time a vCPU is loaded onto a CPU.
>
Yeah, this is way clearer.
> See how that explicitly gives the reader greppable strings for
> "__this_cpu_ist_top_va" and "HOST_IA32_FRED_RSP"? That makes it much
> easier to figure out what is going on.
Nice for a maintainer in 20 years :)
>
> I was also momentarily confused about why these loads need to be done on
> _every_ vCPU load. I think it's because the host state can change as
> the vCPU moves around to different physical CPUs and
> __this_cpu_ist_top_va() can and will change. But it's a detail that I
> think deserves to be explained in the changelog. There is also this
> note in vmx_vcpu_load_vmcs():
Makes sense to me.
>
>> /*
>> * Linux uses per-cpu TSS and GDT, so set these when switching
>> * processors. See 22.2.4.
>> */
>
> which makes me think that it might not be bad to pull *all* of the
> per-cpu VMCS field population code out into a helper since the reasoning
> of why these need to be repopulated is identical.
>
> Also, what's the purpose of clearing GUEST_IA32_FRED_RSP[123] at
> init_vmcs() time? I would have thought that those values wouldn't
> matter until the VMCS gets loaded at vmx_vcpu_load_vmcs() when they are
> overwritten anyway. Or, I could be just totally misunderstanding how
> KVM consumes the VMCS. :)
I don't see any misunderstanding. However, we just do what the SDM
says, even if it doesn't seem to be a must *logically*.
FRED spec says:
The RESET state of each of the new MSRs is zero. INIT does not change
the value of the new MSRs
Thanks!
Xin
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 06/27] x86/cea: Export per CPU variable cea_exception_stacks
2024-10-01 17:51 ` Xin Li
@ 2024-10-01 18:18 ` Dave Hansen
0 siblings, 0 replies; 81+ messages in thread
From: Dave Hansen @ 2024-10-01 18:18 UTC (permalink / raw)
To: Xin Li, kvm, linux-kernel, linux-doc
Cc: seanjc, pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
luto, peterz, andrew.cooper3
On 10/1/24 10:51, Xin Li wrote:
...>> Also, what's the purpose of clearing GUEST_IA32_FRED_RSP[123] at
>> init_vmcs() time? I would have thought that those values wouldn't
>> matter until the VMCS gets loaded at vmx_vcpu_load_vmcs() when they are
>> overwritten anyway. Or, I could be just totally misunderstanding how
>> KVM consumes the VMCS. 🙂
>
> I don't see any misunderstanding. However, we just do what the SDM
> says, even if it doesn't seem to be a must *logically*.
>
> FRED spec says:
> The RESET state of each of the new MSRs is zero. INIT does not change
> the value of the new MSRs
Oh, sorry. I was misreading the "HOST_" and "GUEST_" MSR prefixes. I
thought the same VMCS field was being written at VMCS load *and* init
time (which it isn't). Sorry for the noise.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 03/27] KVM: VMX: Add support for the secondary VM exit controls
2024-10-01 5:00 ` [PATCH v3 03/27] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
@ 2024-10-21 8:28 ` Chao Gao
2024-10-21 17:03 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-21 8:28 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
>@@ -2713,21 +2715,43 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> &_vmentry_control))
> return -EIO;
>
>- for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit_pairs); i++) {
>- u32 n_ctrl = vmcs_entry_exit_pairs[i].entry_control;
>- u32 x_ctrl = vmcs_entry_exit_pairs[i].exit_control;
>-
>- if (!(_vmentry_control & n_ctrl) == !(_vmexit_control & x_ctrl))
>+ if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
>+ _secondary_vmexit_control =
>+ adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
>+ MSR_IA32_VMX_EXIT_CTLS2);
>+
>+ for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit_triplets); i++) {
>+ u32 n_ctrl = vmcs_entry_exit_triplets[i].entry_control;
>+ u32 x_ctrl = vmcs_entry_exit_triplets[i].exit_control;
>+ u64 x_ctrl_2 = vmcs_entry_exit_triplets[i].exit_2nd_control;
>+ bool has_n = n_ctrl && ((_vmentry_control & n_ctrl) == n_ctrl);
>+ bool has_x = x_ctrl && ((_vmexit_control & x_ctrl) == x_ctrl);
>+ bool has_x_2 = x_ctrl_2 && ((_secondary_vmexit_control & x_ctrl_2) == x_ctrl_2);
>+
>+ if (x_ctrl_2) {
>+ /* Only activate secondary VM exit control bit should be set */
>+ if ((_vmexit_control & x_ctrl) == VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
>+ if (has_n == has_x_2)
>+ continue;
>+ } else {
>+ /* The feature should not be supported in any control */
>+ if (!has_n && !has_x && !has_x_2)
>+ continue;
>+ }
>+ } else if (has_n == has_x) {
> continue;
>+ }
>
>- pr_warn_once("Inconsistent VM-Entry/VM-Exit pair, entry = %x, exit = %x\n",
>- _vmentry_control & n_ctrl, _vmexit_control & x_ctrl);
>+ pr_warn_once("Inconsistent VM-Entry/VM-Exit triplet, entry = %x, exit = %x, secondary_exit = %llx\n",
>+ _vmentry_control & n_ctrl, _vmexit_control & x_ctrl,
>+ _secondary_vmexit_control & x_ctrl_2);
>
> if (error_on_inconsistent_vmcs_config)
> return -EIO;
>
> _vmentry_control &= ~n_ctrl;
> _vmexit_control &= ~x_ctrl;
w/ patch 4, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS is cleared if FRED fails the
consistency check. This means all features in the secondary VM-exit controls
are removed, which is overkill.
I prefer to maintain a separate table for the secondary VM-exit controls:
struct {
u32 entry_control;
u64 exit2_control;
} const vmcs_entry_exit2_pairs[] = {
{ VM_ENTRY_LOAD_IA32_FRED, SECONDARY_VM_EXIT_SAVE_IA32_FRED |
SECONDARY_VM_EXIT_LOAD_IA32_FRED},
};
for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit2_pairs); i++) {
...
}
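Just to sketch what I mean, the loop body could look something like
this (untested):

	for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit2_pairs); i++) {
		u32 n_ctrl = vmcs_entry_exit2_pairs[i].entry_control;
		u64 x_ctrl_2 = vmcs_entry_exit2_pairs[i].exit2_control;
		bool has_n = (_vmentry_control & n_ctrl) == n_ctrl;
		bool has_x_2 = (_secondary_vmexit_control & x_ctrl_2) == x_ctrl_2;

		if (has_n == has_x_2)
			continue;

		pr_warn_once("Inconsistent VM-Entry/secondary VM-Exit pair, entry = %x, secondary_exit = %llx\n",
			     _vmentry_control & n_ctrl,
			     _secondary_vmexit_control & x_ctrl_2);

		if (error_on_inconsistent_vmcs_config)
			return -EIO;

		/*
		 * Clear only the offending bits so that a FRED mismatch
		 * doesn't take VM_EXIT_ACTIVATE_SECONDARY_CONTROLS (and
		 * every other secondary VM-exit feature) down with it.
		 */
		_vmentry_control &= ~n_ctrl;
		_secondary_vmexit_control &= ~x_ctrl_2;
	}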
>+ _secondary_vmexit_control &= ~x_ctrl_2;
> }
>
> rdmsrl(MSR_IA32_VMX_BASIC, basic_msr);
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 03/27] KVM: VMX: Add support for the secondary VM exit controls
2024-10-21 8:28 ` Chao Gao
@ 2024-10-21 17:03 ` Xin Li
2024-10-22 2:47 ` Chao Gao
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-10-21 17:03 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/21/2024 1:28 AM, Chao Gao wrote:
>> + for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit_triplets); i++) {
>> + u32 n_ctrl = vmcs_entry_exit_triplets[i].entry_control;
>> + u32 x_ctrl = vmcs_entry_exit_triplets[i].exit_control;
>> + u64 x_ctrl_2 = vmcs_entry_exit_triplets[i].exit_2nd_control;
>> + bool has_n = n_ctrl && ((_vmentry_control & n_ctrl) == n_ctrl);
>> + bool has_x = x_ctrl && ((_vmexit_control & x_ctrl) == x_ctrl);
>> + bool has_x_2 = x_ctrl_2 && ((_secondary_vmexit_control & x_ctrl_2) == x_ctrl_2);
>> +
>> + if (x_ctrl_2) {
>> + /* Only activate secondary VM exit control bit should be set */
>> + if ((_vmexit_control & x_ctrl) == VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
>> + if (has_n == has_x_2)
>> + continue;
>> + } else {
>> + /* The feature should not be supported in any control */
>> + if (!has_n && !has_x && !has_x_2)
>> + continue;
>> + }
>> + } else if (has_n == has_x) {
>> continue;
>> + }
>>
>> - pr_warn_once("Inconsistent VM-Entry/VM-Exit pair, entry = %x, exit = %x\n",
>> - _vmentry_control & n_ctrl, _vmexit_control & x_ctrl);
>> + pr_warn_once("Inconsistent VM-Entry/VM-Exit triplet, entry = %x, exit = %x, secondary_exit = %llx\n",
>> + _vmentry_control & n_ctrl, _vmexit_control & x_ctrl,
>> + _secondary_vmexit_control & x_ctrl_2);
>>
>> if (error_on_inconsistent_vmcs_config)
>> return -EIO;
>>
>> _vmentry_control &= ~n_ctrl;
>> _vmexit_control &= ~x_ctrl;
>
> w/ patch 4, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS is cleared if FRED fails the
> consistency check. This means all features in the secondary VM-exit controls
> are removed, which is overkill.
Good catch!
>
> I prefer to maintain a separate table for the secondary VM-exit controls:
>
> struct {
> u32 entry_control;
> u64 exit2_control;
> } const vmcs_entry_exit2_pairs[] = {
> { VM_ENTRY_LOAD_IA32_FRED, SECONDARY_VM_EXIT_SAVE_IA32_FRED |
> SECONDARY_VM_EXIT_LOAD_IA32_FRED},
> };
>
> for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit2_pairs); i++) {
> ...
> }
Hmm, I prefer one table, as it's more straightforward.
>
>> + _secondary_vmexit_control &= ~x_ctrl_2;
>> }
>>
>> rdmsrl(MSR_IA32_VMX_BASIC, basic_msr);
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 03/27] KVM: VMX: Add support for the secondary VM exit controls
2024-10-21 17:03 ` Xin Li
@ 2024-10-22 2:47 ` Chao Gao
2024-10-22 16:30 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-22 2:47 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Oct 21, 2024 at 10:03:45AM -0700, Xin Li wrote:
>On 10/21/2024 1:28 AM, Chao Gao wrote:
>> > + for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit_triplets); i++) {
>> > + u32 n_ctrl = vmcs_entry_exit_triplets[i].entry_control;
>> > + u32 x_ctrl = vmcs_entry_exit_triplets[i].exit_control;
>> > + u64 x_ctrl_2 = vmcs_entry_exit_triplets[i].exit_2nd_control;
>> > + bool has_n = n_ctrl && ((_vmentry_control & n_ctrl) == n_ctrl);
>> > + bool has_x = x_ctrl && ((_vmexit_control & x_ctrl) == x_ctrl);
>> > + bool has_x_2 = x_ctrl_2 && ((_secondary_vmexit_control & x_ctrl_2) == x_ctrl_2);
>> > +
>> > + if (x_ctrl_2) {
>> > + /* Only activate secondary VM exit control bit should be set */
>> > + if ((_vmexit_control & x_ctrl) == VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
>> > + if (has_n == has_x_2)
>> > + continue;
>> > + } else {
>> > + /* The feature should not be supported in any control */
>> > + if (!has_n && !has_x && !has_x_2)
>> > + continue;
>> > + }
>> > + } else if (has_n == has_x) {
>> > continue;
>> > + }
>> >
>> > - pr_warn_once("Inconsistent VM-Entry/VM-Exit pair, entry = %x, exit = %x\n",
>> > - _vmentry_control & n_ctrl, _vmexit_control & x_ctrl);
>> > + pr_warn_once("Inconsistent VM-Entry/VM-Exit triplet, entry = %x, exit = %x, secondary_exit = %llx\n",
>> > + _vmentry_control & n_ctrl, _vmexit_control & x_ctrl,
>> > + _secondary_vmexit_control & x_ctrl_2);
>> >
>> > if (error_on_inconsistent_vmcs_config)
>> > return -EIO;
>> >
>> > _vmentry_control &= ~n_ctrl;
>> > _vmexit_control &= ~x_ctrl;
>>
>> w/ patch 4, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS is cleared if FRED fails the
>> consistency check. This means all features in the secondary VM-exit controls
>> are removed, which is overkill.
>
>Good catch!
>
>>
>> I prefer to maintain a separate table for the secondary VM-exit controls:
>>
>> struct {
>> u32 entry_control;
>> u64 exit2_control;
>> } const vmcs_entry_exit2_pairs[] = {
>> { VM_ENTRY_LOAD_IA32_FRED, SECONDARY_VM_EXIT_SAVE_IA32_FRED |
>> SECONDARY_VM_EXIT_LOAD_IA32_FRED},
>> };
>>
>> for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit2_pairs); i++) {
>> ...
>> }
>
>Hmm, I prefer one table, as it's more straightforward.
One table is fine if we can fix the issue and improve readability. The three
nested if() statements hurt readability.
I just thought using two tables would eliminate the need for any if() statements.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail
2024-10-01 5:00 ` [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
@ 2024-10-22 8:48 ` Chao Gao
2024-10-22 16:21 ` Xin Li
2024-11-26 15:32 ` Borislav Petkov
1 sibling, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-22 8:48 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Sep 30, 2024 at 10:00:48PM -0700, Xin Li (Intel) wrote:
>From: Xin Li <xin3.li@intel.com>
>
>Do not virtualize FRED if FRED consistency checks fail.
>
>Either on broken hardware, or when KVM runs on top of another hypervisor
>before the underlying hypervisor implements nested FRED correctly.
>
>Suggested-by: Chao Gao <chao.gao@intel.com>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>Signed-off-by: Xin Li (Intel) <xin@zytor.com>
>Tested-by: Shan Kang <shan.kang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
one nit below,
>---
> arch/x86/kvm/vmx/capabilities.h | 7 +++++++
> arch/x86/kvm/vmx/vmx.c | 3 +++
> 2 files changed, 10 insertions(+)
>
>diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
>index e8f3ad0f79ee..2962a3bb9747 100644
>--- a/arch/x86/kvm/vmx/capabilities.h
>+++ b/arch/x86/kvm/vmx/capabilities.h
>@@ -400,6 +400,13 @@ static inline bool vmx_pebs_supported(void)
> return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
> }
>
>+static inline bool cpu_has_vmx_fred(void)
>+{
>+ /* No need to check FRED VM exit controls. */
how about:
/*
* setup_vmcs_config() guarantees FRED VM-entry/exit controls are
* either all set or none. So, no need to check FRED VM-exit controls.
*/
It is better to call out the reason.
>+ return boot_cpu_has(X86_FEATURE_FRED) &&
>+ (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED);
>+}
>+
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 07/27] KVM: VMX: Initialize VMCS FRED fields
2024-10-01 5:00 ` [PATCH v3 07/27] KVM: VMX: Initialize VMCS FRED fields Xin Li (Intel)
@ 2024-10-22 9:06 ` Chao Gao
2024-10-22 16:18 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-22 9:06 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
>@@ -1503,6 +1503,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
> (unsigned long)(cpu_entry_stack(cpu) + 1));
> }
>
>+ /* Per-CPU FRED MSRs */
>+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
>+#ifdef CONFIG_X86_64
>+ vmcs_write64(HOST_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
>+ vmcs_write64(HOST_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
>+ vmcs_write64(HOST_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
>+#endif
>+ vmcs_write64(HOST_IA32_FRED_SSP1, 0);
>+ vmcs_write64(HOST_IA32_FRED_SSP2, 0);
>+ vmcs_write64(HOST_IA32_FRED_SSP3, 0);
Given SSP[1-3] are constant for now, how about asserting that host SSP[1-3] are
all zeros when KVM is loaded and moving their writes to vmx_set_constant_host_state()?
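E.g., a rough, untested sketch (whether to just WARN or to fail
hardware setup is up for debate):

	/*
	 * At hardware setup time, sanity check that the host really runs
	 * with FRED SSP1..3 all zero, so that writing constant zeros into
	 * the VMCS host-state area is safe.
	 */
	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
		u64 ssp;

		rdmsrl(MSR_IA32_FRED_SSP1, ssp);
		WARN_ON_ONCE(ssp);
		rdmsrl(MSR_IA32_FRED_SSP2, ssp);
		WARN_ON_ONCE(ssp);
		rdmsrl(MSR_IA32_FRED_SSP3, ssp);
		WARN_ON_ONCE(ssp);
	}

	/* ... and then in vmx_set_constant_host_state(): */
	if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
		vmcs_write64(HOST_IA32_FRED_SSP1, 0);
		vmcs_write64(HOST_IA32_FRED_SSP2, 0);
		vmcs_write64(HOST_IA32_FRED_SSP3, 0);
	}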
>+ }
>+
> vmx->loaded_vmcs->cpu = cpu;
> }
> }
>@@ -4366,6 +4378,12 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
> */
> vmcs_write16(HOST_DS_SELECTOR, 0);
> vmcs_write16(HOST_ES_SELECTOR, 0);
>+
>+ /* FRED CONFIG and STKLVLS are the same on all CPUs. */
>+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
>+ vmcs_write64(HOST_IA32_FRED_CONFIG, kvm_host.fred_config);
>+ vmcs_write64(HOST_IA32_FRED_STKLVLS, kvm_host.fred_stklvls);
>+ }
> #else
> vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
> vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS); /* 22.2.4 */
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 07/27] KVM: VMX: Initialize VMCS FRED fields
2024-10-22 9:06 ` Chao Gao
@ 2024-10-22 16:18 ` Xin Li
0 siblings, 0 replies; 81+ messages in thread
From: Xin Li @ 2024-10-22 16:18 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/22/2024 2:06 AM, Chao Gao wrote:
>> @@ -1503,6 +1503,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
>> (unsigned long)(cpu_entry_stack(cpu) + 1));
>> }
>>
>> + /* Per-CPU FRED MSRs */
>> + if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
>> +#ifdef CONFIG_X86_64
>> + vmcs_write64(HOST_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
>> + vmcs_write64(HOST_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
>> + vmcs_write64(HOST_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
>> +#endif
>> + vmcs_write64(HOST_IA32_FRED_SSP1, 0);
>> + vmcs_write64(HOST_IA32_FRED_SSP2, 0);
>> + vmcs_write64(HOST_IA32_FRED_SSP3, 0);
>
> Given SSP[1-3] are constant for now, how about asserting that host SSP[1-3] are
> all zeros when KVM is loaded and moving their writes to vmx_set_constant_host_state()?
I like the idea :)
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail
2024-10-22 8:48 ` Chao Gao
@ 2024-10-22 16:21 ` Xin Li
0 siblings, 0 replies; 81+ messages in thread
From: Xin Li @ 2024-10-22 16:21 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/22/2024 1:48 AM, Chao Gao wrote:
> On Mon, Sep 30, 2024 at 10:00:48PM -0700, Xin Li (Intel) wrote:
>> From: Xin Li <xin3.li@intel.com>
>>
>> Do not virtualize FRED if FRED consistency checks fail.
>>
>> Either on broken hardware, or when KVM runs on top of another hypervisor
>> before the underlying hypervisor implements nested FRED correctly.
>>
>> Suggested-by: Chao Gao <chao.gao@intel.com>
>> Signed-off-by: Xin Li <xin3.li@intel.com>
>> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
>> Tested-by: Shan Kang <shan.kang@intel.com>
>
> Reviewed-by: Chao Gao <chao.gao@intel.com>
>
> one nit below,
>
>> ---
>> arch/x86/kvm/vmx/capabilities.h | 7 +++++++
>> arch/x86/kvm/vmx/vmx.c | 3 +++
>> 2 files changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
>> index e8f3ad0f79ee..2962a3bb9747 100644
>> --- a/arch/x86/kvm/vmx/capabilities.h
>> +++ b/arch/x86/kvm/vmx/capabilities.h
>> @@ -400,6 +400,13 @@ static inline bool vmx_pebs_supported(void)
>> return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
>> }
>>
>> +static inline bool cpu_has_vmx_fred(void)
>> +{
>> + /* No need to check FRED VM exit controls. */
>
> how about:
>
> /*
> * setup_vmcs_config() guarantees FRED VM-entry/exit controls are
> * either all set or none. So, no need to check FRED VM-exit controls.
> */
>
> It is better to call out the reason.
>
Makes sense!
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 03/27] KVM: VMX: Add support for the secondary VM exit controls
2024-10-22 2:47 ` Chao Gao
@ 2024-10-22 16:30 ` Xin Li
2025-02-25 17:28 ` Sean Christopherson
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-10-22 16:30 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
>>>> _vmentry_control &= ~n_ctrl;
>>>> _vmexit_control &= ~x_ctrl;
>>>
>>> w/ patch 4, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS is cleared if FRED fails in the
>>> consistent check. this means, all features in the secondary vm-exit controls
>>> are removed. it is overkill.
>>
>> Good catch!
>>
>>>
>>> I prefer to maintain a separate table for the secondary VM-exit controls:
>>>
>>> struct {
>>> u32 entry_control;
>>> u64 exit2_control;
>>> } const vmcs_entry_exit2_pairs[] = {
>>> { VM_ENTRY_LOAD_IA32_FRED, SECONDARY_VM_EXIT_SAVE_IA32_FRED |
>>> SECONDARY_VM_EXIT_LOAD_IA32_FRED},
>>> };
>>>
>>> for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit2_pairs); i++) {
>>> ...
>>> }
>>
>> Hmm, I prefer one table, as it's more straight forward.
>
> One table is fine if we can fix the issue and improve readability. The three
> nested if() statements hurts readability.
You're right! Let's try to make it clearer.
> I just thought using two tables would eliminate the need for any if() statements.
>
One more thing, IIUC, Sean prefers to keep
VM_EXIT_ACTIVATE_SECONDARY_CONTROLS set if it's allowed to be set and
even bits in the 2nd VM exit controls are all 0. I may be able to make
it simpler.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking
2024-10-01 5:00 ` [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
@ 2024-10-24 6:24 ` Chao Gao
2024-10-25 8:04 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-24 6:24 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>index b9b82aaea9a3..3830084b569b 100644
>--- a/arch/x86/include/asm/kvm_host.h
>+++ b/arch/x86/include/asm/kvm_host.h
>@@ -736,6 +736,7 @@ struct kvm_queued_exception {
> u32 error_code;
> unsigned long payload;
> bool has_payload;
>+ bool nested;
> u64 event_data;
how "nested" is migrated in live migration?
> };
[..]
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index d81144bd648f..03f42b218554 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -1910,8 +1910,11 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
> vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> vmx->vcpu.arch.event_exit_inst_len);
> intr_info |= INTR_TYPE_SOFT_EXCEPTION;
>- } else
>+ } else {
> intr_info |= INTR_TYPE_HARD_EXCEPTION;
>+ if (ex->nested)
>+ intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
How about moving the is_fred_enabled() check from kvm_multiple_exception() to here? I.e.,
if (ex->nested && is_fred_enabled(vcpu))
intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
It is slightly clearer because FRED details don't bleed into kvm_multiple_exception().
>+ }
>
> vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
>
>@@ -7290,6 +7293,7 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
> kvm_requeue_exception(vcpu, vector,
> idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
> error_code,
>+ idt_vectoring_info & VECTORING_INFO_NESTED_EXCEPTION_MASK,
> event_data);
> break;
> }
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 7a55c1eb5297..8546629166e9 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -874,6 +874,11 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
> vcpu->arch.exception.pending = true;
> vcpu->arch.exception.injected = false;
>
>+ vcpu->arch.exception.nested = vcpu->arch.exception.nested ||
>+ (is_fred_enabled(vcpu) &&
>+ (vcpu->arch.nmi_injected ||
>+ vcpu->arch.interrupt.injected));
>+
> vcpu->arch.exception.has_error_code = has_error;
> vcpu->arch.exception.vector = nr;
> vcpu->arch.exception.error_code = error_code;
>@@ -903,8 +908,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
> vcpu->arch.exception.injected = false;
> vcpu->arch.exception.pending = false;
>
>+ /* #DF is NOT a nested event, per its definition. */
>+ vcpu->arch.exception.nested = false;
>+
> kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
> } else {
>+ vcpu->arch.exception.nested = is_fred_enabled(vcpu);
>+
> /* replace previous exception with a new one in a hope
> that instruction re-execution will regenerate lost
> exception */
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED
2024-10-01 5:01 ` [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED Xin Li (Intel)
@ 2024-10-24 7:18 ` Chao Gao
2024-12-12 18:48 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-24 7:18 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 03f42b218554..bfdd10773136 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -8009,6 +8009,10 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);
> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_FRED);
>
>+ /* Don't allow CR4.FRED=1 before all of FRED KVM support is in place. */
>+ if (!guest_can_use(vcpu, X86_FEATURE_FRED))
>+ vcpu->arch.cr4_guest_rsvd_bits |= X86_CR4_FRED;
Is this necessary? __kvm_is_valid_cr4() ensures that guests cannot set any bit
that isn't supported by the hardware.
To account for hardware/KVM caps, I think the following changes will work. This
will fix all other bits besides X86_CR4_FRED.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a93ac1b9be9..2bec3ba8e47d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1873,6 +1873,7 @@ struct kvm_arch_async_pf {
extern u32 __read_mostly kvm_nr_uret_msrs;
extern bool __read_mostly allow_smaller_maxphyaddr;
extern bool __read_mostly enable_apicv;
+extern u64 __read_mostly cr4_reserved_bits;
extern struct kvm_x86_ops kvm_x86_ops;
#define kvm_x86_call(func) static_call(kvm_x86_##func)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 2617be544480..57d82fbcfd3f 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -393,8 +393,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
kvm_pmu_refresh(vcpu);
- vcpu->arch.cr4_guest_rsvd_bits =
- __cr4_reserved_bits(guest_cpuid_has, vcpu);
+ vcpu->arch.cr4_guest_rsvd_bits = cr4_reserved_bits |
+ __cr4_reserved_bits(guest_cpuid_has, vcpu);
kvm_hv_set_cpuid(vcpu, kvm_cpuid_has_hyperv(vcpu->arch.cpuid_entries,
vcpu->arch.cpuid_nent));
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 34b52b49f5e6..08b42bbd2342 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -119,7 +119,7 @@ u64 __read_mostly efer_reserved_bits = ~((u64)(EFER_SCE | EFER_LME | EFER_LMA));
static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
#endif
-static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
+u64 __read_mostly cr4_reserved_bits;
#define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_MAP_GPA_RANGE)
@@ -1110,13 +1110,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_xsetbv);
bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
- if (cr4 & cr4_reserved_bits)
- return false;
-
- if (cr4 & vcpu->arch.cr4_guest_rsvd_bits)
- return false;
-
- return true;
+ return !(cr4 & vcpu->arch.cr4_guest_rsvd_bits);
}
EXPORT_SYMBOL_GPL(__kvm_is_valid_cr4);
>+
> vmx_setup_uret_msrs(vmx);
>
> if (cpu_has_secondary_exec_ctrls())
>diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
>index 992e73ee2ec5..0ed91512b757 100644
>--- a/arch/x86/kvm/x86.h
>+++ b/arch/x86/kvm/x86.h
>@@ -561,6 +561,8 @@ enum kvm_msr_access {
> __reserved_bits |= X86_CR4_PCIDE; \
> if (!__cpu_has(__c, X86_FEATURE_LAM)) \
> __reserved_bits |= X86_CR4_LAM_SUP; \
>+ if (!__cpu_has(__c, X86_FEATURE_FRED)) \
>+ __reserved_bits |= X86_CR4_FRED; \
> __reserved_bits; \
> })
>
>--
>2.46.2
>
>
^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH v3 18/27] KVM: VMX: Dump FRED context in dump_vmcs()
2024-10-01 5:01 ` [PATCH v3 18/27] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
@ 2024-10-24 7:23 ` Chao Gao
2024-10-24 16:50 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-24 7:23 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Sep 30, 2024 at 10:01:01PM -0700, Xin Li (Intel) wrote:
>From: Xin Li <xin3.li@intel.com>
>
>Add FRED related VMCS fields to dump_vmcs() to dump FRED context.
Host/guest SSP[1-3] are not dumped. Is this intentional?
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 25/27] KVM: nVMX: Add FRED VMCS fields
2024-10-01 5:01 ` [PATCH v3 25/27] KVM: nVMX: Add FRED " Xin Li (Intel)
@ 2024-10-24 7:42 ` Chao Gao
2024-10-25 7:25 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-24 7:42 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
>@@ -7197,6 +7250,9 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
> msrs->basic |= VMX_BASIC_TRUE_CTLS;
> if (cpu_has_vmx_basic_inout())
> msrs->basic |= VMX_BASIC_INOUT;
>+
>+ if (cpu_has_vmx_fred())
>+ msrs->basic |= VMX_BASIC_NESTED_EXCEPTION;
Why not advertise VMX_BASIC_NESTED_EXCEPTION if the CPU supports it, just like
VMX_BASIC_INOUT right above?
> }
>
> static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
>diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
>index 2c296b6abb8c..5272f617fcef 100644
>--- a/arch/x86/kvm/vmx/nested.h
>+++ b/arch/x86/kvm/vmx/nested.h
>@@ -251,6 +251,14 @@ static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
> return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
> }
>
>+static inline bool nested_cpu_has_fred(struct vmcs12 *vmcs12)
>+{
>+ return vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED &&
>+ vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
>+ vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
>+ vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED;
Is it a requirement in the SDM that the VMM should enable all FRED controls or
none? If not, the VMM is allowed to enable only one or two of them. This means
KVM would need to emulate FRED controls for the L1 VMM as three separate
features.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 21/27] KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup
2024-10-01 5:01 ` [PATCH v3 21/27] KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup Xin Li (Intel)
@ 2024-10-24 7:49 ` Chao Gao
2024-10-25 7:34 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-24 7:49 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Sep 30, 2024 at 10:01:04PM -0700, Xin Li (Intel) wrote:
>From: Xin Li <xin3.li@intel.com>
>
>Set VMX CPU capabilities before initializing nested instead of after,
>as it needs to check VMX CPU capabilities to setup the VMX basic MSR
>for nested.
Which VMX CPU capabilities are needed? after reading patch 25, I still
don't get that.
>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>Signed-off-by: Xin Li (Intel) <xin@zytor.com>
>Tested-by: Shan Kang <shan.kang@intel.com>
>---
> arch/x86/kvm/vmx/vmx.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index ef807194ccbd..522ee27a4655 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -8774,6 +8774,12 @@ __init int vmx_hardware_setup(void)
>
> setup_default_sgx_lepubkeyhash();
>
>+ /*
>+ * VMX CPU capabilities are required to setup the VMX basic MSR for
>+ * nested, so this must be done before nested_vmx_setup_ctls_msrs().
>+ */
>+ vmx_set_cpu_caps();
>+
> if (nested) {
> nested_vmx_setup_ctls_msrs(&vmcs_config, vmx_capability.ept);
>
>@@ -8782,8 +8788,6 @@ __init int vmx_hardware_setup(void)
> return r;
> }
>
>- vmx_set_cpu_caps();
>-
> r = alloc_kvm_area();
> if (r && nested)
> nested_vmx_hardware_unsetup();
>--
>2.46.2
>
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 18/27] KVM: VMX: Dump FRED context in dump_vmcs()
2024-10-24 7:23 ` Chao Gao
@ 2024-10-24 16:50 ` Xin Li
0 siblings, 0 replies; 81+ messages in thread
From: Xin Li @ 2024-10-24 16:50 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/24/2024 12:23 AM, Chao Gao wrote:
> On Mon, Sep 30, 2024 at 10:01:01PM -0700, Xin Li (Intel) wrote:
>> From: Xin Li <xin3.li@intel.com>
>>
>> Add FRED related VMCS fields to dump_vmcs() to dump FRED context.
>
> Host/guest SSP[1-3] are not dumped. Is this intentional?
>
Right, we will see how to do it after CET gets merged.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 25/27] KVM: nVMX: Add FRED VMCS fields
2024-10-24 7:42 ` Chao Gao
@ 2024-10-25 7:25 ` Xin Li
2024-10-28 9:07 ` Chao Gao
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-10-25 7:25 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/24/2024 12:42 AM, Chao Gao wrote:
>> @@ -7197,6 +7250,9 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
>> msrs->basic |= VMX_BASIC_TRUE_CTLS;
>> if (cpu_has_vmx_basic_inout())
>> msrs->basic |= VMX_BASIC_INOUT;
>> +
>> + if (cpu_has_vmx_fred())
>> + msrs->basic |= VMX_BASIC_NESTED_EXCEPTION;
>
> Why not advertise VMX_BASIC_NESTED_EXCEPTION if the CPU supports it, just like
> VMX_BASIC_INOUT right above?
Because VMX nested-exception support only works with FRED.
We could pass host MSR_IA32_VMX_BASIC.VMX_BASIC_NESTED_EXCEPTION to
nested, but it's meaningless w/o VMX FRED.
>
>
>> }
>>
>> static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
>> diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
>> index 2c296b6abb8c..5272f617fcef 100644
>> --- a/arch/x86/kvm/vmx/nested.h
>> +++ b/arch/x86/kvm/vmx/nested.h
>> @@ -251,6 +251,14 @@ static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
>> return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
>> }
>>
>> +static inline bool nested_cpu_has_fred(struct vmcs12 *vmcs12)
>> +{
>> + return vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED &&
>> + vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
>> + vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
>> + vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED;
>
> Is it a requirement in the SDM that the VMM should enable all FRED controls or
> none? If not, the VMM is allowed to enable only one or two of them. This means
> KVM would need to emulate FRED controls for the L1 VMM as three separate
> features.
The SDM doesn't say that. But FRED states are used during and
immediately after VM entry and exit, so I don't see a good reason for a VMM
to enable only one or two of the three save/load configs.
Say if VM_ENTRY_LOAD_IA32_FRED is not set, it means a VMM needs to
switch to guest FRED states before it does a VM entry, which is
absolutely a big mess.
TBH I'm not sure this is the question you have in mind.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 21/27] KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup
2024-10-24 7:49 ` Chao Gao
@ 2024-10-25 7:34 ` Xin Li
2025-02-25 16:01 ` Sean Christopherson
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-10-25 7:34 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/24/2024 12:49 AM, Chao Gao wrote:
> On Mon, Sep 30, 2024 at 10:01:04PM -0700, Xin Li (Intel) wrote:
>> From: Xin Li <xin3.li@intel.com>
>>
>> Set VMX CPU capabilities before initializing nested instead of after,
>> as it needs to check VMX CPU capabilities to setup the VMX basic MSR
>> for nested.
>
> Which VMX CPU capabilities are needed? after reading patch 25, I still
> don't get that.
Sigh, in v2 I had 'if (kvm_cpu_cap_has(X86_FEATURE_FRED))' in
nested_vmx_setup_basic(), which was changed to 'if (cpu_has_vmx_fred())'
in v3. So the reason for the change is gone. But I think logically
the change is still needed; nested setup should be after VMX setup.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking
2024-10-24 6:24 ` Chao Gao
@ 2024-10-25 8:04 ` Xin Li
2024-10-28 6:33 ` Chao Gao
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-10-25 8:04 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/23/2024 11:24 PM, Chao Gao wrote:
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index b9b82aaea9a3..3830084b569b 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -736,6 +736,7 @@ struct kvm_queued_exception {
>> u32 error_code;
>> unsigned long payload;
>> bool has_payload;
>> + bool nested;
>> u64 event_data;
>
> how "nested" is migrated in live migration?
Damn, I forgot it!
Looks like we need to add it to kvm_vcpu_ioctl_x86_{get,set}_vcpu_events(),
but the real question is how to add it to struct kvm_vcpu_events.
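A rough sketch of what the uAPI plumbing could look like, assuming a new flags
bit and a spare byte in struct kvm_vcpu_events; the names
KVM_VCPUEVENT_VALID_NESTED_EXC and exception_is_nested are purely illustrative,
not an existing ABI:

	/* In kvm_vcpu_ioctl_x86_get_vcpu_events(), hypothetical names: */
	events->exception_is_nested = vcpu->arch.exception.nested;
	events->flags |= KVM_VCPUEVENT_VALID_NESTED_EXC;

	/* And in kvm_vcpu_ioctl_x86_set_vcpu_events(): */
	if (events->flags & KVM_VCPUEVENT_VALID_NESTED_EXC)
		vcpu->arch.exception.nested = events->exception_is_nested;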
>
>> };
>
> [..]
>
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d81144bd648f..03f42b218554 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -1910,8 +1910,11 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
>> vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
>> vmx->vcpu.arch.event_exit_inst_len);
>> intr_info |= INTR_TYPE_SOFT_EXCEPTION;
>> - } else
>> + } else {
>> intr_info |= INTR_TYPE_HARD_EXCEPTION;
>> + if (ex->nested)
>> + intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
>
> how about moving the is_fred_enable() check from kvm_multiple_exception() to here? i.e.,
>
> if (ex->nested && is_fred_enabled(vcpu))
> intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
>
> It is slightly clearer because FRED details don't bleed into kvm_multiple_exception().
But FRED is all about events, including exceptions/interrupts/traps/...
Logically, VMX nested exception only works when FRED is enabled; see how
it is set in two places in kvm_multiple_exception().
>
>> + }
>>
>> vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
>>
>> @@ -7290,6 +7293,7 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
>> kvm_requeue_exception(vcpu, vector,
>> idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
>> error_code,
>> + idt_vectoring_info & VECTORING_INFO_NESTED_EXCEPTION_MASK,
>> event_data);
>> break;
>> }
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 7a55c1eb5297..8546629166e9 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -874,6 +874,11 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
>> vcpu->arch.exception.pending = true;
>> vcpu->arch.exception.injected = false;
>>
>> + vcpu->arch.exception.nested = vcpu->arch.exception.nested ||
>> + (is_fred_enabled(vcpu) &&
>> + (vcpu->arch.nmi_injected ||
>> + vcpu->arch.interrupt.injected));
>> +
>> vcpu->arch.exception.has_error_code = has_error;
>> vcpu->arch.exception.vector = nr;
>> vcpu->arch.exception.error_code = error_code;
>> @@ -903,8 +908,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
>> vcpu->arch.exception.injected = false;
>> vcpu->arch.exception.pending = false;
>>
>> + /* #DF is NOT a nested event, per its definition. */
>> + vcpu->arch.exception.nested = false;
>> +
>> kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
>> } else {
>> + vcpu->arch.exception.nested = is_fred_enabled(vcpu);
>> +
>> /* replace previous exception with a new one in a hope
>> that instruction re-execution will regenerate lost
>> exception */
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking
2024-10-25 8:04 ` Xin Li
@ 2024-10-28 6:33 ` Chao Gao
2024-12-05 7:16 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-28 6:33 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
>> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> > index d81144bd648f..03f42b218554 100644
>> > --- a/arch/x86/kvm/vmx/vmx.c
>> > +++ b/arch/x86/kvm/vmx/vmx.c
>> > @@ -1910,8 +1910,11 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
>> > vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
>> > vmx->vcpu.arch.event_exit_inst_len);
>> > intr_info |= INTR_TYPE_SOFT_EXCEPTION;
>> > - } else
>> > + } else {
>> > intr_info |= INTR_TYPE_HARD_EXCEPTION;
>> > + if (ex->nested)
>> > + intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
>>
>> how about moving the is_fred_enable() check from kvm_multiple_exception() to here? i.e.,
>>
>> if (ex->nested && is_fred_enabled(vcpu))
>> intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
>>
>> It is slightly clearer because FRED details don't bleed into kvm_multiple_exception().
>
>But FRED is all about events, including exceptions/interrupts/traps/...
>
>Logically, VMX nested exception only works when FRED is enabled; see how it is
>set in two places in kvm_multiple_exception().
"VMX nested exception only works ..." is what I referred to as "FRED details"
I believe there are several reasons to decouple the "nested exception" concept
from FRED:
1. Readers new to FRED can understand kvm_multiple_exception() without needing
to know FRED details. Readers just need to know nested exceptions are
exceptions encountered while delivering another event (exception/NMI/interrupt).
2. Developing KVM's generic "nested exception" concept can support other vendors.
"nested" becomes a property of an exception. Only how nested exceptions are
reported to guests is specific to vendors (i.e., VMX/SVM).
3. This series handles ex->event_data in a similar way: set it regardless
of FRED enablement and let VMX/SVM code decide to consume or ignore it.
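Putting the three points together, a minimal sketch of the decoupled shape,
built only from code already quoted in this thread:

	/*
	 * Generic x86 code (kvm_multiple_exception()): "nested" is just a
	 * property of the exception, recorded with no FRED knowledge.
	 */
	vcpu->arch.exception.nested = vcpu->arch.nmi_injected ||
				      vcpu->arch.interrupt.injected;

	/*
	 * Vendor code (vmx_inject_exception()): FRED decides whether the
	 * property is actually reported to the guest.
	 */
	if (ex->nested && is_fred_enabled(vcpu))
		intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;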
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 25/27] KVM: nVMX: Add FRED VMCS fields
2024-10-25 7:25 ` Xin Li
@ 2024-10-28 9:07 ` Chao Gao
2024-10-28 18:27 ` Sean Christopherson
0 siblings, 1 reply; 81+ messages in thread
From: Chao Gao @ 2024-10-28 9:07 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Fri, Oct 25, 2024 at 12:25:45AM -0700, Xin Li wrote:
>On 10/24/2024 12:42 AM, Chao Gao wrote:
>> > @@ -7197,6 +7250,9 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
>> > msrs->basic |= VMX_BASIC_TRUE_CTLS;
>> > if (cpu_has_vmx_basic_inout())
>> > msrs->basic |= VMX_BASIC_INOUT;
>> > +
>> > + if (cpu_has_vmx_fred())
>> > + msrs->basic |= VMX_BASIC_NESTED_EXCEPTION;
>>
>> Why not advertise VMX_BASIC_NESTED_EXCEPTION if the CPU supports it, just like
>> VMX_BASIC_INOUT right above?
>
>Because VMX nested-exception support only works with FRED.
>
>We could pass host MSR_IA32_VMX_BASIC.VMX_BASIC_NESTED_EXCEPTION to
>nested, but it's meaningless w/o VMX FRED.
But it seems KVM cannot benefit from this attempt to avoid meaningless
configurations because, on a FRED-capable system, the userspace VMM can choose to
hide FRED and expose VMX nested exceptions alone. KVM needs to handle this case
anyway. I suggest not bothering with it.
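The alternative would mirror the VMX_BASIC_INOUT pattern quoted above;
cpu_has_vmx_basic_nested_exception() is a hypothetical helper modeled on
cpu_has_vmx_basic_inout():

	if (cpu_has_vmx_basic_inout())
		msrs->basic |= VMX_BASIC_INOUT;

	/* Advertise purely on host support, independent of FRED. */
	if (cpu_has_vmx_basic_nested_exception())
		msrs->basic |= VMX_BASIC_NESTED_EXCEPTION;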
>
>>
>>
>> > }
>> >
>> > static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
>> > diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
>> > index 2c296b6abb8c..5272f617fcef 100644
>> > --- a/arch/x86/kvm/vmx/nested.h
>> > +++ b/arch/x86/kvm/vmx/nested.h
>> > @@ -251,6 +251,14 @@ static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
>> > return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
>> > }
>> >
>> > +static inline bool nested_cpu_has_fred(struct vmcs12 *vmcs12)
>> > +{
>> > + return vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED &&
>> > + vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
>> > + vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
>> > + vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED;
>>
>> Is it a requirement in the SDM that the VMM should enable all FRED controls or
>> none? If not, the VMM is allowed to enable only one or two of them. This means
>> KVM would need to emulate FRED controls for the L1 VMM as three separate
>> features.
>
>The SDM doesn't say that. But FRED states are used during and
>immediately after VM entry and exit, so I don't see a good reason for a VMM
>to enable only one or two of the three save/load configs.
>
>Say if VM_ENTRY_LOAD_IA32_FRED is not set, it means a VMM needs to
>switch to guest FRED states before it does a VM entry, which is
>absolutely a big mess.
If the VMM doesn't enable FRED, it's fine to load guest FRED states before VM
entry, right?
The key is to emulate hardware behavior accurately without making assumptions
about guests. If some combinations of controls cannot be emulated properly, KVM
should report internal errors at some point.
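For reference, the existing mechanism for that kind of reporting is
KVM_EXIT_INTERNAL_ERROR; a minimal sketch (the placement is hypothetical, only
the fields are real):

	/* Bail to userspace on a control combination KVM cannot emulate. */
	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
	vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
	vcpu->run->internal.ndata = 0;
	return 0;	/* return to userspace */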
>
>TBH I'm not sure this is the question you have in mind.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 25/27] KVM: nVMX: Add FRED VMCS fields
2024-10-28 9:07 ` Chao Gao
@ 2024-10-28 18:27 ` Sean Christopherson
2024-10-29 17:40 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Sean Christopherson @ 2024-10-28 18:27 UTC (permalink / raw)
To: Chao Gao
Cc: Xin Li, kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Oct 28, 2024, Chao Gao wrote:
> On Fri, Oct 25, 2024 at 12:25:45AM -0700, Xin Li wrote:
> >> > static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
> >> > diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
> >> > index 2c296b6abb8c..5272f617fcef 100644
> >> > --- a/arch/x86/kvm/vmx/nested.h
> >> > +++ b/arch/x86/kvm/vmx/nested.h
> >> > @@ -251,6 +251,14 @@ static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
> >> > return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
> >> > }
> >> >
> >> > +static inline bool nested_cpu_has_fred(struct vmcs12 *vmcs12)
> >> > +{
> >> > + return vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED &&
> >> > + vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
> >> > + vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
> >> > + vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED;
> >>
> >> Is it a requirement in the SDM that the VMM should enable all FRED controls or
> >> none? If not, the VMM is allowed to enable only one or two of them. This means
> >> KVM would need to emulate FRED controls for the L1 VMM as three separate
> >> features.
> >
> >The SDM doesn't say that. But FRED states are used during and
> >immediately after VM entry and exit, so I don't see a good reason for a VMM
> >to enable only one or two of the three save/load configs.
Not KVM's concern.
> >Say if VM_ENTRY_LOAD_IA32_FRED is not set, it means a VMM needs to
> >switch to guest FRED states before it does a VM entry, which is
> >absolutely a big mess.
Again, not KVM's concern.
> If the VMM doesn't enable FRED, it's fine to load guest FRED states before VM
> entry, right?
Yep. Or if L1 is simply broken and elects to manually load FRED state before
VM-Enter instead of using VM_ENTRY_LOAD_IA32_FRED, then any badness that happens
is 100% L1's problem to deal with. KVM's responsibility is to emulate the
architectural behavior; what L1 may or may not do is irrelevant.
> The key is to emulate hardware behavior accurately without making assumptions
> about guests.
+1000
> If some combinations of controls cannot be emulated properly, KVM
> should report internal errors at some point.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 25/27] KVM: nVMX: Add FRED VMCS fields
2024-10-28 18:27 ` Sean Christopherson
@ 2024-10-29 17:40 ` Xin Li
0 siblings, 0 replies; 81+ messages in thread
From: Xin Li @ 2024-10-29 17:40 UTC (permalink / raw)
To: Sean Christopherson, Chao Gao
Cc: kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/28/2024 11:27 AM, Sean Christopherson wrote:
> On Mon, Oct 28, 2024, Chao Gao wrote:
>> On Fri, Oct 25, 2024 at 12:25:45AM -0700, Xin Li wrote:
>>>>> static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
>>>>> diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
>>>>> index 2c296b6abb8c..5272f617fcef 100644
>>>>> --- a/arch/x86/kvm/vmx/nested.h
>>>>> +++ b/arch/x86/kvm/vmx/nested.h
>>>>> @@ -251,6 +251,14 @@ static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
>>>>> return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
>>>>> }
>>>>>
>>>>> +static inline bool nested_cpu_has_fred(struct vmcs12 *vmcs12)
>>>>> +{
>>>>> + return vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED &&
>>>>> + vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
>>>>> + vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
>>>>> + vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED;
>>>>
>>>> Is it a requirement in the SDM that the VMM should enable all FRED controls or
>>>> none? If not, the VMM is allowed to enable only one or two of them. This means
>>>> KVM would need to emulate FRED controls for the L1 VMM as three separate
>>>> features.
>>>
>>> The SDM doesn't say that. But FRED states are used during and
>>> immediately after VM entry and exit, so I don't see a good reason for a VMM
>>> to enable only one or two of the three save/load configs.
>
> Not KVM's concern.
>
>>> Say if VM_ENTRY_LOAD_IA32_FRED is not set, it means a VMM needs to
>>> switch to guest FRED states before it does a VM entry, which is
>>> absolutely a big mess.
>
> Again, not KVM's concern.
>
>> If the VMM doesn't enable FRED, it's fine to load guest FRED states before VM
>> entry, right?
>
> Yep. Or if L1 is simply broken and elects to manually load FRED state before
> VM-Enter instead of using VM_ENTRY_LOAD_IA32_FRED, then any badness that happens
> is 100% L1's problem to deal with. KVM's responsibility is to emulate the
> architectural behavior; what L1 may or may not do is irrelevant.
Damn, obviously I COMPLETELY missed this point.
Let me think about how KVM as L0 should handle it.
>
>> The key is to emulate hardware behavior accurately without making assumptions
>> about guests.
>
> +1000
>
>> If some combinations of controls cannot be emulated properly, KVM
>> should report internal errors at some point.
Yeah, only if it CANNOT. Otherwise a broken VMM would behave differently on
real hardware and on KVM, even if it crashes in a way it never knows
about, right?
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 10/27] KVM: VMX: Set FRED MSR interception
2024-10-01 5:00 ` [PATCH v3 10/27] KVM: VMX: Set FRED MSR interception Xin Li (Intel)
@ 2024-11-13 11:31 ` Chao Gao
0 siblings, 0 replies; 81+ messages in thread
From: Chao Gao @ 2024-11-13 11:31 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Sep 30, 2024 at 10:00:53PM -0700, Xin Li (Intel) wrote:
>From: Xin Li <xin3.li@intel.com>
>
>Add FRED MSRs to the VMX passthrough MSR list and set FRED MSRs
>interception.
>
>8 FRED MSRs, i.e., MSR_IA32_FRED_RSP[123], MSR_IA32_FRED_STKLVLS,
>MSR_IA32_FRED_SSP[123] and MSR_IA32_FRED_CONFIG, are all safe to be
>passthrough, because they all have a pair of corresponding host and
>guest VMCS fields.
>
>Both MSR_IA32_FRED_RSP0 and MSR_IA32_FRED_SSP0 are dedicated for user
>level event delivery only, IOW they are NOT used in any kernel event
>delivery and the execution of ERETS. Thus KVM can run safely with
>guest values in the 2 MSRs. As a result, save and restore of their
>guest values are postponed until vCPU context switching and their host
>values are restored on returning to userspace.
>
>Save/restore of MSR_IA32_FRED_RSP0 is done in the next patch.
>
>Note, as MSR_IA32_FRED_SSP0 is an alias of MSR_IA32_PL0_SSP, its save
>and restore is done through the CET supervisor context management.
But CET may not be supported by either the host or the guest. How will
MSR_IA32_FRED_SSP0 be switched in this case? I think that's part of the reason
why Sean suggested [*] intercepting the MSR when CET isn't exposed to the
guest.
[*]: https://lore.kernel.org/kvm/ZvQaNRhrsSJTYji3@google.com/#t
>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>Signed-off-by: Xin Li (Intel) <xin@zytor.com>
>Tested-by: Shan Kang <shan.kang@intel.com>
>---
> arch/x86/kvm/vmx/vmx.c | 34 ++++++++++++++++++++++++++++++++++
> 1 file changed, 34 insertions(+)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 28cf89c97bda..c10c955722a3 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -176,6 +176,16 @@ static u32 vmx_possible_passthrough_msrs[] = {
> MSR_FS_BASE,
> MSR_GS_BASE,
> MSR_KERNEL_GS_BASE,
>+ MSR_IA32_FRED_RSP0,
>+ MSR_IA32_FRED_RSP1,
>+ MSR_IA32_FRED_RSP2,
>+ MSR_IA32_FRED_RSP3,
>+ MSR_IA32_FRED_STKLVLS,
>+ MSR_IA32_FRED_SSP1,
>+ MSR_IA32_FRED_SSP2,
>+ MSR_IA32_FRED_SSP3,
>+ MSR_IA32_FRED_CONFIG,
>+ MSR_IA32_FRED_SSP0, /* Should be added through CET */
> MSR_IA32_XFD,
> MSR_IA32_XFD_ERR,
> #endif
>@@ -7880,6 +7890,28 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> }
>
>+static void vmx_set_intercept_for_fred_msr(struct kvm_vcpu *vcpu)
>+{
>+ bool flag = !guest_can_use(vcpu, X86_FEATURE_FRED);
>+
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, flag);
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP1, MSR_TYPE_RW, flag);
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP2, MSR_TYPE_RW, flag);
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP3, MSR_TYPE_RW, flag);
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_STKLVLS, MSR_TYPE_RW, flag);
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP1, MSR_TYPE_RW, flag);
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP2, MSR_TYPE_RW, flag);
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP3, MSR_TYPE_RW, flag);
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_CONFIG, MSR_TYPE_RW, flag);
>+
>+ /*
>+ * flag = !(CET.SUPERVISOR_SHADOW_STACK || FRED)
>+ *
>+ * A possible optimization is to intercept SSPs when FRED && !CET.SUPERVISOR_SHADOW_STACK.
>+ */
>+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP0, MSR_TYPE_RW, flag);
To implement the "optimization", you can simply remove this line. Then the CET
series will take care of the interception of this MSR. And please leave a
comment here to explain why this MSR is treated differently from other FRED
MSRs.
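A sketch of the resulting rule once the CET series owns MSR_IA32_FRED_SSP0,
matching the formula in the comment quoted above; querying CET exposure via
guest_can_use(vcpu, X86_FEATURE_SHSTK) is an assumption:

	/*
	 * MSR_IA32_FRED_SSP0 aliases MSR_IA32_PL0_SSP, so pass it through
	 * whenever the guest can use FRED or supervisor shadow stacks.
	 */
	bool intercept_ssp0 = !(guest_can_use(vcpu, X86_FEATURE_FRED) ||
				guest_can_use(vcpu, X86_FEATURE_SHSTK));

	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP0, MSR_TYPE_RW,
				  intercept_ssp0);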
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail
2024-10-01 5:00 ` [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
2024-10-22 8:48 ` Chao Gao
@ 2024-11-26 15:32 ` Borislav Petkov
2024-11-26 18:53 ` Xin Li
1 sibling, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2024-11-26 15:32 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Sep 30, 2024 at 10:00:48PM -0700, Xin Li (Intel) wrote:
> +static inline bool cpu_has_vmx_fred(void)
> +{
> + /* No need to check FRED VM exit controls. */
> + return boot_cpu_has(X86_FEATURE_FRED) &&
For your whole patchset:
s/boot_cpu_has/cpu_feature_enabled/g
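For the snippet quoted above, the substitution is mechanical, e.g.:

-	return boot_cpu_has(X86_FEATURE_FRED) &&
+	return cpu_feature_enabled(X86_FEATURE_FRED) &&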
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-10-01 5:00 ` [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition Xin Li (Intel)
@ 2024-11-26 18:02 ` Borislav Petkov
2024-11-26 19:22 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2024-11-26 18:02 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Sep 30, 2024 at 10:00:52PM -0700, Xin Li (Intel) wrote:
> No need to use MAX_POSSIBLE_PASSTHROUGH_MSRS in the definition of array
> vmx_possible_passthrough_msrs, as the macro name indicates the _possible_
> maximum size of passthrough MSRs.
>
> Use ARRAY_SIZE instead of MAX_POSSIBLE_PASSTHROUGH_MSRS when the size of
> the array is needed and add a BUILD_BUG_ON to make sure the actual array
> size does not exceed the possible maximum size of passthrough MSRs.
This commit message needs to talk about the why - not the what. The latter
should be visible from the diff itself.
What you're not talking about is the sneaked increase of
MAX_POSSIBLE_PASSTHROUGH_MSRS to 64. Something you *should* mention because
the array is full and blablabla...
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index e0d76d2460ef..e7409f8f28b1 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -356,7 +356,7 @@ struct vcpu_vmx {
> struct lbr_desc lbr_desc;
>
> /* Save desired MSR intercept (read: pass-through) state */
> -#define MAX_POSSIBLE_PASSTHROUGH_MSRS 16
> +#define MAX_POSSIBLE_PASSTHROUGH_MSRS 64
^^^
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail
2024-11-26 15:32 ` Borislav Petkov
@ 2024-11-26 18:53 ` Xin Li
2024-11-26 19:04 ` Borislav Petkov
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-11-26 18:53 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 11/26/2024 7:32 AM, Borislav Petkov wrote:
> On Mon, Sep 30, 2024 at 10:00:48PM -0700, Xin Li (Intel) wrote:
>> +static inline bool cpu_has_vmx_fred(void)
>> +{
>> + /* No need to check FRED VM exit controls. */
>> + return boot_cpu_has(X86_FEATURE_FRED) &&
>
> For your whole patchset:
>
> s/boot_cpu_has/cpu_feature_enabled/g
>
Already done based on your reply to other patches.
There are a lot of boot_cpu_has() calls in arch/x86/kvm/, and someone needs to
replace them :-P
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail
2024-11-26 18:53 ` Xin Li
@ 2024-11-26 19:04 ` Borislav Petkov
0 siblings, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2024-11-26 19:04 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Tue, Nov 26, 2024 at 10:53:17AM -0800, Xin Li wrote:
> Already done based on your reply to other patches.
Thx.
> There are a lot of boot_cpu_has() calls in arch/x86/kvm/, and someone needs to
> replace them :-P
There are such all over the tree and it'll happen eventually.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-11-26 18:02 ` Borislav Petkov
@ 2024-11-26 19:22 ` Xin Li
2024-11-26 20:06 ` Borislav Petkov
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-11-26 19:22 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 11/26/2024 10:02 AM, Borislav Petkov wrote:
> On Mon, Sep 30, 2024 at 10:00:52PM -0700, Xin Li (Intel) wrote:
>> No need to use MAX_POSSIBLE_PASSTHROUGH_MSRS in the definition of array
>> vmx_possible_passthrough_msrs, as the macro name indicates the _possible_
>> maximum size of passthrough MSRs.
>>
>> Use ARRAY_SIZE instead of MAX_POSSIBLE_PASSTHROUGH_MSRS when the size of
>> the array is needed and add a BUILD_BUG_ON to make sure the actual array
>> size does not exceed the possible maximum size of passthrough MSRs.
>
> This commit message needs to talk about the why - not the what. The latter
> should be visible from the diff itself.
I shouldn't have written such a changelog...
> What you're not talking about is the sneaked increase of
> MAX_POSSIBLE_PASSTHROUGH_MSRS to 64. Something you *should* mention because
> the array is full and blablabla...
It's still far from full as a bitmap on x86-64; it's just that the
existing use of MAX_POSSIBLE_PASSTHROUGH_MSRS tastes bad.
A better one?
Per the definition, a bitmap on x86-64 is an array of 'unsigned long',
and is at least 64-bit long.
#define DECLARE_BITMAP(name,bits) \
unsigned long name[BITS_TO_LONGS(bits)]
It's inaccurate and error-prone to use a hard-coded possible size of
a bitmap. Use ARRAY_SIZE with an overflow build check instead.
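A sketch of the pattern being proposed, assuming the build-time check lives
somewhere that is always compiled (e.g. hardware setup):

	/* Size the array by its initializer, not by the bitmap capacity. */
	static u32 vmx_possible_passthrough_msrs[] = {
		MSR_IA32_SPEC_CTRL,
		MSR_IA32_PRED_CMD,
		/* ... */
	};

	/* Fail the build if the array ever outgrows the bitmap. */
	BUILD_BUG_ON(ARRAY_SIZE(vmx_possible_passthrough_msrs) >
		     MAX_POSSIBLE_PASSTHROUGH_MSRS);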
>
>> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>> index e0d76d2460ef..e7409f8f28b1 100644
>> --- a/arch/x86/kvm/vmx/vmx.h
>> +++ b/arch/x86/kvm/vmx/vmx.h
>> @@ -356,7 +356,7 @@ struct vcpu_vmx {
>> struct lbr_desc lbr_desc;
>>
>> /* Save desired MSR intercept (read: pass-through) state */
>> -#define MAX_POSSIBLE_PASSTHROUGH_MSRS 16
>> +#define MAX_POSSIBLE_PASSTHROUGH_MSRS 64
> ^^^
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-11-26 19:22 ` Xin Li
@ 2024-11-26 20:06 ` Borislav Petkov
2024-11-27 6:46 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2024-11-26 20:06 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Tue, Nov 26, 2024 at 11:22:45AM -0800, Xin Li wrote:
> It's still far from full as a bitmap on x86-64; it's just that the
> existing use of MAX_POSSIBLE_PASSTHROUGH_MSRS tastes bad.
Far from full?
It is full:
static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_IA32_SPEC_CTRL,
MSR_IA32_PRED_CMD,
MSR_IA32_FLUSH_CMD,
MSR_IA32_TSC,
#ifdef CONFIG_X86_64
MSR_FS_BASE,
MSR_GS_BASE,
MSR_KERNEL_GS_BASE,
MSR_IA32_XFD,
MSR_IA32_XFD_ERR,
#endif
MSR_IA32_SYSENTER_CS,
MSR_IA32_SYSENTER_ESP,
MSR_IA32_SYSENTER_EIP,
MSR_CORE_C1_RES,
MSR_CORE_C3_RESIDENCY,
MSR_CORE_C6_RESIDENCY,
MSR_CORE_C7_RESIDENCY,
};
I count 16 here.
If you need to add more, you need to increment MAX_POSSIBLE_PASSTHROUGH_MSRS.
> A better one?
Not really.
You're not explaining why MAX_POSSIBLE_PASSTHROUGH_MSRS becomes 64.
> Per the definition, a bitmap on x86-64 is an array of 'unsigned long',
> and is at least 64-bit long.
>
> #define DECLARE_BITMAP(name,bits) \
> unsigned long name[BITS_TO_LONGS(bits)]
>
> It's inaccurate and error-prone to use a hard-coded possible size of
> a bitmap. Use ARRAY_SIZE with an overflow build check instead.
It becomes 64 because a bitmap has 64 bits?
Not because you need to add more MSRs to it and thus raise the limit?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-11-26 20:06 ` Borislav Petkov
@ 2024-11-27 6:46 ` Xin Li
2024-11-27 6:55 ` Borislav Petkov
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-11-27 6:46 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 11/26/2024 12:06 PM, Borislav Petkov wrote:
> On Tue, Nov 26, 2024 at 11:22:45AM -0800, Xin Li wrote:
>> It's still far from full as a bitmap on x86-64; it's just that the
>> existing use of MAX_POSSIBLE_PASSTHROUGH_MSRS tastes bad.
>
> Far from full?
>
> It is full:
>
> static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
> MSR_IA32_SPEC_CTRL,
> MSR_IA32_PRED_CMD,
> MSR_IA32_FLUSH_CMD,
> MSR_IA32_TSC,
> #ifdef CONFIG_X86_64
> MSR_FS_BASE,
> MSR_GS_BASE,
> MSR_KERNEL_GS_BASE,
> MSR_IA32_XFD,
> MSR_IA32_XFD_ERR,
> #endif
> MSR_IA32_SYSENTER_CS,
> MSR_IA32_SYSENTER_ESP,
> MSR_IA32_SYSENTER_EIP,
> MSR_CORE_C1_RES,
> MSR_CORE_C3_RESIDENCY,
> MSR_CORE_C6_RESIDENCY,
> MSR_CORE_C7_RESIDENCY,
> };
>
> I count 16 here.
>
> If you need to add more, you need to increment MAX_POSSIBLE_PASSTHROUGH_MSRS.
Yes, the most obvious approach is to simply increase
MAX_POSSIBLE_PASSTHROUGH_MSRS by the number of MSRs to be added into the
array.
However, I hate counting it myself, especially when we have ARRAY_SIZE.
>
>> A better one?
>
> Not really.
>
> You're not explaining why MAX_POSSIBLE_PASSTHROUGH_MSRS becomes 64.
>
>> Per the definition, a bitmap on x86-64 is an array of 'unsigned long',
>> and is at least 64-bit long.
>>
>> #define DECLARE_BITMAP(name,bits) \
>> unsigned long name[BITS_TO_LONGS(bits)]
>>
>> It's not accurate and error-prone to use a hard-coded possible size of
>> a bitmap, Use ARRAY_SIZE with an overflow build check instead.
>
> It becomes 64 because a bitmap has 64 bits?
Yes; maybe it would be better to name the macro MAX_ALLOWED_PASSTHROUGH_MSRS?
>
> Not because you need to add more MSRs to it and thus raise the limit?
Right. It prompted me to look at the code further, though; I think the
existing code could be written in a better way regardless of whether I need
to add more MSRs. And whoever wants to add more won't need to increase
MAX_POSSIBLE_PASSTHROUGH_MSRS (unless, of course, it overflows 64).
Thanks!
Xin
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-11-27 6:46 ` Xin Li
@ 2024-11-27 6:55 ` Borislav Petkov
2024-11-27 7:02 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2024-11-27 6:55 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Tue, Nov 26, 2024 at 10:46:09PM -0800, Xin Li wrote:
> Right. It prompted me to look at the code further, though; I think the
> existing code could be written in a better way regardless of whether I need
> to add more MSRs. And whoever wants to add more won't need to increase
> MAX_POSSIBLE_PASSTHROUGH_MSRS (unless, of course, it overflows 64).
But do you see what I mean?
This patch is "all over the place": what are you actually fixing?
And more importantly, why is it part of this series?
Questions over questions.
So can you pls concentrate and spell out for me what is going on here...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-11-27 6:55 ` Borislav Petkov
@ 2024-11-27 7:02 ` Xin Li
2024-11-27 7:10 ` Borislav Petkov
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-11-27 7:02 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 11/26/2024 10:55 PM, Borislav Petkov wrote:
> On Tue, Nov 26, 2024 at 10:46:09PM -0800, Xin Li wrote:
>> Right. It prompted me to look at the code further, though; I think the
>> existing code could be written in a better way regardless of whether I need
>> to add more MSRs. And whoever wants to add more won't need to increase
>> MAX_POSSIBLE_PASSTHROUGH_MSRS (unless, of course, it overflows 64).
>
> But do you see what I mean?
>
> This patch is "all over the place": what are you actually fixing?
>
> And more importantly, why is it part of this series?
>
> Questions over questions.
>
> So can you pls concentrate and spell out for me what is going on here...
>
This patch cleans up the existing code to better accommodate
new VMX pass-through MSRs. And it can be a standalone one.
Thanks!
Xin
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-11-27 7:02 ` Xin Li
@ 2024-11-27 7:10 ` Borislav Petkov
2024-11-27 7:32 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Borislav Petkov @ 2024-11-27 7:10 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Tue, Nov 26, 2024 at 11:02:31PM -0800, Xin Li wrote:
> This patch cleans up the existing code to better accommodate
> new VMX pass-through MSRs. And it can be a standalone one.
Well, your very *next* patch is adding more MSRs to that array. So it needs to
be part of this series.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-11-27 7:10 ` Borislav Petkov
@ 2024-11-27 7:32 ` Xin Li
2024-11-27 7:58 ` Borislav Petkov
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-11-27 7:32 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 11/26/2024 11:10 PM, Borislav Petkov wrote:
> On Tue, Nov 26, 2024 at 11:02:31PM -0800, Xin Li wrote:
>> This patch cleans up the existing code to better accommodate
>> new VMX pass-through MSRs. And it can be a standalone one.
>
> Well, your very *next* patch is adding more MSRs to that array. So it needs to
> be part of this series.
>
It's self-contained. Another approach is to send cleanup patches in a
separate preparation patch set.
Thanks!
Xin
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
2024-11-27 7:32 ` Xin Li
@ 2024-11-27 7:58 ` Borislav Petkov
0 siblings, 0 replies; 81+ messages in thread
From: Borislav Petkov @ 2024-11-27 7:58 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Tue, Nov 26, 2024 at 11:32:13PM -0800, Xin Li wrote:
> It's self-contained.
It better be. Each patch needs to build and boot on its own.
> Another approach is to send cleanup patches in a separate preparation patch
> set.
Not in this case. The next patch shows *why* you're doing the cleanup, so it
makes sense for them to go together.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking
2024-10-28 6:33 ` Chao Gao
@ 2024-12-05 7:16 ` Xin Li
0 siblings, 0 replies; 81+ messages in thread
From: Xin Li @ 2024-12-05 7:16 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/27/2024 11:33 PM, Chao Gao wrote:
>>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>>> index d81144bd648f..03f42b218554 100644
>>>> --- a/arch/x86/kvm/vmx/vmx.c
>>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>>> @@ -1910,8 +1910,11 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
>>>> vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
>>>> vmx->vcpu.arch.event_exit_inst_len);
>>>> intr_info |= INTR_TYPE_SOFT_EXCEPTION;
>>>> - } else
>>>> + } else {
>>>> intr_info |= INTR_TYPE_HARD_EXCEPTION;
>>>> + if (ex->nested)
>>>> + intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
>>>
>>> how about moving the is_fred_enable() check from kvm_multiple_exception() to here? i.e.,
>>>
>>> if (ex->nested && is_fred_enabled(vcpu))
>>> intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
>>>
>>> It is slightly clearer because FRED details don't bleed into kvm_multiple_exception().
>>
>> But FRED is all about events, including exceptions/interrupts/traps/...
>>
>> Logically, VMX nested exception only works when FRED is enabled; see how it is
>> set in two places in kvm_multiple_exception().
>
> "VMX nested exception only works ..." is what I referred to as "FRED details"
>
> I believe there are several reasons to decouple the "nested exception" concept
> from FRED:
>
> 1. Readers new to FRED can understand kvm_multiple_exception() without needing
> to know FRED details. Readers just need to know nested exceptions are
> exceptions encountered while delivering another event (exception/NMI/interrupt).
>
> 2. Developing KVM's generic "nested exception" concept can support other vendors.
> "nested" becomes a property of an exception. Only how nested exceptions are
> reported to guests is specific to vendors (i.e., VMX/SVM).
>
> 3. This series handles ex->event_data in a similar way: set it regardless
> of FRED enablement and let VMX/SVM code decide to consume or ignore it.
This is a nice way to look at the nature of nested exceptions, and I have
made the change for the next iteration.
Thanks!
Xin
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED
2024-10-24 7:18 ` Chao Gao
@ 2024-12-12 18:48 ` Xin Li
2024-12-12 19:05 ` Sean Christopherson
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2024-12-12 18:48 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-kernel, linux-doc, seanjc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 10/24/2024 12:18 AM, Chao Gao wrote:
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 03f42b218554..bfdd10773136 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -8009,6 +8009,10 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);
>> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_FRED);
>>
>> + /* Don't allow CR4.FRED=1 before all of FRED KVM support is in place. */
>> + if (!guest_can_use(vcpu, X86_FEATURE_FRED))
>> + vcpu->arch.cr4_guest_rsvd_bits |= X86_CR4_FRED;
>
> is this necessary? __kvm_is_valid_cr4() ensures that guests cannot set any bit
> which isn't supported by the hardware.
>
> To account for hardware/KVM caps, I think the following changes will work. This
> will fix all other bits besides X86_CR4_FRED.
This seems like a generic infra improvement; maybe it's better for you to
send it as an individual patch to Sean and the KVM mailing list?
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4a93ac1b9be9..2bec3ba8e47d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1873,6 +1873,7 @@ struct kvm_arch_async_pf {
> extern u32 __read_mostly kvm_nr_uret_msrs;
> extern bool __read_mostly allow_smaller_maxphyaddr;
> extern bool __read_mostly enable_apicv;
> +extern u64 __read_mostly cr4_reserved_bits;
> extern struct kvm_x86_ops kvm_x86_ops;
>
> #define kvm_x86_call(func) static_call(kvm_x86_##func)
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 2617be544480..57d82fbcfd3f 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -393,8 +393,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
>
> kvm_pmu_refresh(vcpu);
> - vcpu->arch.cr4_guest_rsvd_bits =
> - __cr4_reserved_bits(guest_cpuid_has, vcpu);
> + vcpu->arch.cr4_guest_rsvd_bits = cr4_reserved_bits |
> + __cr4_reserved_bits(guest_cpuid_has, vcpu);
>
> kvm_hv_set_cpuid(vcpu, kvm_cpuid_has_hyperv(vcpu->arch.cpuid_entries,
> vcpu->arch.cpuid_nent));
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 34b52b49f5e6..08b42bbd2342 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -119,7 +119,7 @@ u64 __read_mostly efer_reserved_bits = ~((u64)(EFER_SCE | EFER_LME | EFER_LMA));
> static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
> #endif
>
> -static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
> +u64 __read_mostly cr4_reserved_bits;
>
> #define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_MAP_GPA_RANGE)
>
> @@ -1110,13 +1110,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_xsetbv);
>
> bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> {
> - if (cr4 & cr4_reserved_bits)
> - return false;
> -
> - if (cr4 & vcpu->arch.cr4_guest_rsvd_bits)
> - return false;
> -
> - return true;
> + return !(cr4 & vcpu->arch.cr4_guest_rsvd_bits);
> }
> EXPORT_SYMBOL_GPL(__kvm_is_valid_cr4);
>
>
>> +
>> vmx_setup_uret_msrs(vmx);
>>
>> if (cpu_has_secondary_exec_ctrls())
>> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
>> index 992e73ee2ec5..0ed91512b757 100644
>> --- a/arch/x86/kvm/x86.h
>> +++ b/arch/x86/kvm/x86.h
>> @@ -561,6 +561,8 @@ enum kvm_msr_access {
>> __reserved_bits |= X86_CR4_PCIDE; \
>> if (!__cpu_has(__c, X86_FEATURE_LAM)) \
>> __reserved_bits |= X86_CR4_LAM_SUP; \
>> + if (!__cpu_has(__c, X86_FEATURE_FRED)) \
>> + __reserved_bits |= X86_CR4_FRED; \
>> __reserved_bits; \
>> })
>>
>> --
>> 2.46.2
>>
>>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED
2024-12-12 18:48 ` Xin Li
@ 2024-12-12 19:05 ` Sean Christopherson
2024-12-13 18:43 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Sean Christopherson @ 2024-12-12 19:05 UTC (permalink / raw)
To: Xin Li
Cc: Chao Gao, kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Thu, Dec 12, 2024, Xin Li wrote:
> On 10/24/2024 12:18 AM, Chao Gao wrote:
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 03f42b218554..bfdd10773136 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -8009,6 +8009,10 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);
> > > kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_FRED);
> > >
> > > + /* Don't allow CR4.FRED=1 before all of FRED KVM support is in place. */
> > > + if (!guest_can_use(vcpu, X86_FEATURE_FRED))
> > > + vcpu->arch.cr4_guest_rsvd_bits |= X86_CR4_FRED;
> >
> > is this necessary? __kvm_is_valid_cr4() ensures that guests cannot set any bit
> > which isn't supported by the hardware.
> >
> > To account for hardware/KVM caps, I think the following changes will work. This
> > will fix all other bits besides X86_CR4_FRED.
>
> This seems like a generic infra improvement; maybe it's better for you to
> send it as an individual patch to Sean and the KVM mailing list?
Already ahead of y'all :-) (I think, I didn't look closely at this).
https://lore.kernel.org/all/20241128013424.4096668-6-seanjc@google.com
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED
2024-12-12 19:05 ` Sean Christopherson
@ 2024-12-13 18:43 ` Xin Li
0 siblings, 0 replies; 81+ messages in thread
From: Xin Li @ 2024-12-13 18:43 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Gao, kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 12/12/2024 11:05 AM, Sean Christopherson wrote:
> On Thu, Dec 12, 2024, Xin Li wrote:
>> On 10/24/2024 12:18 AM, Chao Gao wrote:
>>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>>> index 03f42b218554..bfdd10773136 100644
>>>> --- a/arch/x86/kvm/vmx/vmx.c
>>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>>> @@ -8009,6 +8009,10 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>>> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);
>>>> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_FRED);
>>>>
>>>> + /* Don't allow CR4.FRED=1 before all of FRED KVM support is in place. */
>>>> + if (!guest_can_use(vcpu, X86_FEATURE_FRED))
>>>> + vcpu->arch.cr4_guest_rsvd_bits |= X86_CR4_FRED;
>>>
>>> is this necessary? __kvm_is_valid_cr4() ensures that guests cannot set any bit
>>> which isn't supported by the hardware.
>>>
>>> To account for hardware/KVM caps, I think the following changes will work. This
>>> will fix all other bits besides X86_CR4_FRED.
>>
>> This seems like a generic infra improvement; maybe it's better for you to
>> send it as an individual patch to Sean and the KVM mailing list?
>
> Already ahead of y'all :-) (I think, I didn't look closely at this).
>
> https://lore.kernel.org/all/20241128013424.4096668-6-seanjc@google.com
Ha, that is nice. Thank you!
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 00/27] Enable FRED with KVM VMX
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (26 preceding siblings ...)
2024-10-01 5:01 ` [PATCH v3 27/27] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
@ 2025-02-19 0:26 ` Xin Li
2025-02-25 15:24 ` Sean Christopherson
2025-02-28 17:06 ` Sean Christopherson
28 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2025-02-19 0:26 UTC (permalink / raw)
To: seanjc, kvm, linux-kernel, linux-doc, Chao Gao
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa, luto,
peterz, andrew.cooper3
On 9/30/2024 10:00 PM, Xin Li (Intel) wrote:
> This patch set enables the Intel flexible return and event delivery
> (FRED) architecture with KVM VMX to allow guests to utilize FRED.
>
> The FRED architecture defines simple new transitions that change
> privilege level (ring transitions). The FRED architecture was
> designed with the following goals:
>
> 1) Improve overall performance and response time by replacing event
> delivery through the interrupt descriptor table (IDT event
> delivery) and event return by the IRET instruction with lower
> latency transitions.
>
> 2) Improve software robustness by ensuring that event delivery
> establishes the full supervisor context and that event return
> establishes the full user context.
>
> The new transitions defined by the FRED architecture are FRED event
> delivery and, for returning from events, two FRED return instructions.
> FRED event delivery can effect a transition from ring 3 to ring 0, but
> it is used also to deliver events incident to ring 0. One FRED
> instruction (ERETU) effects a return from ring 0 to ring 3, while the
> other (ERETS) returns while remaining in ring 0. Collectively, FRED
> event delivery and the FRED return instructions are FRED transitions.
>
> Intel VMX architecture is extended to run FRED guests, and the major
> changes are:
>
> 1) New VMCS fields for FRED context management, which includes two new
> event data VMCS fields, eight new guest FRED context VMCS fields and
> eight new host FRED context VMCS fields.
>
> 2) VMX nested-exception support for proper virtualization of stack
> levels introduced with FRED architecture.
>
> Search for the latest FRED spec in most search engines with this search
> pattern:
>
> site:intel.com FRED (flexible return and event delivery) specification
>
> The first 20 patches add FRED support to VMX, and the rest 7 patches
> add FRED support to nested VMX.
>
>
> Following is the link to the v2 of this patch set:
> https://lore.kernel.org/kvm/20240207172646.3981-1-xin3.li@intel.com/
>
> Sean Christopherson (3):
> KVM: x86: Use a dedicated flow for queueing re-injected exceptions
> KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
> KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
>
> Xin Li (21):
> KVM: VMX: Add support for the secondary VM exit controls
> KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
> KVM: VMX: Disable FRED if FRED consistency checks fail
> KVM: VMX: Initialize VMCS FRED fields
> KVM: x86: Use KVM-governed feature framework to track "FRED enabled"
> KVM: VMX: Set FRED MSR interception
> KVM: VMX: Save/restore guest FRED RSP0
> KVM: VMX: Add support for FRED context save/restore
> KVM: x86: Add a helper to detect if FRED is enabled for a vCPU
> KVM: VMX: Virtualize FRED event_data
> KVM: VMX: Virtualize FRED nested exception tracking
> KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED
> KVM: VMX: Dump FRED context in dump_vmcs()
> KVM: x86: Allow FRED/LKGS to be advertised to guests
> KVM: x86: Allow WRMSRNS to be advertised to guests
> KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup
> KVM: nVMX: Add support for the secondary VM exit controls
> KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
> KVM: nVMX: Add FRED VMCS fields
> KVM: nVMX: Add VMCS FRED states checking
> KVM: nVMX: Allow VMX FRED controls
>
> Xin Li (Intel) (3):
> x86/cea: Export per CPU variable cea_exception_stacks
> KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition
> KVM: nVMX: Add a prerequisite to existence of VMCS fields
Hi Sean,
While I'm waiting for the CET patches for native Linux and KVM to be
upstreamed, do you think it's worth it for you to take the cleanup
and some of the preparation patches first?
Top of my mind are:
KVM: x86: Use a dedicated flow for queueing re-injected exceptions
KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
KVM: nVMX: Add a prerequisite to existence of VMCS fields
KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
Then, especially, the nested exception tracking patch seems a good one, as
Chao Gao suggested decoupling the nested tracking from FRED:
KVM: VMX: Virtualize nested exception tracking
Lastly the patches to add support for the secondary VM exit controls
might go in early as well:
KVM: VMX: Add support for the secondary VM exit controls
KVM: nVMX: Add support for the secondary VM exit controls
But if you don't like the idea please just let me know.
Thanks!
Xin
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH v3 00/27] Enable FRED with KVM VMX
2025-02-19 0:26 ` [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li
@ 2025-02-25 15:24 ` Sean Christopherson
2025-02-25 17:04 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Sean Christopherson @ 2025-02-25 15:24 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, Chao Gao, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Tue, Feb 18, 2025, Xin Li wrote:
> On 9/30/2024 10:00 PM, Xin Li (Intel) wrote:
> While I'm waiting for the CET patches for native Linux and KVM to be
> upstreamed, do you think it's worth it for you to take the cleanup
> and some of the preparation patches first?
Yes, definitely. I'll go through the series and see what I can grab now.
Thanks!
> Top of my mind are:
> KVM: x86: Use a dedicated flow for queueing re-injected exceptions
> KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
> KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
> KVM: nVMX: Add a prerequisite to existence of VMCS fields
> KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
>
> Then in particular, the nested exception tracking patch seems a good one,
> as Chao Gao suggested decoupling the nested tracking from FRED:
> KVM: VMX: Virtualize nested exception tracking
>
> Lastly the patches to add support for the secondary VM exit controls might
> go in early as well:
> KVM: VMX: Add support for the secondary VM exit controls
> KVM: nVMX: Add support for the secondary VM exit controls
>
> But if you don't like the idea please just let me know.
>
> Thanks!
> Xin
* Re: [PATCH v3 20/27] KVM: x86: Allow WRMSRNS to be advertised to guests
2024-10-01 5:01 ` [PATCH v3 20/27] KVM: x86: Allow WRMSRNS " Xin Li (Intel)
@ 2025-02-25 15:41 ` Sean Christopherson
0 siblings, 0 replies; 81+ messages in thread
From: Sean Christopherson @ 2025-02-25 15:41 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Sep 30, 2024, Xin Li (Intel) wrote:
> From: Xin Li <xin3.li@intel.com>
>
> Allow WRMSRNS to be advertised to guests.
The shortlog and this sentence are incorrect. Assuming there are no controls for
WRMSRNS, then KVM isn't allowing anything. Userspace can advertise WRMSRNS support
whenever it wants, and the guest can cleanly execute WRMSRNS regardless of whether
or not it's advertised in CPUID. KVM is simply advertising support to userspace.
> WRMSRNS behaves exactly like WRMSR with the only difference being
Nope, not the only difference.
WRMSR and WRMSRNS use the same basic exit reason (see Appendix C). For WRMSR,
the exit qualification is 0, while for WRMSRNS it is 1.
And the whole reason I went spelunking was to verify that WRMSRNS honors all MSR
exiting controls and generates the same exits. That information needs to be
explicitly stated.
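For illustration, a minimal sketch of how an exit handler could tell the two
instructions apart via the exit qualification; this is hypothetical code, not
from the series (handle_wrmsr() is invented here, while vmx_get_exit_qual()
and kvm_emulate_wrmsr() are existing KVM helpers), and in practice KVM
emulates both instructions identically:

static int handle_wrmsr(struct kvm_vcpu *vcpu)
{
	/* Same basic exit reason; exit qualification 0 = WRMSR, 1 = WRMSRNS. */
	if (vmx_get_exit_qual(vcpu) == 1)
		trace_printk("MSR write exit came from WRMSRNS\n");

	/* Emulation is identical either way: write the MSR and advance RIP. */
	return kvm_emulate_wrmsr(vcpu);
}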
I'll rewrite the shortlog and changelog when applying.
* Re: [PATCH v3 21/27] KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup
2024-10-25 7:34 ` Xin Li
@ 2025-02-25 16:01 ` Sean Christopherson
0 siblings, 0 replies; 81+ messages in thread
From: Sean Christopherson @ 2025-02-25 16:01 UTC (permalink / raw)
To: Xin Li
Cc: Chao Gao, kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Fri, Oct 25, 2024, Xin Li wrote:
> On 10/24/2024 12:49 AM, Chao Gao wrote:
> > On Mon, Sep 30, 2024 at 10:01:04PM -0700, Xin Li (Intel) wrote:
> > > From: Xin Li <xin3.li@intel.com>
> > >
> > > Set VMX CPU capabilities before initializing nested instead of after,
> > > as it needs to check VMX CPU capabilities to setup the VMX basic MSR
> > > for nested.
> >
> > Which VMX CPU capabilities are needed? After reading patch 25, I still
> > don't get that.
Heh, I had the same question. I was worried this was fixing a bug.
> Sigh, in v2 I had 'if (kvm_cpu_cap_has(X86_FEATURE_FRED))' in
> nested_vmx_setup_basic(), which was changed to 'if (cpu_has_vmx_fred())'
> in v3. So the reason for the change is gone. But I think logically
> the change is still needed; nested setup should be after VMX setup.
Hmm, no, I don't think we want to allow nested_vmx_setup_ctls_msrs() to consume
any "output" from vmx_set_cpu_caps(). vmx_set_cpu_caps() is called only on the
CPU that loads kvm-intel.ko, whereas nested_vmx_setup_ctls_msrs() is called on
all CPUs to check for consistency between CPUs.
And thinking more about the relevant flows, there's a flaw with kvm_cpu_caps and
vendor module reload. KVM zeroes kvm_cpu_caps during init, but not until
kvm_set_cpu_caps() is called, i.e. quite some time after KVM has started doing
setup. If KVM had a bug where it checked a feature before kvm_set_cpu_caps(), the bug
could potentially go unnoticed until just the "right" combination of hardware,
module params, and/or Kconfig exposed semi-uninitialized data.
I'll post the below (assuming it actually works) to guard against that. Ideally,
kvm_cpu_cap_get() would WARN if it's used before caps are finalized, but I don't
think the extra protection would be worth the increase in code footprint.
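For illustration, a hypothetical version of that guard; the
kvm_cpu_caps_finalized flag is invented for this sketch, and the real
kvm_cpu_cap_get() in arch/x86/kvm/cpuid.h carries no such check, for exactly
the code-footprint reason above:

static bool kvm_cpu_caps_finalized;	/* would be set at the end of kvm_set_cpu_caps() */

static __always_inline u32 kvm_cpu_cap_get(unsigned int x86_feature)
{
	unsigned int x86_leaf = __feature_leaf(x86_feature);

	/* Catch reads of semi-uninitialized caps during vendor module setup. */
	WARN_ON_ONCE(!kvm_cpu_caps_finalized);

	return kvm_cpu_caps[x86_leaf] & __feature_bit(x86_feature);
}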
--
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 97a90689a9dc..8fd48119bd41 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -817,7 +817,8 @@ do { \
void kvm_set_cpu_caps(void)
{
- memset(kvm_cpu_caps, 0, sizeof(kvm_cpu_caps));
+ WARN_ON_ONCE(!bitmap_empty((void *)kvm_cpu_caps,
+ sizeof(kvm_cpu_caps) * BITS_PER_BYTE));
BUILD_BUG_ON(sizeof(kvm_cpu_caps) - (NKVMCAPINTS * sizeof(*kvm_cpu_caps)) >
sizeof(boot_cpu_data.x86_capability));
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f5685f153e08..075a07412893 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9737,6 +9737,7 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
}
memset(&kvm_caps, 0, sizeof(kvm_caps));
+ memset(kvm_cpu_caps, 0, sizeof(kvm_cpu_caps));
x86_emulator_cache = kvm_alloc_emulator_cache();
if (!x86_emulator_cache) {
* Re: [PATCH v3 24/27] KVM: nVMX: Add a prerequisite to existence of VMCS fields
2024-10-01 5:01 ` [PATCH v3 24/27] KVM: nVMX: Add a prerequisite to existence of VMCS fields Xin Li (Intel)
@ 2025-02-25 16:22 ` Sean Christopherson
2025-02-25 16:37 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Sean Christopherson @ 2025-02-25 16:22 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Mon, Sep 30, 2024, Xin Li (Intel) wrote:
> Add a prerequisite to existence of VMCS fields as some of them exist
> only on processors that support certain CPU features.
>
> This is required to fix KVM unit test VMX_VMCS_ENUM.MAX_INDEX.
If making the KVM-Unit-Test pass is the driving force for this code, then NAK.
We looked at this in detail a few years back, and came to the conclusion that
trying to precisely track which fields are/aren't supported would likely do more
harm than good.
https://lore.kernel.org/all/1629192673-9911-4-git-send-email-robert.hu@linux.intel.com
* Re: [PATCH v3 24/27] KVM: nVMX: Add a prerequisite to existence of VMCS fields
2025-02-25 16:22 ` Sean Christopherson
@ 2025-02-25 16:37 ` Xin Li
2025-02-25 19:32 ` Sean Christopherson
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2025-02-25 16:37 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 2/25/2025 8:22 AM, Sean Christopherson wrote:
> On Mon, Sep 30, 2024, Xin Li (Intel) wrote:
>> Add a prerequisite to existence of VMCS fields as some of them exist
>> only on processors that support certain CPU features.
>>
>> This is required to fix KVM unit test VMX_VMCS_ENUM.MAX_INDEX.
>
> If making the KVM-Unit-Test pass is the driving force for this code, then NAK.
> We looked at this in detail a few years back, and came to the conclusion that
> trying to precisely track which fields are/aren't supported would likely do more
> harm than good.
I have to agree, it's no fun to track which feature(s) a VMCS field is
added by, and the worst part is that one VMCS field could depend on 2+
totally irrelevant features, e.g., the secondary VM exit controls field
exists on CPUs that support:
1) FRED
2) Prematurely busy shadow stack
Thanks for making the ground rule clear.
BTW, why don't we just remove this VMX_VMCS_ENUM.MAX_INDEX test?
Xin
>
> https://lore.kernel.org/all/1629192673-9911-4-git-send-email-robert.hu@linux.intel.com
>
* Re: [PATCH v3 00/27] Enable FRED with KVM VMX
2025-02-25 15:24 ` Sean Christopherson
@ 2025-02-25 17:04 ` Xin Li
2025-02-25 17:35 ` Sean Christopherson
0 siblings, 1 reply; 81+ messages in thread
From: Xin Li @ 2025-02-25 17:04 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, linux-doc, Chao Gao, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 2/25/2025 7:24 AM, Sean Christopherson wrote:
> On Tue, Feb 18, 2025, Xin Li wrote:
>> On 9/30/2024 10:00 PM, Xin Li (Intel) wrote:
>> While I'm waiting for the CET patches for native Linux and KVM to be
>> upstreamed, do you think if it's worth it for you to take the cleanup
>> and some of the preparation patches first?
>
> Yes, definitely. I'll go through the series and see what I can grab now.
I planned to do a rebase and fix the conflicts due to the reordering.
But I'm more than happy for you to do a first round.
BTW, if you plan to take
KVM: VMX: Virtualize nested exception tracking
Then, as Chao Gao suggested, we also need a patch to save/restore the
nested flag of an exception (obviously a corresponding host patch is
needed). The following is a version that I have.
Thanks!
Xin
---
KVM: x86: Save/restore the nested flag of an exception
Save/restore the nested flag of an exception during VM save/restore
and live migration to ensure a correct event stack level is chosen
when a nested exception is injected through FRED event delivery.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---
Changes since v3:
* Add live migration support for exception nested flag (Chao Gao).
---
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 2b52eb77e29c..ed171fa6926f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1180,6 +1180,10 @@ The following bits are defined in the flags field:
fields contain a valid state. This bit will be set whenever
KVM_CAP_EXCEPTION_PAYLOAD is enabled.
+- KVM_VCPUEVENT_VALID_NESTED_FLAG may be set to inform that the
+ exception is a nested exception. This bit will be set whenever
+ KVM_CAP_EXCEPTION_NESTED_FLAG is enabled.
+
- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
triple_fault_pending field contains a valid state. This bit will
be set whenever KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled.
@@ -1279,6 +1283,10 @@ can be set in the flags field to signal that the
exception_has_payload, exception_payload, and exception.pending fields
contain a valid state and shall be written into the VCPU.
+If KVM_CAP_EXCEPTION_NESTED_FLAG is enabled, KVM_VCPUEVENT_VALID_NESTED_FLAG
+can be set in the flags field to inform that the exception is a nested
+exception and exception_is_nested shall be written into the VCPU.
+
If KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled, KVM_VCPUEVENT_VALID_TRIPLE_FAULT
can be set in flags field to signal that the triple_fault field contains
a valid state and shall be written into the VCPU.
@@ -8258,6 +8266,17 @@ KVM exits with the register state of either the L1 or L2 guest
depending on which executed at the time of an exit. Userspace must
take care to differentiate between these cases.
+7.37 KVM_CAP_EXCEPTION_NESTED_FLAG
+----------------------------------
+
+:Architectures: x86
+:Parameters: args[0] whether feature should be enabled or not
+
+With this capability enabled, an exception is saved/restored with the
+additional information of whether it was nested or not. FRED event
+delivery uses this information to ensure a correct event stack level
+is chosen when a VM entry injects a nested exception.
+
8. Other capabilities.
======================
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4cfe1b8f4547..ede2319cee45 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1441,6 +1441,7 @@ struct kvm_arch {
bool guest_can_read_msr_platform_info;
bool exception_payload_enabled;
+ bool exception_nested_flag_enabled;
bool triple_fault_event;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9e75da97bce0..f5167e3a7d0f 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -326,6 +326,7 @@ struct kvm_reinject_control {
#define KVM_VCPUEVENT_VALID_SMM 0x00000008
#define KVM_VCPUEVENT_VALID_PAYLOAD 0x00000010
#define KVM_VCPUEVENT_VALID_TRIPLE_FAULT 0x00000020
+#define KVM_VCPUEVENT_VALID_NESTED_FLAG 0x00000040
/* Interrupt shadow states */
#define KVM_X86_SHADOW_INT_MOV_SS 0x01
@@ -363,7 +364,8 @@ struct kvm_vcpu_events {
struct {
__u8 pending;
} triple_fault;
- __u8 reserved[26];
+ __u8 reserved[25];
+ __u8 exception_is_nested;
__u8 exception_has_payload;
__u64 exception_payload;
};
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 01c945b27f01..80a9fa6ab720 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4675,6 +4675,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_GET_MSR_FEATURES:
case KVM_CAP_MSR_PLATFORM_INFO:
case KVM_CAP_EXCEPTION_PAYLOAD:
+ case KVM_CAP_EXCEPTION_NESTED_FLAG:
case KVM_CAP_X86_TRIPLE_FAULT_EVENT:
case KVM_CAP_SET_GUEST_DEBUG:
case KVM_CAP_LAST_CPU:
@@ -5401,6 +5402,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
events->exception.error_code = ex->error_code;
events->exception_has_payload = ex->has_payload;
events->exception_payload = ex->payload;
+ events->exception_is_nested = ex->nested;
events->interrupt.injected =
vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
@@ -5426,6 +5428,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
| KVM_VCPUEVENT_VALID_SMM);
if (vcpu->kvm->arch.exception_payload_enabled)
events->flags |= KVM_VCPUEVENT_VALID_PAYLOAD;
+ if (vcpu->kvm->arch.exception_nested_flag_enabled)
+ events->flags |= KVM_VCPUEVENT_VALID_NESTED_FLAG;
if (vcpu->kvm->arch.triple_fault_event) {
events->triple_fault.pending =
kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
events->flags |= KVM_VCPUEVENT_VALID_TRIPLE_FAULT;
@@ -5440,7 +5444,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
| KVM_VCPUEVENT_VALID_SHADOW
| KVM_VCPUEVENT_VALID_SMM
| KVM_VCPUEVENT_VALID_PAYLOAD
- | KVM_VCPUEVENT_VALID_TRIPLE_FAULT))
+ | KVM_VCPUEVENT_VALID_TRIPLE_FAULT
+ | KVM_VCPUEVENT_VALID_NESTED_FLAG))
return -EINVAL;
if (events->flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
@@ -5455,6 +5460,13 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
events->exception_has_payload = 0;
}
+ if (events->flags & KVM_VCPUEVENT_VALID_NESTED_FLAG) {
+ if (!vcpu->kvm->arch.exception_nested_flag_enabled)
+ return -EINVAL;
+ } else {
+ events->exception_is_nested = 0;
+ }
+
if ((events->exception.injected || events->exception.pending) &&
(events->exception.nr > 31 || events->exception.nr == NMI_VECTOR))
return -EINVAL;
@@ -5486,6 +5498,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
vcpu->arch.exception.error_code = events->exception.error_code;
vcpu->arch.exception.has_payload = events->exception_has_payload;
vcpu->arch.exception.payload = events->exception_payload;
+ vcpu->arch.exception.nested = events->exception_is_nested;
vcpu->arch.interrupt.injected = events->interrupt.injected;
vcpu->arch.interrupt.nr = events->interrupt.nr;
@@ -6609,6 +6622,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.exception_payload_enabled = cap->args[0];
r = 0;
break;
+ case KVM_CAP_EXCEPTION_NESTED_FLAG:
+ kvm->arch.exception_nested_flag_enabled = cap->args[0];
+ r = 0;
+ break;
case KVM_CAP_X86_TRIPLE_FAULT_EVENT:
kvm->arch.triple_fault_event = cap->args[0];
r = 0;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 45e6d8fca9b9..b79f3c10a887 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -929,6 +929,7 @@ struct kvm_enable_cap {
#define KVM_CAP_PRE_FAULT_MEMORY 236
#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
#define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_EXCEPTION_NESTED_FLAG 239
struct kvm_irq_routing_irqchip {
__u32 irqchip;
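For illustration, a hypothetical userspace snippet showing how a VMM might
consume the proposed uAPI; KVM_CAP_EXCEPTION_NESTED_FLAG,
KVM_VCPUEVENT_VALID_NESTED_FLAG and exception_is_nested only exist with the
patch above applied, and show_exception_nested_flag() is invented here:

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

static void show_exception_nested_flag(int vm_fd, int vcpu_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_EXCEPTION_NESTED_FLAG,
		.args = { 1 },	/* enable */
	};
	struct kvm_vcpu_events events;

	/* Opt in on the VM; the flag is then reported per vCPU. */
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);
	if (events.flags & KVM_VCPUEVENT_VALID_NESTED_FLAG)
		printf("pending exception nested: %u\n",
		       events.exception_is_nested);
}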
>
> Thanks!
>
>> Top of my mind are:
>> KVM: x86: Use a dedicated flow for queueing re-injected exceptions
>> KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
>> KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
>> KVM: nVMX: Add a prerequisite to existence of VMCS fields
>> KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
>>
>> Then in particular, the nested exception tracking patch seems a good one,
>> as Chao Gao suggested decoupling the nested tracking from FRED:
>> KVM: VMX: Virtualize nested exception tracking
>>
>> Lastly the patches to add support for the secondary VM exit controls might
>> go in early as well:
>> KVM: VMX: Add support for the secondary VM exit controls
>> KVM: nVMX: Add support for the secondary VM exit controls
>>
>> But if you don't like the idea please just let me know.
>>
>> Thanks!
>> Xin
>
* Re: [PATCH v3 03/27] KVM: VMX: Add support for the secondary VM exit controls
2024-10-22 16:30 ` Xin Li
@ 2025-02-25 17:28 ` Sean Christopherson
0 siblings, 0 replies; 81+ messages in thread
From: Sean Christopherson @ 2025-02-25 17:28 UTC (permalink / raw)
To: Xin Li
Cc: Chao Gao, kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
[-- Attachment #1: Type: text/plain, Size: 3427 bytes --]
On Tue, Oct 22, 2024, Xin Li wrote:
> > > > > _vmentry_control &= ~n_ctrl;
> > > > > _vmexit_control &= ~x_ctrl;
> > > >
> > > > w/ patch 4, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS is cleared if FRED fails the
> > > > consistency check. This means all features in the secondary VM-exit controls
> > > > are removed. It is overkill.
> > >
> > > Good catch!
> > >
> > > >
> > > > I prefer to maintain a separate table for the secondary VM-exit controls:
> > > >
> > > > struct {
> > > > u32 entry_control;
> > > > u64 exit2_control;
> > > > } const vmcs_entry_exit2_pairs[] = {
> > > > { VM_ENTRY_LOAD_IA32_FRED, SECONDARY_VM_EXIT_SAVE_IA32_FRED |
> > > > SECONDARY_VM_EXIT_LOAD_IA32_FRED},
> > > > };
> > > >
> > > > for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit2_pairs); i++) {
> > > > ...
> > > > }
> > >
> > > Hmm, I prefer one table, as it's more straightforward.
Heh, that's debatable. Also, calling these triplets is *very* misleading.
> > One table is fine if we can fix the issue and improve readability. The three
> > nested if() statements hurt readability.
>
> You're right! Let's try to make it clearer.
I agree with Chao, two tables provide better separation, which makes it easier
to follow what's going on, and avoids "polluting" every entry with empty fields.
If it weren't for the new controls supporting 64 unique bits, and the need to
clear bits in KVM's controls, it'd be trivial to extract processing to a helper
function. But, it's easy enough to solve that conundrum by using a macro instead
of a function. And as a bonus, a macro allows for adding compile-time assertions
to detect typos, e.g. can detect if KVM passes in secondary controls (u64) pairs
with the primary controls (u32) variable.
I'll post the attached patch shortly. I verified it works as expected with a
simulated "bad" FRED CPU.
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c9e5576d99d0..4717d48eabe8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2621,6 +2621,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
u32 _vmentry_control = 0;
u64 basic_msr;
u64 misc_msr;
+ u64 _vmexit2_control = BIT_ULL(1);
/*
* LOAD/SAVE_DEBUG_CONTROLS are absent because both are mandatory.
@@ -2638,6 +2639,13 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
{ VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
};
+ struct {
+ u32 entry_control;
+ u64 exit_control;
+ } const vmcs_entry_exit2_pairs[] = {
+ { 0x00800000, BIT_ULL(0) | BIT_ULL(1) },
+ };
+
memset(vmcs_conf, 0, sizeof(*vmcs_conf));
if (adjust_vmx_controls(KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL,
@@ -2728,6 +2736,12 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
_vmentry_control, _vmexit_control))
return -EIO;
+ if (vmx_check_entry_exit_pairs(vmcs_entry_exit2_pairs,
+ _vmentry_control, _vmexit2_control))
+ return -EIO;
+
+ WARN_ON_ONCE(_vmexit2_control);
+
/*
* Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they
* can't be used due to an errata where VM Exit may incorrectly clear
[-- Attachment #2: 0001-KVM-VMX-Extract-checks-on-entry-exit-control-pairs-t.patch --]
[-- Type: text/x-diff, Size: 4751 bytes --]
From b1def684c93990d1a62c169bb23706137b96b727 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 25 Feb 2025 09:10:32 -0800
Subject: [PATCH] KVM: VMX: Extract checks on entry/exit control pairs to a
helper macro
Extract the checking of entry/exit pairs to a helper macro so that the
code can be reused to process the upcoming "secondary" exit controls (the
primary exit controls field is out of bits). Use a macro instead of a
function to support different sized variables (all secondary exit controls
will be optional and so the MSR doesn't have the fixed-0/fixed-1 split).
Taking the largest size as input is trivial, but handling the modification
of KVM's to-be-used controls is much trickier, e.g. would require bitmap
games to clear bits from a 32-bit bitmap vs. a 64-bit bitmap.
Opportunistically add sanity checks to ensure the size of the controls
match (yay, macro!), e.g. to detect bugs where KVM passes in the pairs for
primary exit controls, but its variable for the secondary exit controls.
To help users triage mismatches, print the control bits that are checked,
not just the actual value. For the foreseeable future, that provides
enough information for a user to determine which fields mismatched. E.g.
until secondary entry controls comes along, all entry bits and thus all
error messages are guaranteed to be unique.
To avoid returning from a macro, which can get quite dangerous, simply
process all pairs even if error_on_inconsistent_vmcs_config is set. The
speed at which KVM rejects module load is not at all interesting.
Keep the error message a "once" printk, even though it would be nice to
print out all mismatching pairs. In practice, the most likely scenario is
that a single pair will be mismatched on all CPUs. Printing all mismatches
generates redundant messages in that situation, and can be extremely noisy
on systems with large numbers of CPUs. If a CPU has multiple mismatches,
not printing every bad pair is the least of the user's concerns.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/vmx.c | 48 +++++++++++++++++++++++++++---------------
1 file changed, 31 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b71392989609..c9e5576d99d0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2582,6 +2582,34 @@ static u64 adjust_vmx_controls64(u64 ctl_opt, u32 msr)
return ctl_opt & allowed;
}
+#define vmx_check_entry_exit_pairs(pairs, entry_controls, exit_controls) \
+({ \
+ int i, r = 0; \
+ \
+ BUILD_BUG_ON(sizeof(pairs[0].entry_control) != sizeof(entry_controls)); \
+ BUILD_BUG_ON(sizeof(pairs[0].exit_control) != sizeof(exit_controls)); \
+ \
+ for (i = 0; i < ARRAY_SIZE(pairs); i++) { \
+ typeof(entry_controls) n_ctrl = pairs[i].entry_control; \
+ typeof(exit_controls) x_ctrl = pairs[i].exit_control; \
+ \
+ if (!(entry_controls & n_ctrl) == !(exit_controls & x_ctrl)) \
+ continue; \
+ \
+ pr_warn_once("Inconsistent VM-Entry/VM-Exit pair, " \
+ "entry = %llx (%llx), exit = %llx (%llx)\n", \
+ (u64)(entry_controls & n_ctrl), (u64)n_ctrl, \
+ (u64)(exit_controls & x_ctrl), (u64)x_ctrl); \
+ \
+ if (error_on_inconsistent_vmcs_config) \
+ r = -EIO; \
+ \
+ entry_controls &= ~n_ctrl; \
+ exit_controls &= ~x_ctrl; \
+ } \
+ r; \
+})
+
static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
struct vmx_capability *vmx_cap)
{
@@ -2593,7 +2621,6 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
u32 _vmentry_control = 0;
u64 basic_msr;
u64 misc_msr;
- int i;
/*
* LOAD/SAVE_DEBUG_CONTROLS are absent because both are mandatory.
@@ -2697,22 +2724,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
&_vmentry_control))
return -EIO;
- for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit_pairs); i++) {
- u32 n_ctrl = vmcs_entry_exit_pairs[i].entry_control;
- u32 x_ctrl = vmcs_entry_exit_pairs[i].exit_control;
-
- if (!(_vmentry_control & n_ctrl) == !(_vmexit_control & x_ctrl))
- continue;
-
- pr_warn_once("Inconsistent VM-Entry/VM-Exit pair, entry = %x, exit = %x\n",
- _vmentry_control & n_ctrl, _vmexit_control & x_ctrl);
-
- if (error_on_inconsistent_vmcs_config)
- return -EIO;
-
- _vmentry_control &= ~n_ctrl;
- _vmexit_control &= ~x_ctrl;
- }
+ if (vmx_check_entry_exit_pairs(vmcs_entry_exit_pairs,
+ _vmentry_control, _vmexit_control))
+ return -EIO;
/*
* Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they
base-commit: fed48e2967f402f561d80075a20c5c9e16866e53
--
2.48.1.658.g4767266eb4-goog
* Re: [PATCH v3 00/27] Enable FRED with KVM VMX
2025-02-25 17:04 ` Xin Li
@ 2025-02-25 17:35 ` Sean Christopherson
2025-02-25 18:48 ` Xin Li
0 siblings, 1 reply; 81+ messages in thread
From: Sean Christopherson @ 2025-02-25 17:35 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, Chao Gao, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Tue, Feb 25, 2025, Xin Li wrote:
> On 2/25/2025 7:24 AM, Sean Christopherson wrote:
> > On Tue, Feb 18, 2025, Xin Li wrote:
> > > On 9/30/2024 10:00 PM, Xin Li (Intel) wrote:
> > > While I'm waiting for the CET patches for native Linux and KVM to be
> > > upstreamed, do you think it's worth it for you to take the cleanup
> > > and some of the preparation patches first?
> >
> > Yes, definitely. I'll go through the series and see what I can grab now.
>
> I planned to do a rebase and fix the conflicts due to the reordering.
> But I'm more than happy for you to do a first round.
For now, I'm only going to grab these:
KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
KVM: x86: Use a dedicated flow for queueing re-injected exceptions
and the WRMSRNS patch. I'll post (and apply, if it looks good) the entry/exit
pairs patch separately.
Easiest thing would be to rebase when all of those hit kvm-x86/next.
> BTW, if you plan to take
> KVM: VMX: Virtualize nested exception tracking
I'm not planning on grabbing this in advance of the FRED series, especially if
it's adding new uAPI. The code doesn't need to exist without FRED, and doesn't
really make much sense to readers without the context of FRED.
> > > Top of my mind are:
> > > KVM: x86: Use a dedicated flow for queueing re-injected exceptions
> > > KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
> > > KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
As above, I'll grab these now.
> > > KVM: nVMX: Add a prerequisite to existence of VMCS fields
> > > KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
Unless there's a really, really good reason to add precise checking, I strongly
prefer to skip these entirely.
> > >
> > > Then in particular, the nested exception tracking patch seems a good one,
> > > as Chao Gao suggested decoupling the nested tracking from FRED:
> > > KVM: VMX: Virtualize nested exception tracking
> > >
> > > Lastly the patches to add support for the secondary VM exit controls might
> > > go in early as well:
> > > KVM: VMX: Add support for the secondary VM exit controls
> > > KVM: nVMX: Add support for the secondary VM exit controls
Unless there's another feature on the horizon that depends on secondary exit controls
(and y'all will be posting patches soon), I'd prefer to just grab these in the FRED
series. With the pairs check prep work out of the way, adding support for the
new controls should be very straightforward, and shouldn't conflict with anything.
* Re: [PATCH v3 00/27] Enable FRED with KVM VMX
2025-02-25 17:35 ` Sean Christopherson
@ 2025-02-25 18:48 ` Xin Li
0 siblings, 0 replies; 81+ messages in thread
From: Xin Li @ 2025-02-25 18:48 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, linux-doc, Chao Gao, pbonzini, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On 2/25/2025 9:35 AM, Sean Christopherson wrote:
> On Tue, Feb 25, 2025, Xin Li wrote:
>> On 2/25/2025 7:24 AM, Sean Christopherson wrote:
>>> On Tue, Feb 18, 2025, Xin Li wrote:
>>>> On 9/30/2024 10:00 PM, Xin Li (Intel) wrote:
>>>> While I'm waiting for the CET patches for native Linux and KVM to be
>>>> upstreamed, do you think it's worth it for you to take the cleanup
>>>> and some of the preparation patches first?
>>>
>>> Yes, definitely. I'll go through the series and see what I can grab now.
>>
>> I planned to do a rebase and fix the conflicts due to the reordering.
>> But I'm more than happy for you to do a first round.
>
> For now, I'm only going to grab these:
>
> KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
> KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
> KVM: x86: Use a dedicated flow for queueing re-injected exceptions
>
> and the WRMSRNS patch. I'll post (and apply, if it looks good) the entry/exit
> pairs patch separately.
>
> Easiest thing would be to rebase when all of those hit kvm-x86/next.
Excellent!
>
>> BTW, if you plan to take
>> KVM: VMX: Virtualize nested exception tracking
>
> I'm not planning on grabbing this in advance of the FRED series, especially if
> it's adding new uAPI. The code doesn't need to exist without FRED, and doesn't
> really make much sense to readers without the context of FRED.
Sounds reasonable.
>
>>>> Top of my mind are:
>>>> KVM: x86: Use a dedicated flow for queueing re-injected exceptions
>>>> KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
>>>> KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
>
> As above, I'll grab these now.
>
>>>> KVM: nVMX: Add a prerequisite to existence of VMCS fields
>>>> KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros
>
> Unless there's a really, really good reason to add precise checking, I strongly
> prefer to skip these entirely.
>
They were there to make kvm-unit-tests happy; given the ground rule above,
it's clear that we don't need them.
They could be used to detect whether an OS is running on a VMM or on bare
metal, but do we really care about the difference? -- We probably only care
if we live in a virtual reality ;-)
>>>>
>>>> Then in particular, the nested exception tracking patch seems a good one,
>>>> as Chao Gao suggested decoupling the nested tracking from FRED:
>>>> KVM: VMX: Virtualize nested exception tracking
>>>>
>>>> Lastly the patches to add support for the secondary VM exit controls might
>>>> go in early as well:
>>>> KVM: VMX: Add support for the secondary VM exit controls
>>>> KVM: nVMX: Add support for the secondary VM exit controls
>
> Unless there's another feature on the horizon that depends on secondary exit controls
> (and y'all will be posting patches soon), I'd prefer to just grab these in the FRED
> series. With the pairs check prep work out of the way, adding support for the
> new controls should be very straightforward, and shouldn't conflict with anything.
NP.
Thanks!
Xin
* Re: [PATCH v3 24/27] KVM: nVMX: Add a prerequisite to existence of VMCS fields
2025-02-25 16:37 ` Xin Li
@ 2025-02-25 19:32 ` Sean Christopherson
0 siblings, 0 replies; 81+ messages in thread
From: Sean Christopherson @ 2025-02-25 19:32 UTC (permalink / raw)
To: Xin Li
Cc: kvm, linux-kernel, linux-doc, pbonzini, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, luto, peterz, andrew.cooper3
On Tue, Feb 25, 2025, Xin Li wrote:
> On 2/25/2025 8:22 AM, Sean Christopherson wrote:
> > On Mon, Sep 30, 2024, Xin Li (Intel) wrote:
> > > Add a prerequisite to existence of VMCS fields as some of them exist
> > > only on processors that support certain CPU features.
> > >
> > > This is required to fix KVM unit test VMX_VMCS_ENUM.MAX_INDEX.
> >
> > If making the KVM-Unit-Test pass is the driving force for this code, then NAK.
> > We looked at this in detail a few years back, and came to the conclusion that
> > trying to precisely track which fields are/aren't supported would likely do more
> > harm than good.
>
> I have to agree, it's no fun to track which feature(s) a VMCS field is
> added by, and the worst part is that one VMCS field could depend on 2+ totally
> irrelevant features, e.g., the secondary VM exit controls field exists on CPUs
> that support:
>
> 1) FRED
> 2) Prematurely busy shadow stack
>
> Thanks for making the ground rule clear.
>
> BTW, why don't we just remove this VMX_VMCS_ENUM.MAX_INDEX test?
Because it's still a valid test, albeit with caveats. KVM's (undocumented?) erratum
is that vmcs12 fields that are supported by KVM are always readable, but that's
mostly an orthogonal issue to VMX_VMCS_ENUM.MAX_INDEX. I.e. KVM can and does
report a correct VMX_VMCS_ENUM.MAX_INDEX based on which VMCS fields KVM emulates.
The big caveat is that VMX_VMCS_ENUM.MAX_INDEX will be wrong if a VM is migrated
to a newer KVM and/or to a host with a superset of functionality. With those
caveats in mind, it's still nice to sanity check that KVM isn't advertising complete
garbage.
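For reference, the reported value comes from a scan of the vmcs12 fields KVM
emulates, roughly as below (paraphrased from nested_vmx_calc_vmcs_enum_msr()
in arch/x86/kvm/vmx/nested.c; details may differ across kernel versions):

static u64 nested_vmx_calc_vmcs_enum_msr(void)
{
	/* Max "index" of the VMCS field encodings, not an index into vmcs12. */
	unsigned int max_idx = 0, idx;
	int i;

	for (i = 0; i < nr_vmcs12_fields; i++) {
		/* The vmcs12 offset table is very sparsely populated. */
		if (!vmcs12_field_offsets[i])
			continue;

		idx = vmcs_field_index(VMCS12_IDX_TO_ENC(i));
		if (idx > max_idx)
			max_idx = idx;
	}

	return (u64)max_idx << VMCS_FIELD_INDEX_SHIFT;
}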
* Re: [PATCH v3 00/27] Enable FRED with KVM VMX
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
` (27 preceding siblings ...)
2025-02-19 0:26 ` [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li
@ 2025-02-28 17:06 ` Sean Christopherson
28 siblings, 0 replies; 81+ messages in thread
From: Sean Christopherson @ 2025-02-28 17:06 UTC (permalink / raw)
To: Sean Christopherson, kvm, linux-kernel, linux-doc, Xin Li (Intel)
Cc: pbonzini, corbet, tglx, mingo, bp, dave.hansen, x86, hpa, luto,
peterz, andrew.cooper3
On Mon, 30 Sep 2024 22:00:43 -0700, Xin Li (Intel) wrote:
> This patch set enables the Intel flexible return and event delivery
> (FRED) architecture with KVM VMX to allow guests to utilize FRED.
>
> The FRED architecture defines simple new transitions that change
> privilege level (ring transitions). The FRED architecture was
> designed with the following goals:
>
> [...]
Applied
[01/27] KVM: x86: Use a dedicated flow for queueing re-injected exceptions
https://github.com/kvm-x86/linux/commit/b50cb2b1555d
to kvm-x86 misc, and
[02/27] KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
https://github.com/kvm-x86/linux/commit/3ef0df3f760f
[14/27] KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
https://github.com/kvm-x86/linux/commit/d62c02af7a96
to kvm-x86 vmx.
--
https://github.com/kvm-x86/linux/tree/next
End of thread, other threads: [~2025-02-28 17:06 UTC | newest]
Thread overview: 81+ messages
2024-10-01 5:00 [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 01/27] KVM: x86: Use a dedicated flow for queueing re-injected exceptions Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 02/27] KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1 Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 03/27] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
2024-10-21 8:28 ` Chao Gao
2024-10-21 17:03 ` Xin Li
2024-10-22 2:47 ` Chao Gao
2024-10-22 16:30 ` Xin Li
2025-02-25 17:28 ` Sean Christopherson
2024-10-01 5:00 ` [PATCH v3 04/27] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 05/27] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
2024-10-22 8:48 ` Chao Gao
2024-10-22 16:21 ` Xin Li
2024-11-26 15:32 ` Borislav Petkov
2024-11-26 18:53 ` Xin Li
2024-11-26 19:04 ` Borislav Petkov
2024-10-01 5:00 ` [PATCH v3 06/27] x86/cea: Export per CPU variable cea_exception_stacks Xin Li (Intel)
2024-10-01 16:12 ` Dave Hansen
2024-10-01 17:51 ` Xin Li
2024-10-01 18:18 ` Dave Hansen
2024-10-01 5:00 ` [PATCH v3 07/27] KVM: VMX: Initialize VMCS FRED fields Xin Li (Intel)
2024-10-22 9:06 ` Chao Gao
2024-10-22 16:18 ` Xin Li
2024-10-01 5:00 ` [PATCH v3 08/27] KVM: x86: Use KVM-governed feature framework to track "FRED enabled" Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 09/27] KVM: VMX: Do not use MAX_POSSIBLE_PASSTHROUGH_MSRS in array definition Xin Li (Intel)
2024-11-26 18:02 ` Borislav Petkov
2024-11-26 19:22 ` Xin Li
2024-11-26 20:06 ` Borislav Petkov
2024-11-27 6:46 ` Xin Li
2024-11-27 6:55 ` Borislav Petkov
2024-11-27 7:02 ` Xin Li
2024-11-27 7:10 ` Borislav Petkov
2024-11-27 7:32 ` Xin Li
2024-11-27 7:58 ` Borislav Petkov
2024-10-01 5:00 ` [PATCH v3 10/27] KVM: VMX: Set FRED MSR interception Xin Li (Intel)
2024-11-13 11:31 ` Chao Gao
2024-10-01 5:00 ` [PATCH v3 11/27] KVM: VMX: Save/restore guest FRED RSP0 Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 12/27] KVM: VMX: Add support for FRED context save/restore Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 13/27] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 14/27] KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 15/27] KVM: VMX: Virtualize FRED event_data Xin Li (Intel)
2024-10-01 5:00 ` [PATCH v3 16/27] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
2024-10-24 6:24 ` Chao Gao
2024-10-25 8:04 ` Xin Li
2024-10-28 6:33 ` Chao Gao
2024-12-05 7:16 ` Xin Li
2024-10-01 5:01 ` [PATCH v3 17/27] KVM: x86: Mark CR4.FRED as not reserved when guest can use FRED Xin Li (Intel)
2024-10-24 7:18 ` Chao Gao
2024-12-12 18:48 ` Xin Li
2024-12-12 19:05 ` Sean Christopherson
2024-12-13 18:43 ` Xin Li
2024-10-01 5:01 ` [PATCH v3 18/27] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
2024-10-24 7:23 ` Chao Gao
2024-10-24 16:50 ` Xin Li
2024-10-01 5:01 ` [PATCH v3 19/27] KVM: x86: Allow FRED/LKGS to be advertised to guests Xin Li (Intel)
2024-10-01 5:01 ` [PATCH v3 20/27] KVM: x86: Allow WRMSRNS " Xin Li (Intel)
2025-02-25 15:41 ` Sean Christopherson
2024-10-01 5:01 ` [PATCH v3 21/27] KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup Xin Li (Intel)
2024-10-24 7:49 ` Chao Gao
2024-10-25 7:34 ` Xin Li
2025-02-25 16:01 ` Sean Christopherson
2024-10-01 5:01 ` [PATCH v3 22/27] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
2024-10-01 5:01 ` [PATCH v3 23/27] KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros Xin Li (Intel)
2024-10-01 5:01 ` [PATCH v3 24/27] KVM: nVMX: Add a prerequisite to existence of VMCS fields Xin Li (Intel)
2025-02-25 16:22 ` Sean Christopherson
2025-02-25 16:37 ` Xin Li
2025-02-25 19:32 ` Sean Christopherson
2024-10-01 5:01 ` [PATCH v3 25/27] KVM: nVMX: Add FRED " Xin Li (Intel)
2024-10-24 7:42 ` Chao Gao
2024-10-25 7:25 ` Xin Li
2024-10-28 9:07 ` Chao Gao
2024-10-28 18:27 ` Sean Christopherson
2024-10-29 17:40 ` Xin Li
2024-10-01 5:01 ` [PATCH v3 26/27] KVM: nVMX: Add VMCS FRED states checking Xin Li (Intel)
2024-10-01 5:01 ` [PATCH v3 27/27] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
2025-02-19 0:26 ` [PATCH v3 00/27] Enable FRED with KVM VMX Xin Li
2025-02-25 15:24 ` Sean Christopherson
2025-02-25 17:04 ` Xin Li
2025-02-25 17:35 ` Sean Christopherson
2025-02-25 18:48 ` Xin Li
2025-02-28 17:06 ` Sean Christopherson