[PATCH v6 00/20] Enable FRED with KVM VMX

* [PATCH v6 00/20] Enable FRED with KVM VMX
@ 2025-08-21 22:36 Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 01/20] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
                   ` (19 more replies)
  0 siblings, 20 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

This patch set enables the Intel flexible return and event delivery
(FRED) architecture with KVM VMX to allow guests to utilize FRED.

The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:

1) Improve overall performance and response time by replacing event
   delivery through the interrupt descriptor table (IDT event
   delivery) and event return by the IRET instruction with lower
   latency transitions.

2) Improve software robustness by ensuring that event delivery
   establishes the full supervisor context and that event return
   establishes the full user context.

The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is used also to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.

The Intel VMX architecture is extended to run FRED guests. The major
changes are:

1) New VMCS fields for FRED context management, which include two new
event data VMCS fields, eight new guest FRED context VMCS fields and
eight new host FRED context VMCS fields.

2) VMX nested-exception support for proper virtualization of stack
levels introduced with the FRED architecture.

Search for the latest FRED spec in most search engines with this search
pattern:

  site:intel.com FRED (flexible return and event delivery) specification

The v5 of this patch set is at:
https://lore.kernel.org/lkml/20250723175341.1284463-1-xin@zytor.com/

Although FRED and CET supervisor shadow stacks are independent CPU
features, FRED unconditionally includes FRED shadow stack pointer
MSRs IA32_FRED_SSP[0123], and IA32_FRED_SSP0 is just an alias of the
CET MSR IA32_PL0_SSP.  In other words, the state management of MSR
IA32_PL0_SSP becomes an overlap area, and Sean requested that FRED
virtualization land after CET virtualization [1].

This v6 patch set is based on the kvm-x86-next-2025.08.20 tag of the
kvm-x86 repo + v13 of the KVM CET patch set, and is also available at
https://github.com/xinli-intel/linux-fred-public.git fred-kvm-v6

Changes in v6:
1) Return KVM_MSR_RET_UNSUPPORTED instead of 1 when FRED is not available
   (Chao Gao).
2) Handle MSR_IA32_PL0_SSP when FRED is enumerated but CET is not.
3) Handle FRED MSR pre-vmenter save/restore (Chao Gao).
4) Save FRED MSRs of vmcs02 at VM-Exit even if an L1 VMM clears
   SECONDARY_VM_EXIT_SAVE_IA32_FRED.
5) Save FRED MSRs in sync_vmcs02_to_vmcs12() instead of
   sync_vmcs02_to_vmcs12_rare().

[1]: https://lore.kernel.org/kvm/ZvQaNRhrsSJTYji3@google.com/

Xin Li (18):
  KVM: VMX: Add support for the secondary VM exit controls
  KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config
  KVM: VMX: Disable FRED if FRED consistency checks fail
  KVM: VMX: Initialize VMCS FRED fields
  KVM: VMX: Set FRED MSR intercepts
  KVM: VMX: Save/restore guest FRED RSP0
  KVM: VMX: Add support for FRED context save/restore
  KVM: x86: Add a helper to detect if FRED is enabled for a vCPU
  KVM: VMX: Virtualize FRED event_data
  KVM: VMX: Virtualize FRED nested exception tracking
  KVM: x86: Mark CR4.FRED as not reserved
  KVM: VMX: Dump FRED context in dump_vmcs()
  KVM: x86: Advertise support for FRED
  KVM: nVMX: Add support for the secondary VM exit controls
  KVM: nVMX: Add FRED VMCS fields to nested VMX context handling
  KVM: nVMX: Add FRED-related VMCS field checks
  KVM: nVMX: Add prerequisites to SHADOW_FIELD_R[OW] macros
  KVM: nVMX: Allow VMX FRED controls

Xin Li (Intel) (2):
  x86/cea: Export an API to get per CPU exception stacks for KVM to use
  KVM: x86: Save/restore the nested flag of an exception

 Documentation/virt/kvm/api.rst            |  21 +-
 Documentation/virt/kvm/x86/nested-vmx.rst |  19 ++
 arch/x86/coco/sev/sev-nmi.c               |   4 +-
 arch/x86/coco/sev/vc-handle.c             |   2 +-
 arch/x86/include/asm/cpu_entry_area.h     |  17 +-
 arch/x86/include/asm/kvm_host.h           |   8 +-
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/include/asm/vmx.h                |  48 ++-
 arch/x86/include/uapi/asm/kvm.h           |   4 +-
 arch/x86/kernel/cpu/common.c              |  10 +-
 arch/x86/kernel/fred.c                    |   6 +-
 arch/x86/kernel/traps.c                   |   2 +-
 arch/x86/kvm/cpuid.c                      |   1 +
 arch/x86/kvm/kvm_cache_regs.h             |  15 +
 arch/x86/kvm/svm/svm.c                    |   2 +-
 arch/x86/kvm/vmx/capabilities.h           |  25 +-
 arch/x86/kvm/vmx/nested.c                 | 338 +++++++++++++++++++---
 arch/x86/kvm/vmx/nested.h                 |  22 ++
 arch/x86/kvm/vmx/vmcs.h                   |   1 +
 arch/x86/kvm/vmx/vmcs12.c                 |  19 ++
 arch/x86/kvm/vmx/vmcs12.h                 |  38 +++
 arch/x86/kvm/vmx/vmcs_shadow_fields.h     |  37 ++-
 arch/x86/kvm/vmx/vmx.c                    | 240 ++++++++++++++-
 arch/x86/kvm/vmx/vmx.h                    |  54 +++-
 arch/x86/kvm/x86.c                        | 115 +++++++-
 arch/x86/kvm/x86.h                        |   8 +-
 arch/x86/mm/cpu_entry_area.c              |  21 ++
 arch/x86/mm/fault.c                       |   2 +-
 include/uapi/linux/kvm.h                  |   1 +
 29 files changed, 976 insertions(+), 105 deletions(-)

-- 
2.50.1


* [PATCH v6 01/20] KVM: VMX: Add support for the secondary VM exit controls
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 02/20] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config Xin Li (Intel)
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Always load the secondary VM exit controls to prepare for FRED enabling.
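
For illustration only, the selection of the secondary VM exit controls
can be modelled in user space roughly as below.  This is a sketch, not
the KVM code: the capability MSR (MSR_IA32_VMX_EXIT_CTLS2) is replaced
by a plain `allowed` parameter, and the `model_*` names are
hypothetical.  Only VM_EXIT_ACTIVATE_SECONDARY_CONTROLS is taken from
this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 31 of the primary VM exit controls, as defined in this patch. */
#define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS	0x80000000u

/*
 * Hypothetical model of the new step in setup_vmcs_config().
 *
 * optional: the controls KVM would like to use
 * allowed:  the controls the CPU reports as supported; in the kernel
 *           this value is read from MSR_IA32_VMX_EXIT_CTLS2
 */
static uint64_t model_secondary_exit_ctrl(uint32_t primary_exit_ctrl,
					  uint64_t optional, uint64_t allowed)
{
	uint64_t ctrl;

	/* The secondary controls only exist if the primary ones activate them. */
	if (!(primary_exit_ctrl & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS))
		return 0;

	ctrl = optional & allowed;

	/* A control that was never requested must not end up enabled. */
	assert((ctrl & ~optional) == 0);

	return ctrl;
}
```

With `optional` equal to 0, as in this patch, the result is always 0;
the value only becomes interesting once later patches add optional
bits.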

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.

Changes in v4:
* Fix clearing VM_EXIT_ACTIVATE_SECONDARY_CONTROLS (Chao Gao).
* Check VM exit/entry consistency based on the new macro from Sean
  Christopherson.

Change in v3:
* Do FRED controls consistency checks in the VM exit/entry consistency
  check framework (Sean Christopherson).

Change in v2:
* Always load the secondary VM exit controls (Sean Christopherson).
---
 arch/x86/include/asm/msr-index.h |  1 +
 arch/x86/include/asm/vmx.h       |  3 +++
 arch/x86/kvm/vmx/capabilities.h  |  9 ++++++++-
 arch/x86/kvm/vmx/vmcs.h          |  1 +
 arch/x86/kvm/vmx/vmx.c           | 29 +++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h           |  7 ++++++-
 6 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 20fa4a79df13..7c59cc5ee044 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1208,6 +1208,7 @@
 #define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
 #define MSR_IA32_VMX_VMFUNC             0x00000491
 #define MSR_IA32_VMX_PROCBASED_CTLS3	0x00000492
+#define MSR_IA32_VMX_EXIT_CTLS2		0x00000493
 
 /* Resctrl MSRs: */
 /* - Intel: */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index c85c50019523..1f60c04d11fb 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -107,6 +107,7 @@
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
 #define VM_EXIT_LOAD_CET_STATE                  0x10000000
+#define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS	0x80000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
@@ -262,6 +263,8 @@ enum vmcs_field {
 	SHARED_EPT_POINTER		= 0x0000203C,
 	PID_POINTER_TABLE		= 0x00002042,
 	PID_POINTER_TABLE_HIGH		= 0x00002043,
+	SECONDARY_VM_EXIT_CONTROLS	= 0x00002044,
+	SECONDARY_VM_EXIT_CONTROLS_HIGH	= 0x00002045,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 47b0dec8665a..7b9e306c359d 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -58,8 +58,9 @@ struct vmcs_config {
 	u32 cpu_based_exec_ctrl;
 	u32 cpu_based_2nd_exec_ctrl;
 	u64 cpu_based_3rd_exec_ctrl;
-	u32 vmexit_ctrl;
 	u32 vmentry_ctrl;
+	u32 vmexit_ctrl;
+	u64 vmexit_2nd_ctrl;
 	u64 misc;
 	struct nested_vmx_msrs nested;
 };
@@ -144,6 +145,12 @@ static inline bool cpu_has_tertiary_exec_ctrls(void)
 		CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
 }
 
+static inline bool cpu_has_secondary_vmexit_ctrls(void)
+{
+	return vmcs_config.vmexit_ctrl &
+		VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
+}
+
 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index b25625314658..ae152a9d1963 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -47,6 +47,7 @@ struct vmcs_host_state {
 struct vmcs_controls_shadow {
 	u32 vm_entry;
 	u32 vm_exit;
+	u64 secondary_vm_exit;
 	u32 pin;
 	u32 exec;
 	u32 secondary_exec;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 989008f5307e..590e0826ba08 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2596,8 +2596,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	u32 _cpu_based_exec_control = 0;
 	u32 _cpu_based_2nd_exec_control = 0;
 	u64 _cpu_based_3rd_exec_control = 0;
-	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
+	u32 _vmexit_control = 0;
+	u64 _vmexit2_control = 0;
 	u64 basic_msr;
 	u64 misc_msr;
 
@@ -2618,6 +2619,12 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 		{ VM_ENTRY_LOAD_CET_STATE,		VM_EXIT_LOAD_CET_STATE },
 	};
 
+	struct {
+		u32 entry_control;
+		u64 exit_control;
+	} const vmcs_entry_exit2_pairs[] = {
+	};
+
 	memset(vmcs_conf, 0, sizeof(*vmcs_conf));
 
 	if (adjust_vmx_controls(KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL,
@@ -2704,10 +2711,19 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				&_vmentry_control))
 		return -EIO;
 
+	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+		_vmexit2_control =
+			adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
+					      MSR_IA32_VMX_EXIT_CTLS2);
+
 	if (vmx_check_entry_exit_pairs(vmcs_entry_exit_pairs,
 				       _vmentry_control, _vmexit_control))
 		return -EIO;
 
+	if (vmx_check_entry_exit_pairs(vmcs_entry_exit2_pairs,
+				       _vmentry_control, _vmexit2_control))
+		return -EIO;
+
 	/*
 	 * Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they
 	 * can't be used due to an errata where VM Exit may incorrectly clear
@@ -2756,8 +2772,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
 	vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
 	vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
-	vmcs_conf->vmexit_ctrl         = _vmexit_control;
 	vmcs_conf->vmentry_ctrl        = _vmentry_control;
+	vmcs_conf->vmexit_ctrl         = _vmexit_control;
+	vmcs_conf->vmexit_2nd_ctrl     = _vmexit2_control;
 	vmcs_conf->misc	= misc_msr;
 
 #if IS_ENABLED(CONFIG_HYPERV)
@@ -4406,6 +4423,11 @@ static u32 vmx_vmexit_ctrl(void)
 		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
 }
 
+static u64 vmx_secondary_vmexit_ctrl(void)
+{
+	return vmcs_config.vmexit_2nd_ctrl;
+}
+
 void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4754,6 +4776,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 
 	vm_exit_controls_set(vmx, vmx_vmexit_ctrl());
 
+	if (cpu_has_secondary_vmexit_ctrls())
+		secondary_vm_exit_controls_set(vmx, vmx_secondary_vmexit_ctrl());
+
 	/* 22.2.1, 20.8.1 */
 	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index ecfdba666465..840e48a2fcc5 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -511,7 +511,11 @@ static inline u8 vmx_get_rvi(void)
 	       VM_EXIT_CLEAR_BNDCFGS |					\
 	       VM_EXIT_PT_CONCEAL_PIP |					\
 	       VM_EXIT_CLEAR_IA32_RTIT_CTL |				\
-	       VM_EXIT_LOAD_CET_STATE)
+	       VM_EXIT_LOAD_CET_STATE |					\
+	       VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+
+#define KVM_REQUIRED_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
 
 #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
 	(PIN_BASED_EXT_INTR_MASK |					\
@@ -616,6 +620,7 @@ static __always_inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##b
 }
 BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
 BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(secondary_vm_exit, SECONDARY_VM_EXIT_CONTROLS, 64)
 BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
-- 
2.50.1



* [PATCH v6 02/20] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 01/20] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 03/20] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Set up the VM entry/exit FRED controls in the global vmcs_config for
proper management of the FRED VMCS fields:
  1) load guest FRED state upon VM entry.
  2) save guest FRED state during VM exit.
  3) load host FRED state during VM exit.

Also add FRED control consistency checks to the existing VM entry/exit
consistency check framework.
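
As a rough user-space illustration of such a pair check (a sketch under
stated assumptions, not the in-kernel macro): the `model_*` names are
hypothetical, and the "all set or none" rule is an assumption taken
from the comment on cpu_has_vmx_fred() in the next patch.  The real
vmx_check_entry_exit_pairs() may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values copied from this patch. */
#define VM_ENTRY_LOAD_IA32_FRED			0x00800000u
#define SECONDARY_VM_EXIT_SAVE_IA32_FRED	(1ull << 0)
#define SECONDARY_VM_EXIT_LOAD_IA32_FRED	(1ull << 1)

struct model_entry_exit2_pair {
	uint32_t entry_control;
	uint64_t exit_control;
};

static const struct model_entry_exit2_pair model_fred_pairs[] = {
	{ VM_ENTRY_LOAD_IA32_FRED,
	  SECONDARY_VM_EXIT_SAVE_IA32_FRED | SECONDARY_VM_EXIT_LOAD_IA32_FRED },
};

/*
 * Hypothetical model: a pair is consistent when the entry control and
 * every paired exit control are all set, or all clear.
 * Returns 0 if every pair is consistent, -1 otherwise (the caller in
 * the patch turns a failure into -EIO).
 */
static int model_check_pairs(const struct model_entry_exit2_pair *pairs,
			     size_t nr_pairs, uint32_t entry, uint64_t exit2)
{
	size_t i;

	for (i = 0; i < nr_pairs; i++) {
		int entry_set = (entry & pairs[i].entry_control) != 0;
		uint64_t exit_bits = exit2 & pairs[i].exit_control;
		int exit_all = exit_bits == pairs[i].exit_control;
		int exit_none = exit_bits == 0;

		/* A pair without exit controls would be meaningless. */
		assert(pairs[i].exit_control != 0);

		if ((entry_set && exit_all) || (!entry_set && exit_none))
			continue;

		return -1;
	}

	return 0;
}
```

In this model, a CPU that offers VM_ENTRY_LOAD_IA32_FRED but only one
of the two secondary exit controls is reported as inconsistent.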

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
---

Change in v5:
* Remove the pair VM_ENTRY_LOAD_IA32_FRED/VM_EXIT_ACTIVATE_SECONDARY_CONTROLS,
  since the secondary VM exit controls are unconditionally enabled anyway, and
  there are features other than FRED needing it (Chao Gao).
* Add TB from Xuelian Guo.

Change in v4:
* Do VM exit/entry consistency checks using the new macro from Sean
  Christopherson.

Changes in v3:
* Add FRED control consistency checks to the existing VM entry/exit
  consistency check framework (Sean Christopherson).
* Just do the unnecessary FRED state load/store on every VM entry/exit
  (Sean Christopherson).
---
 arch/x86/include/asm/vmx.h | 4 ++++
 arch/x86/kvm/vmx/vmx.c     | 2 ++
 arch/x86/kvm/vmx/vmx.h     | 7 +++++--
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 1f60c04d11fb..dd79d027ea70 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -109,6 +109,9 @@
 #define VM_EXIT_LOAD_CET_STATE                  0x10000000
 #define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS	0x80000000
 
+#define SECONDARY_VM_EXIT_SAVE_IA32_FRED	BIT_ULL(0)
+#define SECONDARY_VM_EXIT_LOAD_IA32_FRED	BIT_ULL(1)
+
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
 #define VM_ENTRY_LOAD_DEBUG_CONTROLS            0x00000004
@@ -122,6 +125,7 @@
 #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
 #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
 #define VM_ENTRY_LOAD_CET_STATE                 0x00100000
+#define VM_ENTRY_LOAD_IA32_FRED			0x00800000
 
 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 590e0826ba08..3b5e2805a06d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2623,6 +2623,8 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 		u32 entry_control;
 		u64 exit_control;
 	} const vmcs_entry_exit2_pairs[] = {
+		{ VM_ENTRY_LOAD_IA32_FRED,
+			SECONDARY_VM_EXIT_SAVE_IA32_FRED | SECONDARY_VM_EXIT_LOAD_IA32_FRED },
 	};
 
 	memset(vmcs_conf, 0, sizeof(*vmcs_conf));
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 840e48a2fcc5..e577af1003d8 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -488,7 +488,8 @@ static inline u8 vmx_get_rvi(void)
 	 VM_ENTRY_LOAD_BNDCFGS |					\
 	 VM_ENTRY_PT_CONCEAL_PIP |					\
 	 VM_ENTRY_LOAD_IA32_RTIT_CTL |					\
-	 VM_ENTRY_LOAD_CET_STATE)
+	 VM_ENTRY_LOAD_CET_STATE |					\
+	 VM_ENTRY_LOAD_IA32_FRED)
 
 #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS				\
 	(VM_EXIT_SAVE_DEBUG_CONTROLS |					\
@@ -515,7 +516,9 @@ static inline u8 vmx_get_rvi(void)
 	       VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
 
 #define KVM_REQUIRED_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
-#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS			\
+	     (SECONDARY_VM_EXIT_SAVE_IA32_FRED |			\
+	      SECONDARY_VM_EXIT_LOAD_IA32_FRED)
 
 #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
 	(PIN_BASED_EXT_INTR_MASK |					\
-- 
2.50.1



* [PATCH v6 03/20] KVM: VMX: Disable FRED if FRED consistency checks fail
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 01/20] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 02/20] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 04/20] x86/cea: Export an API to get per CPU exception stacks for KVM to use Xin Li (Intel)
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Do not virtualize FRED if the FRED consistency checks fail.

This can happen either on broken hardware, or when KVM runs on top of
another hypervisor that does not yet implement nested FRED correctly.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Changes in v5:
* Drop the cpu_feature_enabled() in cpu_has_vmx_fred() (Sean).
* Add TB from Xuelian Guo.

Change in v4:
* Call out the reason why FRED VM-exit controls are not checked in
  cpu_has_vmx_fred() (Chao Gao).
---
 arch/x86/kvm/vmx/capabilities.h | 10 ++++++++++
 arch/x86/kvm/vmx/vmx.c          |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 7b9e306c359d..7fe95a601c9f 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -408,6 +408,16 @@ static inline bool vmx_pebs_supported(void)
 	return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
 }
 
+static inline bool cpu_has_vmx_fred(void)
+{
+	/*
+	 * setup_vmcs_config() guarantees FRED VM-entry/exit controls
+	 * are either all set or none.  So, no need to check FRED VM-exit
+	 * controls.
+	 */
+	return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED);
+}
+
 static inline bool cpu_has_notify_vmexit(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3b5e2805a06d..c8b95c215869 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7993,6 +7993,9 @@ static __init void vmx_set_cpu_caps(void)
 		kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
 	}
 
+	if (!cpu_has_vmx_fred())
+		kvm_cpu_cap_clear(X86_FEATURE_FRED);
+
 	if (!enable_pmu)
 		kvm_cpu_cap_clear(X86_FEATURE_PDCM);
 	kvm_caps.supported_perf_cap = vmx_get_perf_capabilities();
-- 
2.50.1



* [PATCH v6 04/20] x86/cea: Export an API to get per CPU exception stacks for KVM to use
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (2 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 03/20] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-27 17:33   ` Dave Hansen
  2025-08-21 22:36 ` [PATCH v6 05/20] KVM: VMX: Initialize VMCS FRED fields Xin Li (Intel)
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

FRED introduced new fields in the host-state area of the VMCS for
stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
corresponding to per CPU exception stacks for #DB, NMI and #DF.
KVM must populate these each time a vCPU is loaded onto a CPU.

Convert the __this_cpu_ist_{top,bottom}_va() macros into real
functions and export __this_cpu_ist_top_va().
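
The address arithmetic of the new functions can be sketched in user
space as below.  This is only a model: the per-CPU base address becomes
a plain parameter, the `MODEL_*` names are hypothetical, and the page
and stack sizes are assumed values (the kernel's EXCEPTION_STKSZ
depends on the configuration).  The stride of one stack plus one page
between consecutive stacks follows the expressions in this patch.

```c
#include <assert.h>

/* Assumed sizes for the model only. */
#define MODEL_PAGE_SIZE		4096ul
#define MODEL_EXCEPTION_STKSZ	(2 * MODEL_PAGE_SIZE)

/* Same ordering as the first entries of enum exception_stack_ordering. */
enum model_estack {
	MODEL_ESTACK_DF = 0,
	MODEL_ESTACK_NMI,
	MODEL_ESTACK_DB,
	MODEL_ESTACK_MCE,
	MODEL_N_ESTACKS,
};

/* Lowest address of the given stack, counted from the base of the DF stack. */
static unsigned long model_ist_bottom_va(unsigned long df_stack_base,
					 enum model_estack stack)
{
	assert(stack < MODEL_N_ESTACKS);

	return df_stack_base +
	       stack * (MODEL_EXCEPTION_STKSZ + MODEL_PAGE_SIZE);
}

/* First address past the given stack, i.e. the initial stack pointer. */
static unsigned long model_ist_top_va(unsigned long df_stack_base,
				      enum model_estack stack)
{
	return model_ist_bottom_va(df_stack_base, stack) +
	       MODEL_EXCEPTION_STKSZ;
}
```

With a base of 0x100000 and the assumed sizes, the DF stack spans
0x100000-0x102000 and the NMI stack starts at 0x103000, one page above
the top of the DF stack.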

Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Export accessor instead of data (Christoph Hellwig).
* Add TB from Xuelian Guo.

Change in v4:
* Rewrite the change log and add comments to the export (Dave Hansen).
---
 arch/x86/coco/sev/sev-nmi.c           |  4 ++--
 arch/x86/coco/sev/vc-handle.c         |  2 +-
 arch/x86/include/asm/cpu_entry_area.h | 17 ++++-------------
 arch/x86/kernel/cpu/common.c          | 10 +++++-----
 arch/x86/kernel/fred.c                |  6 +++---
 arch/x86/kernel/traps.c               |  2 +-
 arch/x86/mm/cpu_entry_area.c          | 21 +++++++++++++++++++++
 arch/x86/mm/fault.c                   |  2 +-
 8 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/arch/x86/coco/sev/sev-nmi.c b/arch/x86/coco/sev/sev-nmi.c
index d8dfaddfb367..73e34ad7a1a9 100644
--- a/arch/x86/coco/sev/sev-nmi.c
+++ b/arch/x86/coco/sev/sev-nmi.c
@@ -30,7 +30,7 @@ static __always_inline bool on_vc_stack(struct pt_regs *regs)
 	if (ip_within_syscall_gap(regs))
 		return false;
 
-	return ((sp >= __this_cpu_ist_bottom_va(VC)) && (sp < __this_cpu_ist_top_va(VC)));
+	return ((sp >= __this_cpu_ist_bottom_va(ESTACK_VC)) && (sp < __this_cpu_ist_top_va(ESTACK_VC)));
 }
 
 /*
@@ -82,7 +82,7 @@ void noinstr __sev_es_ist_exit(void)
 	/* Read IST entry */
 	ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
 
-	if (WARN_ON(ist == __this_cpu_ist_top_va(VC)))
+	if (WARN_ON(ist == __this_cpu_ist_top_va(ESTACK_VC)))
 		return;
 
 	/* Read back old IST entry and write it to the TSS */
diff --git a/arch/x86/coco/sev/vc-handle.c b/arch/x86/coco/sev/vc-handle.c
index c3b4acbde0d8..88b6bc518a5a 100644
--- a/arch/x86/coco/sev/vc-handle.c
+++ b/arch/x86/coco/sev/vc-handle.c
@@ -859,7 +859,7 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 
 static __always_inline bool is_vc2_stack(unsigned long sp)
 {
-	return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
+	return (sp >= __this_cpu_ist_bottom_va(ESTACK_VC2) && sp < __this_cpu_ist_top_va(ESTACK_VC2));
 }
 
 static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 462fc34f1317..8e17f0ca74e6 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -46,7 +46,7 @@ struct cea_exception_stacks {
  * The exception stack ordering in [cea_]exception_stacks
  */
 enum exception_stack_ordering {
-	ESTACK_DF,
+	ESTACK_DF = 0,
 	ESTACK_NMI,
 	ESTACK_DB,
 	ESTACK_MCE,
@@ -58,18 +58,15 @@ enum exception_stack_ordering {
 #define CEA_ESTACK_SIZE(st)					\
 	sizeof(((struct cea_exception_stacks *)0)->st## _stack)
 
-#define CEA_ESTACK_BOT(ceastp, st)				\
-	((unsigned long)&(ceastp)->st## _stack)
-
-#define CEA_ESTACK_TOP(ceastp, st)				\
-	(CEA_ESTACK_BOT(ceastp, st) + CEA_ESTACK_SIZE(st))
-
 #define CEA_ESTACK_OFFS(st)					\
 	offsetof(struct cea_exception_stacks, st## _stack)
 
 #define CEA_ESTACK_PAGES					\
 	(sizeof(struct cea_exception_stacks) / PAGE_SIZE)
 
+extern unsigned long __this_cpu_ist_top_va(enum exception_stack_ordering stack);
+extern unsigned long __this_cpu_ist_bottom_va(enum exception_stack_ordering stack);
+
 #endif
 
 #ifdef CONFIG_X86_32
@@ -144,10 +141,4 @@ static __always_inline struct entry_stack *cpu_entry_stack(int cpu)
 	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
 }
 
-#define __this_cpu_ist_top_va(name)					\
-	CEA_ESTACK_TOP(__this_cpu_read(cea_exception_stacks), name)
-
-#define __this_cpu_ist_bottom_va(name)					\
-	CEA_ESTACK_BOT(__this_cpu_read(cea_exception_stacks), name)
-
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 34a054181c4d..cb14919f92da 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2307,12 +2307,12 @@ static inline void setup_getcpu(int cpu)
 static inline void tss_setup_ist(struct tss_struct *tss)
 {
 	/* Set up the per-CPU TSS IST stacks */
-	tss->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(DF);
-	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
-	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
-	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
+	tss->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(ESTACK_DF);
+	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(ESTACK_NMI);
+	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(ESTACK_DB);
+	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(ESTACK_MCE);
 	/* Only mapped when SEV-ES is active */
-	tss->x86_tss.ist[IST_INDEX_VC] = __this_cpu_ist_top_va(VC);
+	tss->x86_tss.ist[IST_INDEX_VC] = __this_cpu_ist_top_va(ESTACK_VC);
 }
 #else /* CONFIG_X86_64 */
 static inline void tss_setup_ist(struct tss_struct *tss) { }
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index 816187da3a47..06d944a3d051 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -87,7 +87,7 @@ void cpu_init_fred_rsps(void)
 	       FRED_STKLVL(X86_TRAP_DF,  FRED_DF_STACK_LEVEL));
 
 	/* The FRED equivalents to IST stacks... */
-	wrmsrq(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
-	wrmsrq(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
-	wrmsrq(MSR_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
+	wrmsrq(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(ESTACK_DB));
+	wrmsrq(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(ESTACK_NMI));
+	wrmsrq(MSR_IA32_FRED_RSP3, __this_cpu_ist_top_va(ESTACK_DF));
 }
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 36354b470590..5c9c5ebf5e73 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -954,7 +954,7 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 
 	if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
 	    info.type > STACK_TYPE_EXCEPTION_LAST)
-		sp = __this_cpu_ist_top_va(VC2);
+		sp = __this_cpu_ist_top_va(ESTACK_VC2);
 
 sync:
 	/*
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 575f863f3c75..eedaf103c8ad 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -18,6 +18,27 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
 static DEFINE_PER_CPU_PAGE_ALIGNED(struct exception_stacks, exception_stacks);
 DEFINE_PER_CPU(struct cea_exception_stacks*, cea_exception_stacks);
 
+/*
+ * FRED introduced new fields in the host-state area of the VMCS for
+ * stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
+ * corresponding to per CPU stacks for #DB, NMI and #DF.  KVM must
+ * populate these each time a vCPU is loaded onto a CPU.
+ *
+ * Called from entry code, so must be noinstr.
+ */
+noinstr unsigned long __this_cpu_ist_top_va(enum exception_stack_ordering stack)
+{
+	unsigned long base = (unsigned long)&(__this_cpu_read(cea_exception_stacks)->DF_stack);
+	return base + EXCEPTION_STKSZ + stack * (EXCEPTION_STKSZ + PAGE_SIZE);
+}
+EXPORT_SYMBOL(__this_cpu_ist_top_va);
+
+noinstr unsigned long __this_cpu_ist_bottom_va(enum exception_stack_ordering stack)
+{
+	unsigned long base = (unsigned long)&(__this_cpu_read(cea_exception_stacks)->DF_stack);
+	return base + stack * (EXCEPTION_STKSZ + PAGE_SIZE);
+}
+
 static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, _cea_offset);
 
 static __always_inline unsigned int cea_offset(unsigned int cpu)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 998bd807fc7b..1804eb86cc14 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -671,7 +671,7 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
 		 * and then double-fault, though, because we're likely to
 		 * break the console driver and lose most of the stack dump.
 		 */
-		call_on_stack(__this_cpu_ist_top_va(DF) - sizeof(void*),
+		call_on_stack(__this_cpu_ist_top_va(ESTACK_DF) - sizeof(void*),
 			      handle_stack_overflow,
 			      ASM_CALL_ARG3,
 			      , [arg1] "r" (regs), [arg2] "r" (address), [arg3] "r" (&info));
-- 
2.50.1



* [PATCH v6 05/20] KVM: VMX: Initialize VMCS FRED fields
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (3 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 04/20] x86/cea: Export an API to get per CPU exception stacks for KVM to use Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts Xin Li (Intel)
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Initialize the host VMCS FRED fields with the values of the host FRED
MSRs, and the guest VMCS FRED fields to 0.

FRED CPU state is managed in 9 new FRED MSRs:
        IA32_FRED_CONFIG,
        IA32_FRED_STKLVLS,
        IA32_FRED_RSP0,
        IA32_FRED_RSP1,
        IA32_FRED_RSP2,
        IA32_FRED_RSP3,
        IA32_FRED_SSP1,
        IA32_FRED_SSP2,
        IA32_FRED_SSP3,
as well as a few existing CPU registers and MSRs:
        CR4.FRED,
        IA32_STAR,
        IA32_KERNEL_GS_BASE,
        IA32_PL0_SSP (also known as IA32_FRED_SSP0).

KVM already manages CR4, IA32_KERNEL_GS_BASE and IA32_STAR.  All other
FRED CPU state MSRs, except IA32_FRED_RSP0 and IA32_FRED_SSP0, have
corresponding VMCS fields in both the host-state and guest-state
areas.  KVM therefore only needs to initialize those fields; with the
proper VM entry/exit FRED controls set, a FRED-capable CPU saves and
loads host and guest FRED state through the VMCS automatically.
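
As a rough illustration of the initialization described above, the
sketch below models the two VMCS areas as plain structs.  The struct and
function names are hypothetical and exist only for this example; the
real code writes each field with vmcs_write64().

```c
#include <stdint.h>

/* Hypothetical stand-in for the FRED fields of one VMCS state area. */
struct fred_vmcs_area {
	uint64_t config;
	uint64_t stklvls;
	uint64_t rsp[3];	/* RSP1..RSP3 */
	uint64_t ssp[3];	/* SSP1..SSP3 */
};

/* Hypothetical snapshot of the host values that feed the host area. */
struct host_fred_state {
	uint64_t config;	/* same on all CPUs */
	uint64_t stklvls;	/* same on all CPUs */
	uint64_t rsp[3];	/* per-CPU stack tops for levels 1..3 */
};

/* Host area: copy the host values; the SSPs stay 0 (no kernel shadow stacks). */
static void init_host_area(struct fred_vmcs_area *host,
			   const struct host_fred_state *state)
{
	host->config = state->config;
	host->stklvls = state->stklvls;
	for (int i = 0; i < 3; i++) {
		host->rsp[i] = state->rsp[i];
		host->ssp[i] = 0;
	}
}

/* Guest area: every FRED field starts out as 0. */
static void init_guest_area(struct fred_vmcs_area *guest)
{
	*guest = (struct fred_vmcs_area){ 0 };
}
```

RSP0 and SSP0 are deliberately absent from the model because they have
no VMCS fields.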

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.

Change in v4:
* Initialize host SSP[1-3] to 0s in vmx_set_constant_host_state()
  because Linux doesn't support kernel shadow stacks (Chao Gao).

Change in v3:
* Use structure kvm_host_values to keep host fred config & stack levels
  (Sean Christopherson).

Changes in v2:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() to decouple
  KVM's capability to virtualize a feature and host's enabling of a
  feature (Chao Gao).
* Move guest FRED state init into __vmx_vcpu_reset() (Chao Gao).
---
 arch/x86/include/asm/vmx.h | 32 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c     | 36 ++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.h         |  3 +++
 3 files changed, 71 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index dd79d027ea70..6f8b8947c60c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -293,12 +293,44 @@ enum vmcs_field {
 	GUEST_BNDCFGS_HIGH              = 0x00002813,
 	GUEST_IA32_RTIT_CTL		= 0x00002814,
 	GUEST_IA32_RTIT_CTL_HIGH	= 0x00002815,
+	GUEST_IA32_FRED_CONFIG		= 0x0000281a,
+	GUEST_IA32_FRED_CONFIG_HIGH	= 0x0000281b,
+	GUEST_IA32_FRED_RSP1		= 0x0000281c,
+	GUEST_IA32_FRED_RSP1_HIGH	= 0x0000281d,
+	GUEST_IA32_FRED_RSP2		= 0x0000281e,
+	GUEST_IA32_FRED_RSP2_HIGH	= 0x0000281f,
+	GUEST_IA32_FRED_RSP3		= 0x00002820,
+	GUEST_IA32_FRED_RSP3_HIGH	= 0x00002821,
+	GUEST_IA32_FRED_STKLVLS		= 0x00002822,
+	GUEST_IA32_FRED_STKLVLS_HIGH	= 0x00002823,
+	GUEST_IA32_FRED_SSP1		= 0x00002824,
+	GUEST_IA32_FRED_SSP1_HIGH	= 0x00002825,
+	GUEST_IA32_FRED_SSP2		= 0x00002826,
+	GUEST_IA32_FRED_SSP2_HIGH	= 0x00002827,
+	GUEST_IA32_FRED_SSP3		= 0x00002828,
+	GUEST_IA32_FRED_SSP3_HIGH	= 0x00002829,
 	HOST_IA32_PAT			= 0x00002c00,
 	HOST_IA32_PAT_HIGH		= 0x00002c01,
 	HOST_IA32_EFER			= 0x00002c02,
 	HOST_IA32_EFER_HIGH		= 0x00002c03,
 	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
 	HOST_IA32_PERF_GLOBAL_CTRL_HIGH	= 0x00002c05,
+	HOST_IA32_FRED_CONFIG		= 0x00002c08,
+	HOST_IA32_FRED_CONFIG_HIGH	= 0x00002c09,
+	HOST_IA32_FRED_RSP1		= 0x00002c0a,
+	HOST_IA32_FRED_RSP1_HIGH	= 0x00002c0b,
+	HOST_IA32_FRED_RSP2		= 0x00002c0c,
+	HOST_IA32_FRED_RSP2_HIGH	= 0x00002c0d,
+	HOST_IA32_FRED_RSP3		= 0x00002c0e,
+	HOST_IA32_FRED_RSP3_HIGH	= 0x00002c0f,
+	HOST_IA32_FRED_STKLVLS		= 0x00002c10,
+	HOST_IA32_FRED_STKLVLS_HIGH	= 0x00002c11,
+	HOST_IA32_FRED_SSP1		= 0x00002c12,
+	HOST_IA32_FRED_SSP1_HIGH	= 0x00002c13,
+	HOST_IA32_FRED_SSP2		= 0x00002c14,
+	HOST_IA32_FRED_SSP2_HIGH	= 0x00002c15,
+	HOST_IA32_FRED_SSP3		= 0x00002c16,
+	HOST_IA32_FRED_SSP3_HIGH	= 0x00002c17,
 	PIN_BASED_VM_EXEC_CONTROL       = 0x00004000,
 	CPU_BASED_VM_EXEC_CONTROL       = 0x00004002,
 	EXCEPTION_BITMAP                = 0x00004004,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c8b95c215869..42e179f19c23 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1460,6 +1460,15 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
 				    (unsigned long)(cpu_entry_stack(cpu) + 1));
 		}
 
+		/* Per-CPU FRED MSRs */
+		if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+#ifdef CONFIG_X86_64
+			vmcs_write64(HOST_IA32_FRED_RSP1, __this_cpu_ist_top_va(ESTACK_DB));
+			vmcs_write64(HOST_IA32_FRED_RSP2, __this_cpu_ist_top_va(ESTACK_NMI));
+			vmcs_write64(HOST_IA32_FRED_RSP3, __this_cpu_ist_top_va(ESTACK_DF));
+#endif
+		}
+
 		vmx->loaded_vmcs->cpu = cpu;
 	}
 }
@@ -4307,6 +4316,17 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	 */
 	vmcs_write16(HOST_DS_SELECTOR, 0);
 	vmcs_write16(HOST_ES_SELECTOR, 0);
+
+	if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+		/* FRED CONFIG and STKLVLS are the same on all CPUs */
+		vmcs_write64(HOST_IA32_FRED_CONFIG, kvm_host.fred_config);
+		vmcs_write64(HOST_IA32_FRED_STKLVLS, kvm_host.fred_stklvls);
+
+		/* Linux doesn't support kernel shadow stacks, thus SSPs are 0s */
+		vmcs_write64(HOST_IA32_FRED_SSP1, 0);
+		vmcs_write64(HOST_IA32_FRED_SSP2, 0);
+		vmcs_write64(HOST_IA32_FRED_SSP3, 0);
+	}
 #else
 	vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
 	vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
@@ -4824,6 +4844,17 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	}
 
 	vmx_setup_uret_msrs(vmx);
+
+	if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+		vmcs_write64(GUEST_IA32_FRED_CONFIG, 0);
+		vmcs_write64(GUEST_IA32_FRED_RSP1, 0);
+		vmcs_write64(GUEST_IA32_FRED_RSP2, 0);
+		vmcs_write64(GUEST_IA32_FRED_RSP3, 0);
+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, 0);
+		vmcs_write64(GUEST_IA32_FRED_SSP1, 0);
+		vmcs_write64(GUEST_IA32_FRED_SSP2, 0);
+		vmcs_write64(GUEST_IA32_FRED_SSP3, 0);
+	}
 }
 
 static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
@@ -8679,6 +8710,11 @@ __init int vmx_hardware_setup(void)
 
 	kvm_caps.inapplicable_quirks &= ~KVM_X86_QUIRK_IGNORE_GUEST_PAT;
 
+	if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+		rdmsrl(MSR_IA32_FRED_CONFIG, kvm_host.fred_config);
+		rdmsrl(MSR_IA32_FRED_STKLVLS, kvm_host.fred_stklvls);
+	}
+
 	return r;
 }
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d6b21ba41416..b6dc23c478ff 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -52,6 +52,9 @@ struct kvm_host_values {
 	u64 xss;
 	u64 s_cet;
 	u64 arch_capabilities;
+
+	u64 fred_config;
+	u64 fred_stklvls;
 };
 
 void kvm_spurious_fault(void);
-- 
2.50.1



* [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

On a userspace MSR filter change, set FRED MSR intercepts.

The 8 FRED MSRs MSR_IA32_FRED_RSP[123], MSR_IA32_FRED_STKLVLS,
MSR_IA32_FRED_SSP[123] and MSR_IA32_FRED_CONFIG are all safe to pass
through to the guest, because each has a pair of corresponding host
and guest VMCS fields.

MSR_IA32_FRED_RSP0 and MSR_IA32_FRED_SSP0 are used only for delivering
events that occur in userspace; IOW they are NOT used by any kernel
event delivery or by the execution of ERETS.  Thus KVM can run safely
with guest values in the two MSRs.  As a result, saving and restoring
their guest values is deferred until vCPU context switch, and their
host values are restored when the host returns to userspace.
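
The resulting intercept policy can be sketched as follows.  This is a
simplified model, not the patch's code: all FRED MSRs are collapsed
into one flag, and the struct and function names are made up for the
example.

```c
#include <stdbool.h>

/* Hypothetical model: true means guest accesses trap to KVM. */
struct fred_intercepts {
	bool fred_msrs;	/* RSP0-3, STKLVLS, SSP0-3 and CONFIG as one flag */
};

/*
 * Recompute the FRED MSR intercept setting.  If KVM cannot virtualize
 * FRED at all, the current setting is left untouched.  Otherwise the
 * MSRs are passed through exactly when the guest enumerates FRED.
 */
static void recalc_fred_intercepts(struct fred_intercepts *s,
				   bool kvm_has_fred, bool guest_has_fred)
{
	if (!kvm_has_fred)
		return;

	s->fred_msrs = !guest_has_fred;
}
```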

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Changes in v5:
* Skip execution of vmx_set_intercept_for_fred_msr() if FRED is
  not available or enabled (Sean).
* Use 'intercept' as the variable name to indicate whether MSR
  interception should be enabled (Sean).
* Add TB from Xuelian Guo.
---
 arch/x86/kvm/vmx/vmx.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 42e179f19c23..8e81230be7af 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4128,6 +4128,43 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
 	}
 }
 
+static void vmx_set_intercept_for_fred_msr(struct kvm_vcpu *vcpu)
+{
+	bool intercept = !guest_cpu_cap_has(vcpu, X86_FEATURE_FRED);
+
+	if (!kvm_cpu_cap_has(X86_FEATURE_FRED))
+		return;
+
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP1, MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP2, MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP3, MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_STKLVLS, MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP1, MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP2, MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP3, MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_CONFIG, MSR_TYPE_RW, intercept);
+
+	/*
+	 * MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP (aka MSR_IA32_FRED_SSP0) are
+	 * designated for event delivery while executing in userspace.  Since
+	 * KVM operates exclusively in kernel mode (the CPL is always 0 after
+	 * any VM exit), KVM can safely retain and operate with the guest-defined
+	 * values for MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP.
+	 *
+	 * Therefore, interception of MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP
+	 * is not required.
+	 *
+	 * Note, save and restore of MSR_IA32_PL0_SSP belong to CET supervisor
+	 * context management.  However the FRED SSP MSRs, including
+	 * MSR_IA32_PL0_SSP, are supported by any processor that enumerates FRED.
+	 * If such a processor does not support CET, FRED transitions will not
+	 * use the MSRs, but the MSRs would still be accessible using MSR-access
+	 * instructions (e.g., RDMSR, WRMSR).
+	 */
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, MSR_TYPE_RW, intercept);
+}
+
 void vmx_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
 {
 	bool intercept;
@@ -4194,6 +4231,8 @@ void vmx_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET, MSR_TYPE_RW, intercept);
 	}
 
+	vmx_set_intercept_for_fred_msr(vcpu);
+
 	/*
 	 * x2APIC and LBR MSR intercepts are modified on-demand and cannot be
 	 * filtered by userspace.
-- 
2.50.1



* [PATCH v6 07/20] KVM: VMX: Save/restore guest FRED RSP0
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Save guest FRED RSP0 in vmx_prepare_switch_to_host() and restore it
in vmx_prepare_switch_to_guest(), because MSR_IA32_FRED_RSP0 is passed
through to the guest and its hardware value is therefore volatile and
unknown to KVM.

Note, host FRED RSP0 is restored in arch_exit_to_user_mode_prepare(),
regardless of whether it is modified in KVM.
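
The flow can be modeled in a few lines.  The sketch assumes that the
exit-to-user path compares the task's RSP0 against a per-CPU cached
value and only writes the MSR on a mismatch; that is an assumption
based on the "kernel's local cache" wording in this patch, and every
name below is hypothetical.

```c
#include <stdint.h>

static uint64_t hw_fred_rsp0;		/* stand-in for the MSR itself */
static uint64_t cached_fred_rsp0;	/* stand-in for the kernel's cache */

struct vcpu_model {
	uint64_t guest_fred_rsp0;
};

/* Before running the guest: load the guest's value into hardware. */
static void switch_to_guest(const struct vcpu_model *v)
{
	hw_fred_rsp0 = v->guest_fred_rsp0;
}

/*
 * After running the guest: the guest may have changed the MSR, so read
 * it back and make the cache reflect what is really in hardware.
 */
static void switch_to_host(struct vcpu_model *v)
{
	v->guest_fred_rsp0 = hw_fred_rsp0;
	cached_fred_rsp0 = hw_fred_rsp0;
}

/* On return to userspace: restore the task's RSP0 if the cache differs. */
static void exit_to_user_mode(uint64_t task_rsp0)
{
	if (cached_fred_rsp0 != task_rsp0) {
		hw_fred_rsp0 = task_rsp0;
		cached_fred_rsp0 = task_rsp0;
	}
}
```

Without the cache update in switch_to_host(), the comparison in
exit_to_user_mode() would see the stale host value and skip the write,
leaving the guest's RSP0 in hardware.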

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Changes in v5:
* Remove the cpu_feature_enabled() check when setting/getting the guest
  MSR_IA32_FRED_RSP0, as guest_cpu_cap_has() should suffice (Sean).
* Add a comment when synchronizing current MSR_IA32_FRED_RSP0 MSR to
  the kernel's local cache, because its handling is different from
  the MSR_KERNEL_GS_BASE handling (Sean).
* Add TB from Xuelian Guo.

Changes in v3:
* KVM only needs to save/restore guest FRED RSP0 now as host FRED RSP0
  is restored in arch_exit_to_user_mode_prepare() (Sean Christopherson).

Changes in v2:
* Don't use guest_cpuid_has() in vmx_prepare_switch_to_{host,guest}(),
  which are called from IRQ-disabled context (Chao Gao).
* Reset msr_guest_fred_rsp0 in __vmx_vcpu_reset() (Chao Gao).
---
 arch/x86/kvm/vmx/vmx.c | 14 ++++++++++++++
 arch/x86/kvm/vmx/vmx.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8e81230be7af..714de55f4e8b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1293,6 +1293,10 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	}
 
 	wrmsrq(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+
+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+		wrmsrns(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
+
 #else
 	savesegment(fs, fs_sel);
 	savesegment(gs, gs_sel);
@@ -1337,6 +1341,16 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
 	invalidate_tss_limit();
 #ifdef CONFIG_X86_64
 	wrmsrq(MSR_KERNEL_GS_BASE, vmx->vt.msr_host_kernel_gs_base);
+
+	if (guest_cpu_cap_has(&vmx->vcpu, X86_FEATURE_FRED)) {
+		vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
+		/*
+		 * Synchronize the current value in hardware to the kernel's
+		 * local cache.  The desired host RSP0 will be set when the
+		 * CPU exits to userspace (RSP0 is a per-task value).
+		 */
+		fred_sync_rsp0(vmx->msr_guest_fred_rsp0);
+	}
 #endif
 	load_fixmap_gdt(raw_smp_processor_id());
 	vmx->vt.guest_state_loaded = false;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e577af1003d8..733fa2ef4bea 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -227,6 +227,7 @@ struct vcpu_vmx {
 	bool                  guest_uret_msrs_loaded;
 #ifdef CONFIG_X86_64
 	u64		      msr_guest_kernel_gs_base;
+	u64		      msr_guest_fred_rsp0;
 #endif
 
 	u64		      spec_ctrl;
-- 
2.50.1



* [PATCH v6 08/20] KVM: VMX: Add support for FRED context save/restore
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Handle FRED MSR access requests, allowing the FRED context to be read
and written by both the host and the guest.

During VM save/restore and live migration, the FRED context needs to
be saved and restored, which requires the FRED MSRs to be accessible
from userspace, e.g., QEMU.
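
For the MSRs backed by guest VMCS fields, an access reduces to a table
lookup from MSR index to field encoding.  The standalone sketch below
shows the idea.  The field encodings are the guest-state values added
to vmx.h earlier in this series; the two MSR indices are assumed
values, and the out-of-range check is an addition for this example
only.

```c
#include <assert.h>
#include <stdint.h>

/* MSR indices are assumptions for this sketch; check msr-index.h. */
#define MSR_IA32_FRED_RSP1	0x1cd
#define MSR_IA32_FRED_CONFIG	0x1d4

/* Guest-state VMCS field encodings, ordered by MSR index. */
static const uint32_t fred_msr_vmcs_fields[] = {
	0x281c,	/* GUEST_IA32_FRED_RSP1 */
	0x281e,	/* GUEST_IA32_FRED_RSP2 */
	0x2820,	/* GUEST_IA32_FRED_RSP3 */
	0x2822,	/* GUEST_IA32_FRED_STKLVLS */
	0x2824,	/* GUEST_IA32_FRED_SSP1 */
	0x2826,	/* GUEST_IA32_FRED_SSP2 */
	0x2828,	/* GUEST_IA32_FRED_SSP3 */
	0x281a,	/* GUEST_IA32_FRED_CONFIG */
};

/* The table must cover the contiguous MSR range RSP1..CONFIG. */
static_assert(MSR_IA32_FRED_CONFIG - MSR_IA32_FRED_RSP1 ==
	      sizeof(fred_msr_vmcs_fields) / sizeof(fred_msr_vmcs_fields[0]) - 1,
	      "table must cover RSP1..CONFIG");

/* Returns the VMCS field for a FRED MSR, or 0 if the MSR is out of range. */
static uint32_t fred_msr_to_vmcs(uint32_t msr)
{
	if (msr < MSR_IA32_FRED_RSP1 || msr > MSR_IA32_FRED_CONFIG)
		return 0;

	return fred_msr_vmcs_fields[msr - MSR_IA32_FRED_RSP1];
}
```

RSP0 is not in the table because it has no VMCS field; it is read and
written through the cached guest value instead.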

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v6:
* Return KVM_MSR_RET_UNSUPPORTED instead of 1 when FRED is not available
  (Chao Gao)
* Handle MSR_IA32_PL0_SSP when FRED is enumerated but CET is not.

Change in v5:
* Use the newly added guest MSR read/write helpers (Sean).
* Check the size of fred_msr_vmcs_fields[] using static_assert() (Sean).
* Rewrite setting FRED MSRs to make it much easier to read (Sean).
* Add TB from Xuelian Guo.

Changes since v2:
* Add a helper to convert FRED MSR index to VMCS field encoding to
  make the code more compact (Chao Gao).
* Get rid of the "host_initiated" check because userspace has to set
  CPUID before MSRs (Chao Gao & Sean Christopherson).
* Address a few cleanup comments (Sean Christopherson).

Changes since v1:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
* Fail host-requested FRED MSR accesses if KVM cannot virtualize FRED
  (Chao Gao).
* Handle the case FRED MSRs are valid but KVM cannot virtualize FRED
  (Chao Gao).
* Add sanity checks when writing to FRED MSRs.
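
The reserved-bit part of the write sanity checks can be sketched in
isolation as below.  The masks follow the ones used in this patch; the
enum and function names are hypothetical, and the canonical-address
check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of the FRED MSRs that carry reserved bits. */
enum fred_msr_kind {
	FRED_MSR_CONFIG,
	FRED_MSR_RSP,	/* RSP0..RSP3 */
	FRED_MSR_SSP,	/* SSP1..SSP3 */
};

/* Bits that must be zero in a value written to the given kind of MSR. */
static uint64_t fred_reserved_bits(enum fred_msr_kind kind)
{
	switch (kind) {
	case FRED_MSR_CONFIG:
		/* bit 11, bits 5:4 and bit 2 */
		return (UINT64_C(1) << 11) | (UINT64_C(3) << 4) | (UINT64_C(1) << 2);
	case FRED_MSR_RSP:
		return UINT64_C(0x3f);	/* bits 5:0 */
	case FRED_MSR_SSP:
		return UINT64_C(0x7);	/* bits 2:0 */
	}
	return ~UINT64_C(0);	/* unknown kind: reject everything nonzero */
}

/* A write is acceptable only if it sets no reserved bit. */
static bool fred_msr_value_ok(enum fred_msr_kind kind, uint64_t data)
{
	return (data & fred_reserved_bits(kind)) == 0;
}
```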
---
 arch/x86/kvm/vmx/vmx.c | 45 +++++++++++++++++++++++++++
 arch/x86/kvm/x86.c     | 69 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 714de55f4e8b..225c4638ffd7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1388,6 +1388,18 @@ static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
 	vmx_write_guest_host_msr(vmx, MSR_KERNEL_GS_BASE, data,
 				 &vmx->msr_guest_kernel_gs_base);
 }
+
+static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx)
+{
+	return vmx_read_guest_host_msr(vmx, MSR_IA32_FRED_RSP0,
+				       &vmx->msr_guest_fred_rsp0);
+}
+
+static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
+{
+	vmx_write_guest_host_msr(vmx, MSR_IA32_FRED_RSP0, data,
+				 &vmx->msr_guest_fred_rsp0);
+}
 #endif
 
 static void grow_ple_window(struct kvm_vcpu *vcpu)
@@ -1989,6 +2001,27 @@ int vmx_get_feature_msr(u32 msr, u64 *data)
 	}
 }
 
+#ifdef CONFIG_X86_64
+static const u32 fred_msr_vmcs_fields[] = {
+	GUEST_IA32_FRED_RSP1,
+	GUEST_IA32_FRED_RSP2,
+	GUEST_IA32_FRED_RSP3,
+	GUEST_IA32_FRED_STKLVLS,
+	GUEST_IA32_FRED_SSP1,
+	GUEST_IA32_FRED_SSP2,
+	GUEST_IA32_FRED_SSP3,
+	GUEST_IA32_FRED_CONFIG,
+};
+
+static_assert(MSR_IA32_FRED_CONFIG - MSR_IA32_FRED_RSP1 ==
+	      ARRAY_SIZE(fred_msr_vmcs_fields) - 1);
+
+static u32 fred_msr_to_vmcs(u32 msr)
+{
+	return fred_msr_vmcs_fields[msr - MSR_IA32_FRED_RSP1];
+}
+#endif
+
 /*
  * Reads an msr value (of 'msr_info->index') into 'msr_info->data'.
  * Returns 0 on success, non-0 otherwise.
@@ -2011,6 +2044,12 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KERNEL_GS_BASE:
 		msr_info->data = vmx_read_guest_kernel_gs_base(vmx);
 		break;
+	case MSR_IA32_FRED_RSP0:
+		msr_info->data = vmx_read_guest_fred_rsp0(vmx);
+		break;
+	case MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG:
+		msr_info->data = vmcs_read64(fred_msr_to_vmcs(msr_info->index));
+		break;
 #endif
 	case MSR_EFER:
 		return kvm_get_msr_common(vcpu, msr_info);
@@ -2243,6 +2282,12 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			vmx_update_exception_bitmap(vcpu);
 		}
 		break;
+	case MSR_IA32_FRED_RSP0:
+		vmx_write_guest_fred_rsp0(vmx, data);
+		break;
+	case MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG:
+		vmcs_write64(fred_msr_to_vmcs(msr_index), data);
+		break;
 #endif
 	case MSR_IA32_SYSENTER_CS:
 		if (is_guest_mode(vcpu))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9930678f5a3b..1f9a09b34742 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -329,6 +329,9 @@ static const u32 msrs_to_save_base[] = {
 	MSR_STAR,
 #ifdef CONFIG_X86_64
 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
+	MSR_IA32_FRED_RSP0, MSR_IA32_FRED_RSP1, MSR_IA32_FRED_RSP2,
+	MSR_IA32_FRED_RSP3, MSR_IA32_FRED_STKLVLS, MSR_IA32_FRED_SSP1,
+	MSR_IA32_FRED_SSP2, MSR_IA32_FRED_SSP3, MSR_IA32_FRED_CONFIG,
 #endif
 	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
 	MSR_IA32_FEAT_CTL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
@@ -1910,7 +1913,7 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
 		 * architecture. Intercepting XRSTORS/XSAVES for this
 		 * special case isn't deemed worthwhile.
 		 */
-	case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
+	case MSR_IA32_PL1_SSP ... MSR_IA32_INT_SSP_TAB:
 		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK))
 			return KVM_MSR_RET_UNSUPPORTED;
 		/*
@@ -1925,6 +1928,48 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
 		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
 			return 1;
 		break;
+	case MSR_IA32_FRED_STKLVLS:
+		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+			return KVM_MSR_RET_UNSUPPORTED;
+		break;
+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_RSP3:
+	case MSR_IA32_FRED_SSP1 ... MSR_IA32_FRED_CONFIG:
+		u64 reserved_bits = 0;
+
+		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+			return KVM_MSR_RET_UNSUPPORTED;
+
+		if (is_noncanonical_msr_address(data, vcpu))
+			return 1;
+
+		switch (index) {
+		case MSR_IA32_FRED_CONFIG:
+			reserved_bits = BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2);
+			break;
+		case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_RSP3:
+			reserved_bits = GENMASK_ULL(5, 0);
+			break;
+		case MSR_IA32_FRED_SSP1 ... MSR_IA32_FRED_SSP3:
+			reserved_bits = GENMASK_ULL(2, 0);
+			break;
+		default:
+			WARN_ON_ONCE(1);
+			return 1;
+		}
+		if (data & reserved_bits)
+			return 1;
+		break;
+	case MSR_IA32_PL0_SSP: /* I.e., MSR_IA32_FRED_SSP0 */
+		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) &&
+		    !guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+			return KVM_MSR_RET_UNSUPPORTED;
+
+		if (is_noncanonical_msr_address(data, vcpu))
+			return 1;
+
+		if (!IS_ALIGNED(data, 4))
+			return 1;
+		break;
 	}
 
 	msr.data = data;
@@ -1979,10 +2024,19 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
 		if (!host_initiated)
 			return 1;
 		fallthrough;
-	case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
+	case MSR_IA32_PL1_SSP ... MSR_IA32_INT_SSP_TAB:
 		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK))
 			return KVM_MSR_RET_UNSUPPORTED;
 		break;
+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+			return KVM_MSR_RET_UNSUPPORTED;
+		break;
+	case MSR_IA32_PL0_SSP: /* I.e., MSR_IA32_FRED_SSP0 */
+		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) &&
+		    !guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+			return KVM_MSR_RET_UNSUPPORTED;
+		break;
 	}
 
 	msr.index = index;
@@ -7603,10 +7657,19 @@ static void kvm_probe_msr_to_save(u32 msr_index)
 		if (!kvm_cpu_cap_has(X86_FEATURE_LM))
 			return;
 		fallthrough;
-	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+	case MSR_IA32_PL1_SSP ... MSR_IA32_PL3_SSP:
 		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
 			return;
 		break;
+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+		if (!kvm_cpu_cap_has(X86_FEATURE_FRED))
+			return;
+		break;
+	case MSR_IA32_PL0_SSP: /* I.e., MSR_IA32_FRED_SSP0 */
+		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+		    !kvm_cpu_cap_has(X86_FEATURE_FRED))
+			return;
+		break;
 	default:
 		break;
 	}
-- 
2.50.1



* [PATCH v6 09/20] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: removed the "kvm_" prefix from the function name ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.
---
 arch/x86/kvm/kvm_cache_regs.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 36a8786db291..31b446b6cbd7 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -204,6 +204,21 @@ static __always_inline bool kvm_is_cr4_bit_set(struct kvm_vcpu *vcpu,
 	return !!kvm_read_cr4_bits(vcpu, cr4_bit);
 }
 
+/*
+ * It's enough to check just CR4.FRED (X86_CR4_FRED) to tell if
+ * a vCPU is running with FRED enabled, because:
+ * 1) CR4.FRED can be set to 1 only _after_ IA32_EFER.LMA = 1.
+ * 2) To leave IA-32e mode, CR4.FRED must be cleared first.
+ */
+static inline bool is_fred_enabled(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_X86_64
+	return kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED);
+#else
+	return false;
+#endif
+}
+
 static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
 {
 	if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
-- 
2.50.1



* [PATCH v6 10/20] KVM: VMX: Virtualize FRED event_data
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

When injecting a #PF, a #DB, or a #NM caused by extended feature
disable using FRED event delivery, set the injected-event data.  Also
save the original-event data so it can be used as injected-event data.

Unlike IDT event delivery, which uses extra CPU registers as part of
an event context (e.g., %cr2 for #PF), FRED saves a complete event
context in its stack frame, e.g., FRED saves the faulting linear
address of a #PF into the event data field defined in its stack frame.

Thus a new VMX control field called injected-event data is added
to provide the event data that will be pushed into a FRED stack
frame for VM entries that inject an event using FRED event delivery.
In addition, a new VM exit information field called original-event
data is added to store the event data that would have been saved into
a FRED stack frame for VM exits that occur during FRED event delivery.
After such a VM exit is handled to allow the original event to be
delivered, the data in the original-event data VMCS field needs to
be copied into the injected-event data VMCS field for the injection of
the original event.
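
How the event data is derived from an exception payload can be sketched
as a pure function.  This mirrors the per-vector handling in this
patch; the function name is hypothetical and the vector numbers are the
architectural ones.

```c
#include <stdint.h>

enum {
	DB_VECTOR = 1,	/* debug exception */
	NM_VECTOR = 7,	/* device not available */
	PF_VECTOR = 14,	/* page fault */
};

/*
 * Compute the FRED event data for a queued exception:
 *  - #DB: the payload with bit 12 cleared
 *  - #NM: the payload (0 if not caused by extended feature disable)
 *  - #PF: the payload, i.e. the faulting linear address
 *  - any other vector: 0
 */
static uint64_t fred_event_data(unsigned int vector, uint64_t payload)
{
	switch (vector) {
	case DB_VECTOR:
		return payload & ~(UINT64_C(1) << 12);
	case NM_VECTOR:
	case PF_VECTOR:
		return payload;
	default:
		return 0;
	}
}
```

For a re-injected event, the value comes from the original-event data
field instead of being recomputed.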

Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: reworked event data injection for nested ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.

Change in v3:
* Rework event data injection for nested (Chao Gao & Sean Christopherson).

Changes in v2:
* Document event data should be equal to CR2/DR6/IA32_XFD_ERR instead
  of using WARN_ON() (Chao Gao).
* Zero event data if a #NM was not caused by extended feature disable
  (Chao Gao).
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/include/asm/vmx.h      |  4 ++++
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/vmx/vmx.c          | 22 ++++++++++++++++++----
 arch/x86/kvm/x86.c              | 16 +++++++++++++++-
 5 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 061c0cd73d39..dce6471194f7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -759,6 +759,7 @@ struct kvm_queued_exception {
 	u32 error_code;
 	unsigned long payload;
 	bool has_payload;
+	u64 event_data;
 };
 
 /*
@@ -2222,7 +2223,7 @@ void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
-			   bool has_error_code, u32 error_code);
+			   bool has_error_code, u32 error_code, u64 event_data);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
 void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 				    struct x86_exception *fault);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 6f8b8947c60c..539af190ad3e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -269,8 +269,12 @@ enum vmcs_field {
 	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	SECONDARY_VM_EXIT_CONTROLS	= 0x00002044,
 	SECONDARY_VM_EXIT_CONTROLS_HIGH	= 0x00002045,
+	INJECTED_EVENT_DATA		= 0x00002052,
+	INJECTED_EVENT_DATA_HIGH	= 0x00002053,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
+	ORIGINAL_EVENT_DATA		= 0x00002404,
+	ORIGINAL_EVENT_DATA_HIGH	= 0x00002405,
 	VMCS_LINK_POINTER               = 0x00002800,
 	VMCS_LINK_POINTER_HIGH          = 0x00002801,
 	GUEST_IA32_DEBUGCTL             = 0x00002802,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0e2c60466797..72f54befd0d0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4144,7 +4144,7 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 
 		kvm_requeue_exception(vcpu, vector,
 				      exitintinfo & SVM_EXITINTINFO_VALID_ERR,
-				      error_code);
+				      error_code, 0);
 		break;
 	}
 	case SVM_EXITINTINFO_TYPE_INTR:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 225c4638ffd7..e1eb55fb3fb8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1862,6 +1862,9 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
 
+	if (is_fred_enabled(vcpu))
+		vmcs_write64(INJECTED_EVENT_DATA, ex->event_data);
+
 	vmx_clear_hlt(vcpu);
 }
 
@@ -7262,7 +7265,8 @@ static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
 static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 				      u32 idt_vectoring_info,
 				      int instr_len_field,
-				      int error_code_field)
+				      int error_code_field,
+				      int event_data_field)
 {
 	u8 vector;
 	int type;
@@ -7297,13 +7301,17 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 		fallthrough;
 	case INTR_TYPE_HARD_EXCEPTION: {
 		u32 error_code = 0;
+		u64 event_data = 0;
 
 		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
 			error_code = vmcs_read32(error_code_field);
+		if (is_fred_enabled(vcpu))
+			event_data = vmcs_read64(event_data_field);
 
 		kvm_requeue_exception(vcpu, vector,
 				      idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
-				      error_code);
+				      error_code,
+				      event_data);
 		break;
 	}
 	case INTR_TYPE_SOFT_INTR:
@@ -7321,7 +7329,8 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
 	__vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
-				  IDT_VECTORING_ERROR_CODE);
+				  IDT_VECTORING_ERROR_CODE,
+				  ORIGINAL_EVENT_DATA);
 }
 
 void vmx_cancel_injection(struct kvm_vcpu *vcpu)
@@ -7329,7 +7338,8 @@ void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 	__vmx_complete_interrupts(vcpu,
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,
-				  VM_ENTRY_EXCEPTION_ERROR_CODE);
+				  VM_ENTRY_EXCEPTION_ERROR_CODE,
+				  INJECTED_EVENT_DATA);
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
@@ -7483,6 +7493,10 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
 	vmx_disable_fb_clear(vmx);
 
+	/*
+	 * Note, even though FRED delivers the faulting linear address via the
+	 * event data field on the stack, CR2 is still updated.
+	 */
 	if (vcpu->arch.cr2 != native_read_cr2())
 		native_write_cr2(vcpu->arch.cr2);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1f9a09b34742..f082255852a9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -807,9 +807,22 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
 		 * breakpoint), it is reserved and must be zero in DR6.
 		 */
 		vcpu->arch.dr6 &= ~BIT(12);
+
+		/*
+		 * FRED #DB event data matches DR6, but follows the polarity of
+		 * VMX's pending debug exceptions, not DR6.
+		 */
+		ex->event_data = ex->payload & ~BIT(12);
+		break;
+	case NM_VECTOR:
+		ex->event_data = ex->payload;
 		break;
 	case PF_VECTOR:
 		vcpu->arch.cr2 = ex->payload;
+		ex->event_data = ex->payload;
+		break;
+	default:
+		ex->event_data = 0;
 		break;
 	}
 
@@ -917,7 +930,7 @@ static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
 }
 
 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
-			   bool has_error_code, u32 error_code)
+			   bool has_error_code, u32 error_code, u64 event_data)
 {
 
 	/*
@@ -942,6 +955,7 @@ void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
 	vcpu->arch.exception.error_code = error_code;
 	vcpu->arch.exception.has_payload = false;
 	vcpu->arch.exception.payload = 0;
+	vcpu->arch.exception.event_data = event_data;
 }
 EXPORT_SYMBOL_GPL(kvm_requeue_exception);
 
-- 
2.50.1



* [PATCH v6 11/20] KVM: VMX: Virtualize FRED nested exception tracking
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (9 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 10/20] KVM: VMX: Virtualize FRED event_data Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 12/20] KVM: x86: Save/restore the nested flag of an exception Xin Li (Intel)
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Set the VMX nested exception bit in the VM-entry interruption-information
field when injecting a nested exception using FRED event delivery to
ensure that:
  1) The nested exception is injected on the correct stack level.
  2) The nested bit defined in the FRED stack frame is set.
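
As an illustration only (not part of the patch), the bit arithmetic behind
the injection can be sketched in user-space C. The nested-exception bit and
both reserved masks are taken from the vmx.h hunk below; the vector mask,
the hard-exception type encoding (3 << 8) and the helper name
build_hard_exception_intr_info() are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the vmx.h hunk in this patch. */
#define INTR_INFO_DELIVER_CODE_MASK     0x800u       /* bit 11 */
#define INTR_INFO_NESTED_EXCEPTION_MASK 0x2000u      /* bit 13 */
#define INTR_INFO_VALID_MASK            0x80000000u  /* bit 31 */
#define INTR_INFO_RESVD_BITS_OLD        0x7ffff000u
#define INTR_INFO_RESVD_BITS_NEW        0x7fffd000u

/* Assumed for this sketch: not shown in the hunk. */
#define INTR_INFO_VECTOR_MASK           0xffu        /* bits 7:0 */
#define INTR_TYPE_HARD_EXCEPTION        (3u << 8)    /* type field, bits 10:8 */

/*
 * Hypothetical helper: compose the interruption-information word for a
 * hardware exception, setting the nested bit only when FRED is enabled.
 */
static uint32_t build_hard_exception_intr_info(uint8_t vector,
					       bool has_error_code,
					       bool nested, bool fred_enabled)
{
	uint32_t intr_info = (vector & INTR_INFO_VECTOR_MASK) |
			     INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK;

	if (has_error_code)
		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
	if (nested && fred_enabled)
		intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;

	return intr_info;
}

/* Sanity check: the new reserved mask is the old one minus bit 13. */
static void check_reserved_mask(void)
{
	assert((INTR_INFO_RESVD_BITS_OLD & ~INTR_INFO_NESTED_EXCEPTION_MASK) ==
	       INTR_INFO_RESVD_BITS_NEW);
}
```

For a nested #PF (vector 14) with an error code, the sketch yields a word
with bits 31, 13, 11, the type field and the vector set, and none of those
bits falls inside the new reserved mask.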

The event stack level used by FRED event delivery depends on whether
the event was a nested exception encountered during delivery of an
earlier event, because a nested exception is "regarded" as happening
in ring 0.  E.g., when #PF is configured to use stack level 1 in the
IA32_FRED_STKLVLS MSR:
  - a nested #PF is delivered on the stack pointed to by the
    IA32_FRED_RSP1 MSR when encountered in ring 3 or ring 0.
  - a normal #PF is delivered on the stack pointed to by the
    IA32_FRED_RSP0 MSR when encountered in ring 3.

The VMX nested-exception support ensures a correct event stack level is
chosen when a VM entry injects a nested exception.
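
The stack level selection described above can be modelled roughly as
follows. This is a simplified sketch based on one reading of the FRED
specification, not code from this series: it only covers exceptions with a
vector below 32, assumes that IA32_FRED_STKLVLS holds a two-bit level per
vector at bits 2v+1:2v, and the function name fred_new_stack_level() is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DF_VECTOR 8u
#define PF_VECTOR 14u

/*
 * Simplified model (assumption): an event from ring 3 that is neither a
 * nested exception nor #DF always uses stack level 0.  Otherwise the level
 * configured for the vector is used, but the stack level never decreases
 * below the current stack level (csl).
 */
static unsigned int fred_new_stack_level(uint64_t stklvls, unsigned int vector,
					 unsigned int csl, bool from_ring3,
					 bool nested)
{
	unsigned int level;

	assert(vector < 32);

	if (from_ring3 && !nested && vector != DF_VECTOR)
		return 0;

	level = (stklvls >> (2 * vector)) & 0x3;
	return level > csl ? level : csl;
}
```

With #PF configured for stack level 1, a normal #PF from ring 3 stays on
level 0 in this model, while a nested #PF from ring 3 moves to level 1,
which matches the example in the changelog.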

Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: reworked kvm_requeue_exception() to simplify the code changes ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.

Change in v4:
* Move the is_fred_enabled() check from kvm_multiple_exception() to
  vmx_inject_exception() to avoid bleeding FRED details into
  kvm_multiple_exception() (Chao Gao).

Change in v3:
* Rework kvm_requeue_exception() to simplify the code changes (Sean
  Christopherson).

Change in v2:
* Set the nested flag when there is an original interrupt (Chao Gao).
---
 arch/x86/include/asm/kvm_host.h |  4 +++-
 arch/x86/include/asm/vmx.h      |  5 ++++-
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  6 +++++-
 arch/x86/kvm/x86.c              | 13 ++++++++++++-
 arch/x86/kvm/x86.h              |  1 +
 6 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dce6471194f7..6299c43dfbee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -759,6 +759,7 @@ struct kvm_queued_exception {
 	u32 error_code;
 	unsigned long payload;
 	bool has_payload;
+	bool nested;
 	u64 event_data;
 };
 
@@ -2223,7 +2224,8 @@ void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
-			   bool has_error_code, u32 error_code, u64 event_data);
+			   bool has_error_code, u32 error_code, bool nested,
+			   u64 event_data);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
 void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 				    struct x86_exception *fault);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 539af190ad3e..7b34a9357b28 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -140,6 +140,7 @@
 #define VMX_BASIC_INOUT				BIT_ULL(54)
 #define VMX_BASIC_TRUE_CTLS			BIT_ULL(55)
 #define VMX_BASIC_NO_HW_ERROR_CODE_CC		BIT_ULL(56)
+#define VMX_BASIC_NESTED_EXCEPTION		BIT_ULL(58)
 
 static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
 {
@@ -442,13 +443,15 @@ enum vmcs_field {
 #define INTR_INFO_INTR_TYPE_MASK        0x700           /* 10:8 */
 #define INTR_INFO_DELIVER_CODE_MASK     0x800           /* 11 */
 #define INTR_INFO_UNBLOCK_NMI		0x1000		/* 12 */
+#define INTR_INFO_NESTED_EXCEPTION_MASK	0x2000		/* 13 */
 #define INTR_INFO_VALID_MASK            0x80000000      /* 31 */
-#define INTR_INFO_RESVD_BITS_MASK       0x7ffff000
+#define INTR_INFO_RESVD_BITS_MASK       0x7fffd000
 
 #define VECTORING_INFO_VECTOR_MASK           	INTR_INFO_VECTOR_MASK
 #define VECTORING_INFO_TYPE_MASK        	INTR_INFO_INTR_TYPE_MASK
 #define VECTORING_INFO_DELIVER_CODE_MASK    	INTR_INFO_DELIVER_CODE_MASK
 #define VECTORING_INFO_VALID_MASK       	INTR_INFO_VALID_MASK
+#define VECTORING_INFO_NESTED_EXCEPTION_MASK	INTR_INFO_NESTED_EXCEPTION_MASK
 
 #define INTR_TYPE_EXT_INTR		(EVENT_TYPE_EXTINT << 8)	/* external interrupt */
 #define INTR_TYPE_RESERVED		(EVENT_TYPE_RESERVED << 8)	/* reserved */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 72f54befd0d0..06961098de42 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4144,7 +4144,7 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 
 		kvm_requeue_exception(vcpu, vector,
 				      exitintinfo & SVM_EXITINTINFO_VALID_ERR,
-				      error_code, 0);
+				      error_code, false, 0);
 		break;
 	}
 	case SVM_EXITINTINFO_TYPE_INTR:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e1eb55fb3fb8..7a7856f06f98 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1857,8 +1857,11 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
 		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
 			     vmx->vcpu.arch.event_exit_inst_len);
 		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
-	} else
+	} else {
 		intr_info |= INTR_TYPE_HARD_EXCEPTION;
+		if (ex->nested && is_fred_enabled(vcpu))
+			intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
+	}
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
 
@@ -7311,6 +7314,7 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 		kvm_requeue_exception(vcpu, vector,
 				      idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
 				      error_code,
+				      idt_vectoring_info & VECTORING_INFO_NESTED_EXCEPTION_MASK,
 				      event_data);
 		break;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f082255852a9..fbbfa600e2c2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -871,6 +871,10 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
 		vcpu->arch.exception.pending = true;
 		vcpu->arch.exception.injected = false;
 
+		vcpu->arch.exception.nested = vcpu->arch.exception.nested ||
+					      vcpu->arch.nmi_injected ||
+					      vcpu->arch.interrupt.injected;
+
 		vcpu->arch.exception.has_error_code = has_error;
 		vcpu->arch.exception.vector = nr;
 		vcpu->arch.exception.error_code = error_code;
@@ -900,8 +904,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
 		vcpu->arch.exception.injected = false;
 		vcpu->arch.exception.pending = false;
 
+		/* #DF is NOT a nested event, per its definition. */
+		vcpu->arch.exception.nested = false;
+
 		kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
 	} else {
+		vcpu->arch.exception.nested = true;
+
 		/* replace previous exception with a new one in a hope
 		   that instruction re-execution will regenerate lost
 		   exception */
@@ -930,7 +939,8 @@ static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
 }
 
 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
-			   bool has_error_code, u32 error_code, u64 event_data)
+			   bool has_error_code, u32 error_code, bool nested,
+			   u64 event_data)
 {
 
 	/*
@@ -955,6 +965,7 @@ void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
 	vcpu->arch.exception.error_code = error_code;
 	vcpu->arch.exception.has_payload = false;
 	vcpu->arch.exception.payload = 0;
+	vcpu->arch.exception.nested = nested;
 	vcpu->arch.exception.event_data = event_data;
 }
 EXPORT_SYMBOL_GPL(kvm_requeue_exception);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index b6dc23c478ff..685eb710b1f2 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -198,6 +198,7 @@ static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.exception.pending = false;
 	vcpu->arch.exception.injected = false;
+	vcpu->arch.exception.nested = false;
 	vcpu->arch.exception_vmexit.pending = false;
 }
 
-- 
2.50.1



* [PATCH v6 12/20] KVM: x86: Save/restore the nested flag of an exception
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (10 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 11/20] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 13/20] KVM: x86: Mark CR4.FRED as not reserved Xin Li (Intel)
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

Save/restore the nested flag of an exception during VM save/restore
and live migration to ensure a correct event stack level is chosen
when a nested exception is injected through FRED event delivery.
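
For illustration, the flag handling added to the KVM_SET_VCPU_EVENTS path
can be modelled in stand-alone C. The flag value is taken from the uapi
hunk below; struct vcpu_events_model and sanitize_nested_flag() are
hypothetical names used only in this sketch, which mirrors the new checks
and nothing else in the ioctl.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Value from the uapi hunk in this patch. */
#define KVM_VCPUEVENT_VALID_NESTED_FLAG 0x00000040u

/* Hypothetical, cut-down stand-in for struct kvm_vcpu_events. */
struct vcpu_events_model {
	uint32_t flags;
	uint8_t exception_is_nested;
};

/*
 * Model of the new check: user space may only mark the nested flag as
 * valid if the capability was enabled on the VM; if the flag is not
 * marked valid, whatever is in exception_is_nested is ignored.
 */
static int sanitize_nested_flag(struct vcpu_events_model *events,
				bool cap_enabled)
{
	if (events->flags & KVM_VCPUEVENT_VALID_NESTED_FLAG) {
		if (!cap_enabled)
			return -EINVAL;
	} else {
		events->exception_is_nested = 0;
	}

	return 0;
}
```

In this model a VMM that never enables the capability keeps the old
behavior: the field is cleared on restore and the exception is requeued
as non-nested.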

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.

Change in v4:
* Add live migration support for exception nested flag (Chao Gao).
---
 Documentation/virt/kvm/api.rst  | 21 ++++++++++++++++++++-
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/include/uapi/asm/kvm.h |  4 +++-
 arch/x86/kvm/x86.c              | 19 ++++++++++++++++++-
 include/uapi/linux/kvm.h        |  1 +
 5 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 6aa40ee05a4a..c496b0883a7f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1184,6 +1184,10 @@ The following bits are defined in the flags field:
   fields contain a valid state. This bit will be set whenever
   KVM_CAP_EXCEPTION_PAYLOAD is enabled.
 
+- KVM_VCPUEVENT_VALID_NESTED_FLAG may be set to inform that the
+  exception is a nested exception. This bit will be set whenever
+  KVM_CAP_EXCEPTION_NESTED_FLAG is enabled.
+
 - KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
   triple_fault_pending field contains a valid state. This bit will
   be set whenever KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled.
@@ -1283,6 +1287,10 @@ can be set in the flags field to signal that the
 exception_has_payload, exception_payload, and exception.pending fields
 contain a valid state and shall be written into the VCPU.
 
+If KVM_CAP_EXCEPTION_NESTED_FLAG is enabled, KVM_VCPUEVENT_VALID_NESTED_FLAG
+can be set in the flags field to inform that the exception is a nested
+exception and exception_is_nested shall be written into the VCPU.
+
 If KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled, KVM_VCPUEVENT_VALID_TRIPLE_FAULT
 can be set in flags field to signal that the triple_fault field contains
 a valid state and shall be written into the VCPU.
@@ -8651,7 +8659,7 @@ given VM.
 When this capability is enabled, KVM resets the VCPU when setting
 MP_STATE_INIT_RECEIVED through IOCTL.  The original MP_STATE is preserved.
 
-7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
+7.44 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
 -------------------------------------------
 
 :Architectures: arm64
@@ -8662,6 +8670,17 @@ This capability indicate to the userspace whether a PFNMAP memory region
 can be safely mapped as cacheable. This relies on the presence of
 force write back (FWB) feature support on the hardware.
 
+7.45 KVM_CAP_EXCEPTION_NESTED_FLAG
+----------------------------------
+
+:Architectures: x86
+:Parameters: args[0] whether feature should be enabled or not
+
+With this capability enabled, an exception is saved/restored with the
+additional information of whether it was nested or not. FRED event
+delivery uses this information to ensure a correct event stack level
+is chosen when a VM entry injects a nested exception.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6299c43dfbee..7549e5143249 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1484,6 +1484,7 @@ struct kvm_arch {
 	bool has_mapped_host_mmio;
 	bool guest_can_read_msr_platform_info;
 	bool exception_payload_enabled;
+	bool exception_nested_flag_enabled;
 
 	bool triple_fault_event;
 
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 478d9b63a9db..03ea8c46d8cf 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -326,6 +326,7 @@ struct kvm_reinject_control {
 #define KVM_VCPUEVENT_VALID_SMM		0x00000008
 #define KVM_VCPUEVENT_VALID_PAYLOAD	0x00000010
 #define KVM_VCPUEVENT_VALID_TRIPLE_FAULT	0x00000020
+#define KVM_VCPUEVENT_VALID_NESTED_FLAG	0x00000040
 
 /* Interrupt shadow states */
 #define KVM_X86_SHADOW_INT_MOV_SS	0x01
@@ -363,7 +364,8 @@ struct kvm_vcpu_events {
 	struct {
 		__u8 pending;
 	} triple_fault;
-	__u8 reserved[26];
+	__u8 reserved[25];
+	__u8 exception_is_nested;
 	__u8 exception_has_payload;
 	__u64 exception_payload;
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fbbfa600e2c2..7103eedb13e8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4927,6 +4927,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_GET_MSR_FEATURES:
 	case KVM_CAP_MSR_PLATFORM_INFO:
 	case KVM_CAP_EXCEPTION_PAYLOAD:
+	case KVM_CAP_EXCEPTION_NESTED_FLAG:
 	case KVM_CAP_X86_TRIPLE_FAULT_EVENT:
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_LAST_CPU:
@@ -5672,6 +5673,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	events->exception.error_code = ex->error_code;
 	events->exception_has_payload = ex->has_payload;
 	events->exception_payload = ex->payload;
+	events->exception_is_nested = ex->nested;
 
 	events->interrupt.injected =
 		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
@@ -5697,6 +5699,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 			 | KVM_VCPUEVENT_VALID_SMM);
 	if (vcpu->kvm->arch.exception_payload_enabled)
 		events->flags |= KVM_VCPUEVENT_VALID_PAYLOAD;
+	if (vcpu->kvm->arch.exception_nested_flag_enabled)
+		events->flags |= KVM_VCPUEVENT_VALID_NESTED_FLAG;
 	if (vcpu->kvm->arch.triple_fault_event) {
 		events->triple_fault.pending = kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
 		events->flags |= KVM_VCPUEVENT_VALID_TRIPLE_FAULT;
@@ -5711,7 +5715,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 			      | KVM_VCPUEVENT_VALID_SHADOW
 			      | KVM_VCPUEVENT_VALID_SMM
 			      | KVM_VCPUEVENT_VALID_PAYLOAD
-			      | KVM_VCPUEVENT_VALID_TRIPLE_FAULT))
+			      | KVM_VCPUEVENT_VALID_TRIPLE_FAULT
+			      | KVM_VCPUEVENT_VALID_NESTED_FLAG))
 		return -EINVAL;
 
 	if (events->flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
@@ -5726,6 +5731,13 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		events->exception_has_payload = 0;
 	}
 
+	if (events->flags & KVM_VCPUEVENT_VALID_NESTED_FLAG) {
+		if (!vcpu->kvm->arch.exception_nested_flag_enabled)
+			return -EINVAL;
+	} else {
+		events->exception_is_nested = 0;
+	}
+
 	if ((events->exception.injected || events->exception.pending) &&
 	    (events->exception.nr > 31 || events->exception.nr == NMI_VECTOR))
 		return -EINVAL;
@@ -5751,6 +5763,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 	vcpu->arch.exception.error_code = events->exception.error_code;
 	vcpu->arch.exception.has_payload = events->exception_has_payload;
 	vcpu->arch.exception.payload = events->exception_payload;
+	vcpu->arch.exception.nested = events->exception_is_nested;
 
 	vcpu->arch.interrupt.injected = events->interrupt.injected;
 	vcpu->arch.interrupt.nr = events->interrupt.nr;
@@ -6800,6 +6813,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		kvm->arch.exception_payload_enabled = cap->args[0];
 		r = 0;
 		break;
+	case KVM_CAP_EXCEPTION_NESTED_FLAG:
+		kvm->arch.exception_nested_flag_enabled = cap->args[0];
+		r = 0;
+		break;
 	case KVM_CAP_X86_TRIPLE_FAULT_EVENT:
 		kvm->arch.triple_fault_event = cap->args[0];
 		r = 0;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f0f0d49d2544..fe4a822b3c09 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -962,6 +962,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_EL2_E2H0 241
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
 #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
+#define KVM_CAP_EXCEPTION_NESTED_FLAG 244
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
-- 
2.50.1



* [PATCH v6 13/20] KVM: x86: Mark CR4.FRED as not reserved
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (11 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 12/20] KVM: x86: Save/restore the nested flag of an exception Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 14/20] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

The CR4.FRED bit, i.e., CR4[32], is no longer a reserved bit when the
guest CPU capabilities include FRED, i.e., when:
  1) All of FRED KVM support is in place.
  2) Guest enumerates FRED.

Otherwise it is still a reserved bit.
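
The two conditions can be pictured with a small stand-alone model. It only
tracks the FRED bit; the bit position comes from the text above, while
cr4_fred_write_is_valid() and its two boolean inputs are hypothetical
stand-ins for the KVM capability and the guest CPUID state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_FRED (1ULL << 32)	/* CR4[32], as stated above */

/*
 * Model (assumption): the guest capability is only set when KVM supports
 * FRED and the guest CPUID enumerates it.  Without the capability,
 * CR4.FRED stays reserved and a write setting it is rejected.
 */
static bool cr4_fred_write_is_valid(uint64_t cr4, bool kvm_supports_fred,
				    bool guest_enumerates_fred)
{
	bool guest_cap_has_fred = kvm_supports_fred && guest_enumerates_fred;
	uint64_t reserved_bits = 0;

	if (!guest_cap_has_fred)
		reserved_bits |= X86_CR4_FRED;

	return !(cr4 & reserved_bits);
}
```

Only the combination of both conditions lets a guest write CR4.FRED=1 in
this model; every other combination treats the bit as reserved.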

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.

Change in v4:
* Rebase on top of "guest_cpu_cap".

Change in v3:
* Don't allow CR4.FRED=1 before all of FRED KVM support is in place
  (Sean Christopherson).
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/x86.h              | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7549e5143249..0b5857997b22 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -142,7 +142,7 @@
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
 			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
-			  | X86_CR4_LAM_SUP | X86_CR4_CET))
+			  | X86_CR4_LAM_SUP | X86_CR4_CET | X86_CR4_FRED))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 685eb710b1f2..c9f010862b2a 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -688,6 +688,8 @@ static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	if (!__cpu_has(__c, X86_FEATURE_SHSTK) &&       \
 	    !__cpu_has(__c, X86_FEATURE_IBT))           \
 		__reserved_bits |= X86_CR4_CET;         \
+	if (!__cpu_has(__c, X86_FEATURE_FRED))          \
+		__reserved_bits |= X86_CR4_FRED;        \
 	__reserved_bits;                                \
 })
 
-- 
2.50.1



* [PATCH v6 14/20] KVM: VMX: Dump FRED context in dump_vmcs()
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (12 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 13/20] KVM: x86: Mark CR4.FRED as not reserved Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 15/20] KVM: x86: Advertise support for FRED Xin Li (Intel)
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Add FRED related VMCS fields to dump_vmcs() to dump FRED context.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Changes in v5:
* Read guest FRED RSP0 with vmx_read_guest_fred_rsp0() (Sean).
* Add TB from Xuelian Guo.

Change in v3:
* Use (vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED) instead of is_fred_enabled()
  (Chao Gao).

Changes in v2:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
* Dump guest FRED states only if guest has FRED enabled (Nikolay Borisov).
---
 arch/x86/kvm/vmx/vmx.c | 43 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7a7856f06f98..ac76cb33f3de 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1400,6 +1400,9 @@ static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
 	vmx_write_guest_host_msr(vmx, MSR_IA32_FRED_RSP0, data,
 				 &vmx->msr_guest_fred_rsp0);
 }
+#else
+/* To make sure dump_vmcs() compiles on 32-bit */
+static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx) { return 0; }
 #endif
 
 static void grow_ple_window(struct kvm_vcpu *vcpu)
@@ -6429,7 +6432,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 vmentry_ctl, vmexit_ctl;
 	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
-	u64 tertiary_exec_control;
+	u64 tertiary_exec_control, secondary_vmexit_ctl;
 	unsigned long cr4;
 	int efer_slot;
 
@@ -6440,6 +6443,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 
 	vmentry_ctl = vmcs_read32(VM_ENTRY_CONTROLS);
 	vmexit_ctl = vmcs_read32(VM_EXIT_CONTROLS);
+	secondary_vmexit_ctl = cpu_has_secondary_vmexit_ctrls() ?
+			       vmcs_read64(SECONDARY_VM_EXIT_CONTROLS) : 0;
 	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
 	cr4 = vmcs_readl(GUEST_CR4);
@@ -6486,6 +6491,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	vmx_dump_sel("LDTR:", GUEST_LDTR_SELECTOR);
 	vmx_dump_dtsel("IDTR:", GUEST_IDTR_LIMIT);
 	vmx_dump_sel("TR:  ", GUEST_TR_SELECTOR);
+	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_FRED)
+		pr_err("FRED guest: config=0x%016llx, stack_levels=0x%016llx\n"
+		       "RSP0=0x%016llx, RSP1=0x%016llx\n"
+		       "RSP2=0x%016llx, RSP3=0x%016llx\n",
+		       vmcs_read64(GUEST_IA32_FRED_CONFIG),
+		       vmcs_read64(GUEST_IA32_FRED_STKLVLS),
+		       vmx_read_guest_fred_rsp0(vmx),
+		       vmcs_read64(GUEST_IA32_FRED_RSP1),
+		       vmcs_read64(GUEST_IA32_FRED_RSP2),
+		       vmcs_read64(GUEST_IA32_FRED_RSP3));
 	efer_slot = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest, MSR_EFER);
 	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_EFER)
 		pr_err("EFER= 0x%016llx\n", vmcs_read64(GUEST_IA32_EFER));
@@ -6537,6 +6552,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	       vmcs_readl(HOST_TR_BASE));
 	pr_err("GDTBase=%016lx IDTBase=%016lx\n",
 	       vmcs_readl(HOST_GDTR_BASE), vmcs_readl(HOST_IDTR_BASE));
+	if (secondary_vmexit_ctl & SECONDARY_VM_EXIT_LOAD_IA32_FRED)
+		pr_err("FRED host: config=0x%016llx, stack_levels=0x%016llx\n"
+		       "RSP0=0x%016lx, RSP1=0x%016llx\n"
+		       "RSP2=0x%016llx, RSP3=0x%016llx\n",
+		       vmcs_read64(HOST_IA32_FRED_CONFIG),
+		       vmcs_read64(HOST_IA32_FRED_STKLVLS),
+		       (unsigned long)task_stack_page(current) + THREAD_SIZE,
+		       vmcs_read64(HOST_IA32_FRED_RSP1),
+		       vmcs_read64(HOST_IA32_FRED_RSP2),
+		       vmcs_read64(HOST_IA32_FRED_RSP3));
 	pr_err("CR0=%016lx CR3=%016lx CR4=%016lx\n",
 	       vmcs_readl(HOST_CR0), vmcs_readl(HOST_CR3),
 	       vmcs_readl(HOST_CR4));
@@ -6562,25 +6587,29 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	pr_err("*** Control State ***\n");
 	pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
 	       cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
-	pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
-	       pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
+	pr_err("PinBased=0x%08x EntryControls=0x%08x\n",
+	       pin_based_exec_ctrl, vmentry_ctl);
+	pr_err("ExitControls=0x%08x SecondaryExitControls=0x%016llx\n",
+	       vmexit_ctl, secondary_vmexit_ctl);
 	pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
 	       vmcs_read32(EXCEPTION_BITMAP),
 	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
 	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH));
-	pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x\n",
+	pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x event_data=%016llx\n",
 	       vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 	       vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE),
-	       vmcs_read32(VM_ENTRY_INSTRUCTION_LEN));
+	       vmcs_read32(VM_ENTRY_INSTRUCTION_LEN),
+	       kvm_cpu_cap_has(X86_FEATURE_FRED) ? vmcs_read64(INJECTED_EVENT_DATA) : 0);
 	pr_err("VMExit: intr_info=%08x errcode=%08x ilen=%08x\n",
 	       vmcs_read32(VM_EXIT_INTR_INFO),
 	       vmcs_read32(VM_EXIT_INTR_ERROR_CODE),
 	       vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
 	pr_err("        reason=%08x qualification=%016lx\n",
 	       vmcs_read32(VM_EXIT_REASON), vmcs_readl(EXIT_QUALIFICATION));
-	pr_err("IDTVectoring: info=%08x errcode=%08x\n",
+	pr_err("IDTVectoring: info=%08x errcode=%08x event_data=%016llx\n",
 	       vmcs_read32(IDT_VECTORING_INFO_FIELD),
-	       vmcs_read32(IDT_VECTORING_ERROR_CODE));
+	       vmcs_read32(IDT_VECTORING_ERROR_CODE),
+	       kvm_cpu_cap_has(X86_FEATURE_FRED) ? vmcs_read64(ORIGINAL_EVENT_DATA) : 0);
 	pr_err("TSC Offset = 0x%016llx\n", vmcs_read64(TSC_OFFSET));
 	if (secondary_exec_control & SECONDARY_EXEC_TSC_SCALING)
 		pr_err("TSC Multiplier = 0x%016llx\n",
-- 
2.50.1



* [PATCH v6 15/20] KVM: x86: Advertise support for FRED
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (13 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 14/20] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 16/20] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Advertise support for FRED to userspace after changes required to enable
FRED in a KVM guest are in place.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Don't advertise FRED/LKGS together, LKGS can be advertised as an
  independent feature (Sean).
* Add TB from Xuelian Guo.
---
 arch/x86/kvm/cpuid.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ee05b876c656..1f15aad02c68 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -994,6 +994,7 @@ void kvm_set_cpu_caps(void)
 		F(FSRS),
 		F(FSRC),
 		F(WRMSRNS),
+		X86_64_F(FRED),
 		X86_64_F(LKGS),
 		F(AMX_FP16),
 		F(AVX_IFMA),
-- 
2.50.1



* [PATCH v6 16/20] KVM: nVMX: Add support for the secondary VM exit controls
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (14 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 15/20] KVM: x86: Advertise support for FRED Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 17/20] KVM: nVMX: Add FRED VMCS fields to nested VMX context handling Xin Li (Intel)
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Enable the secondary VM exit controls to prepare for nested FRED.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Changes in v5:
* Allow writing MSR_IA32_VMX_EXIT_CTLS2 (Sean).
* Add TB from Xuelian Guo.

Change in v3:
* Read secondary VM exit controls from vmcs_conf instead of the hardware
  MSR MSR_IA32_VMX_EXIT_CTLS2 to avoid advertising features to L1 that KVM
  itself doesn't support, e.g. because the expected entry+exit pairs aren't
  supported. (Sean Christopherson)
---
 Documentation/virt/kvm/x86/nested-vmx.rst |  1 +
 arch/x86/kvm/vmx/capabilities.h           |  1 +
 arch/x86/kvm/vmx/nested.c                 | 26 ++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmcs12.c                 |  1 +
 arch/x86/kvm/vmx/vmcs12.h                 |  2 ++
 arch/x86/kvm/x86.h                        |  2 +-
 6 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index ac2095d41f02..e64ef231f310 100644
--- a/Documentation/virt/kvm/x86/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -217,6 +217,7 @@ struct shadow_vmcs is ever changed.
 		u16 host_fs_selector;
 		u16 host_gs_selector;
 		u16 host_tr_selector;
+		u64 secondary_vm_exit_controls;
 	};
 
 
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 7fe95a601c9f..c9f00b6594d9 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -37,6 +37,7 @@ struct nested_vmx_msrs {
 	u32 pinbased_ctls_high;
 	u32 exit_ctls_low;
 	u32 exit_ctls_high;
+	u64 secondary_exit_ctls;
 	u32 entry_ctls_low;
 	u32 entry_ctls_high;
 	u32 misc_low;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d7e2fb30fc1a..e4de8372b9f9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1531,6 +1531,11 @@ int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 			return -EINVAL;
 		vmx->nested.msrs.vmfunc_controls = data;
 		return 0;
+	case MSR_IA32_VMX_EXIT_CTLS2:
+		if (data & ~vmcs_config.nested.secondary_exit_ctls)
+			return -EINVAL;
+		vmx->nested.msrs.secondary_exit_ctls = data;
+		return 0;
 	default:
 		/*
 		 * The rest of the VMX capability MSRs do not support restore.
@@ -1570,6 +1575,9 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
 		if (msr_index == MSR_IA32_VMX_EXIT_CTLS)
 			*pdata |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 		break;
+	case MSR_IA32_VMX_EXIT_CTLS2:
+		*pdata = msrs->secondary_exit_ctls;
+		break;
 	case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
 	case MSR_IA32_VMX_ENTRY_CTLS:
 		*pdata = vmx_control_msr(
@@ -2520,6 +2528,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 		exec_control &= ~VM_EXIT_LOAD_IA32_EFER;
 	vm_exit_controls_set(vmx, exec_control);
 
+	if (exec_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
+		exec_control = __secondary_vm_exit_controls_get(vmcs01);
+		secondary_vm_exit_controls_set(vmx, exec_control);
+	}
+
 	/*
 	 * Interrupt/Exception Fields
 	 */
@@ -7176,7 +7189,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
-		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
+		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE |
+		VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
 	msrs->exit_ctls_high |=
 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -7185,6 +7199,16 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
 
 	/* We support free control of debug control saving. */
 	msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
+
+	if (msrs->exit_ctls_high & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
+		msrs->secondary_exit_ctls = vmcs_conf->vmexit_2nd_ctrl;
+		/*
+		 * As the secondary VM exit control is always loaded, do not
+		 * advertise any feature in it to nVMX until its nVMX support
+		 * is ready.
+		 */
+		msrs->secondary_exit_ctls &= 0;
+	}
 }
 
 static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 4233b5ca9461..3b01175f392a 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -66,6 +66,7 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD64(HOST_IA32_PAT, host_ia32_pat),
 	FIELD64(HOST_IA32_EFER, host_ia32_efer),
 	FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
+	FIELD64(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls),
 	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
 	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
 	FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 4ad6b16525b9..7866fdce7a23 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -191,6 +191,7 @@ struct __packed vmcs12 {
 	u16 host_gs_selector;
 	u16 host_tr_selector;
 	u16 guest_pml_index;
+	u64 secondary_vm_exit_controls;
 };
 
 /*
@@ -372,6 +373,7 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_gs_selector, 992);
 	CHECK_OFFSET(host_tr_selector, 994);
 	CHECK_OFFSET(guest_pml_index, 996);
+	CHECK_OFFSET(secondary_vm_exit_controls, 998);
 }
 
 extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c9f010862b2a..88a4eaafc81b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -95,7 +95,7 @@ do {											\
  * associated feature that KVM supports for nested virtualization.
  */
 #define KVM_FIRST_EMULATED_VMX_MSR	MSR_IA32_VMX_BASIC
-#define KVM_LAST_EMULATED_VMX_MSR	MSR_IA32_VMX_VMFUNC
+#define KVM_LAST_EMULATED_VMX_MSR	MSR_IA32_VMX_EXIT_CTLS2
 
 #define KVM_DEFAULT_PLE_GAP		128
 #define KVM_VMX_DEFAULT_PLE_WINDOW	4096
-- 
2.50.1



* [PATCH v6 17/20] KVM: nVMX: Add FRED VMCS fields to nested VMX context handling
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (15 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 16/20] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 18/20] KVM: nVMX: Add FRED-related VMCS field checks Xin Li (Intel)
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Extend nested VMX context management to include FRED-related VMCS fields.
This enables proper handling of FRED state during nested virtualization.

Because KVM always sets SECONDARY_VM_EXIT_SAVE_IA32_FRED, FRED MSRs are
always saved to vmcs02.  However, an L1 VMM may choose to clear this bit,
i.e., not to save FRED MSRs to vmcs12.  This is not a problem when the L1
VMM sets SECONDARY_VM_EXIT_LOAD_IA32_FRED, as KVM then immediately loads
the host FRED MSRs of vmcs12 into the guest FRED MSRs of vmcs01.  However,
if the L1 VMM clears SECONDARY_VM_EXIT_LOAD_IA32_FRED, KVM should retain
the FRED MSRs to run the L1 VMM.

To propagate guest FRED MSRs from vmcs02 to vmcs01, save them in
sync_vmcs02_to_vmcs12() regardless of whether
SECONDARY_VM_EXIT_SAVE_IA32_FRED is set in vmcs12.  Then, use the saved
values to set guest FRED MSRs in vmcs01 within load_vmcs12_host_state()
when !nested_cpu_load_host_fred_state().
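
The decision described above can be modelled in isolation.  The following
is a minimal sketch, not the KVM implementation: the demo_* types are
hypothetical, only two of the eight FRED fields are modelled, and the two
booleans stand in for SECONDARY_VM_EXIT_SAVE_IA32_FRED and
SECONDARY_VM_EXIT_LOAD_IA32_FRED as configured by L1 in vmcs12.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the FRED MSR set. */
struct demo_fred_msrs {
	uint64_t config;
	uint64_t rsp1;
};

/* Hypothetical, reduced view of vmcs12. */
struct demo_vmcs12 {
	bool save_guest_fred;	/* models SECONDARY_VM_EXIT_SAVE_IA32_FRED */
	bool load_host_fred;	/* models SECONDARY_VM_EXIT_LOAD_IA32_FRED */
	struct demo_fred_msrs guest;
	struct demo_fred_msrs host;
};

/*
 * Model an emulated VM-Exit from L2 to L1.  Returns the values that end
 * up in the guest FRED fields of vmcs01, i.e. what L1 runs with.
 */
static struct demo_fred_msrs
demo_nested_vmexit(struct demo_vmcs12 *vmcs12,
		   const struct demo_fred_msrs *vmcs02_guest)
{
	/* Always snapshot, regardless of what L1 asked for. */
	struct demo_fred_msrs at_vmexit = *vmcs02_guest;

	/* Propagate to vmcs12 only if L1 requested the save. */
	if (vmcs12->save_guest_fred)
		vmcs12->guest = at_vmexit;

	/* L1 either gets its host state or inherits L2's values. */
	return vmcs12->load_host_fred ? vmcs12->host : at_vmexit;
}
```

The point of the unconditional snapshot is the last line: when L1 clears
both controls, the values L2 left behind are still available for vmcs01
even though vmcs12 was never updated.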

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Changes in v6:
* Handle FRED MSR pre-vmenter save/restore (Chao Gao).
* Save FRED MSRs of vmcs02 at VM-Exit even if an L1 VMM clears
  SECONDARY_VM_EXIT_SAVE_IA32_FRED.
* Save FRED MSRs in sync_vmcs02_to_vmcs12() instead of
  sync_vmcs02_to_vmcs12_rare().

Change in v5:
* Add TB from Xuelian Guo.

Changes in v4:
* Advertise VMX nested exception as if the CPU supports it (Chao Gao).
* Split FRED state management controls (Chao Gao).

Changes in v3:
* Add and use nested_cpu_has_fred(vmcs12) because vmcs02 should be set
  from vmcs12 if and only if the field is enabled in L1's VMX config
  (Sean Christopherson).
* Fix coding style issues (Sean Christopherson).

Changes in v2:
* Remove hyperv TLFS related changes (Jeremi Piotrowski).
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
---
 Documentation/virt/kvm/x86/nested-vmx.rst |  18 ++++
 arch/x86/kvm/vmx/capabilities.h           |   5 +
 arch/x86/kvm/vmx/nested.c                 | 113 +++++++++++++++++++++-
 arch/x86/kvm/vmx/nested.h                 |  22 +++++
 arch/x86/kvm/vmx/vmcs12.c                 |  18 ++++
 arch/x86/kvm/vmx/vmcs12.h                 |  36 +++++++
 arch/x86/kvm/vmx/vmcs_shadow_fields.h     |   4 +
 arch/x86/kvm/vmx/vmx.h                    |  41 ++++++++
 8 files changed, 255 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index e64ef231f310..87fa9f3877ab 100644
--- a/Documentation/virt/kvm/x86/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -218,6 +218,24 @@ struct shadow_vmcs is ever changed.
 		u16 host_gs_selector;
 		u16 host_tr_selector;
 		u64 secondary_vm_exit_controls;
+		u64 guest_ia32_fred_config;
+		u64 guest_ia32_fred_rsp1;
+		u64 guest_ia32_fred_rsp2;
+		u64 guest_ia32_fred_rsp3;
+		u64 guest_ia32_fred_stklvls;
+		u64 guest_ia32_fred_ssp1;
+		u64 guest_ia32_fred_ssp2;
+		u64 guest_ia32_fred_ssp3;
+		u64 host_ia32_fred_config;
+		u64 host_ia32_fred_rsp1;
+		u64 host_ia32_fred_rsp2;
+		u64 host_ia32_fred_rsp3;
+		u64 host_ia32_fred_stklvls;
+		u64 host_ia32_fred_ssp1;
+		u64 host_ia32_fred_ssp2;
+		u64 host_ia32_fred_ssp3;
+		u64 injected_event_data;
+		u64 original_event_data;
 	};
 
 
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index c9f00b6594d9..86e6d4b14011 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -83,6 +83,11 @@ static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
 	return	vmcs_config.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
 }
 
+static inline bool cpu_has_vmx_nested_exception(void)
+{
+	return vmcs_config.basic & VMX_BASIC_NESTED_EXCEPTION;
+}
+
 static inline bool cpu_has_virtual_nmis(void)
 {
 	return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index e4de8372b9f9..0cb9a2e43ad2 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -705,6 +705,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_FRED_RSP0, MSR_TYPE_RW);
 #endif
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
@@ -1291,9 +1294,11 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
 	const u64 feature_bits = VMX_BASIC_DUAL_MONITOR_TREATMENT |
 				 VMX_BASIC_INOUT |
 				 VMX_BASIC_TRUE_CTLS |
-				 VMX_BASIC_NO_HW_ERROR_CODE_CC;
+				 VMX_BASIC_NO_HW_ERROR_CODE_CC |
+				 VMX_BASIC_NESTED_EXCEPTION;
 
-	const u64 reserved_bits = GENMASK_ULL(63, 57) |
+	const u64 reserved_bits = GENMASK_ULL(63, 59) |
+				  BIT_ULL(57) |
 				  GENMASK_ULL(47, 45) |
 				  BIT_ULL(31);
 
@@ -2545,6 +2550,8 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 			     vmcs12->vm_entry_instruction_len);
 		vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
 			     vmcs12->guest_interruptibility_info);
+		if (cpu_has_vmx_fred())
+			vmcs_write64(INJECTED_EVENT_DATA, vmcs12->injected_event_data);
 		vmx->loaded_vmcs->nmi_known_unmasked =
 			!(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_NMI);
 	} else {
@@ -2699,6 +2706,17 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 				     vmcs12->guest_ssp, vmcs12->guest_ssp_tbl);
 
 	set_cr4_guest_host_mask(vmx);
+
+	if (nested_cpu_load_guest_fred_state(vmcs12)) {
+		vmcs_write64(GUEST_IA32_FRED_CONFIG, vmcs12->guest_ia32_fred_config);
+		vmcs_write64(GUEST_IA32_FRED_RSP1, vmcs12->guest_ia32_fred_rsp1);
+		vmcs_write64(GUEST_IA32_FRED_RSP2, vmcs12->guest_ia32_fred_rsp2);
+		vmcs_write64(GUEST_IA32_FRED_RSP3, vmcs12->guest_ia32_fred_rsp3);
+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmcs12->guest_ia32_fred_stklvls);
+		vmcs_write64(GUEST_IA32_FRED_SSP1, vmcs12->guest_ia32_fred_ssp1);
+		vmcs_write64(GUEST_IA32_FRED_SSP2, vmcs12->guest_ia32_fred_ssp2);
+		vmcs_write64(GUEST_IA32_FRED_SSP3, vmcs12->guest_ia32_fred_ssp3);
+	}
 }
 
 /*
@@ -2765,6 +2783,18 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 		vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
 	}
 
+	if (!vmx->nested.nested_run_pending ||
+	    !nested_cpu_load_guest_fred_state(vmcs12)) {
+		vmcs_write64(GUEST_IA32_FRED_CONFIG, vmx->nested.pre_vmenter_fred_config);
+		vmcs_write64(GUEST_IA32_FRED_RSP1, vmx->nested.pre_vmenter_fred_rsp1);
+		vmcs_write64(GUEST_IA32_FRED_RSP2, vmx->nested.pre_vmenter_fred_rsp2);
+		vmcs_write64(GUEST_IA32_FRED_RSP3, vmx->nested.pre_vmenter_fred_rsp3);
+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmx->nested.pre_vmenter_fred_stklvls);
+		vmcs_write64(GUEST_IA32_FRED_SSP1, vmx->nested.pre_vmenter_fred_ssp1);
+		vmcs_write64(GUEST_IA32_FRED_SSP2, vmx->nested.pre_vmenter_fred_ssp2);
+		vmcs_write64(GUEST_IA32_FRED_SSP3, vmx->nested.pre_vmenter_fred_ssp3);
+	}
+
 	vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset(
 			vcpu->arch.l1_tsc_offset,
 			vmx_get_l2_tsc_offset(vcpu),
@@ -3679,6 +3709,18 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 				    &vmx->nested.pre_vmenter_ssp,
 				    &vmx->nested.pre_vmenter_ssp_tbl);
 
+	if (!vmx->nested.nested_run_pending ||
+	    !nested_cpu_load_guest_fred_state(vmcs12)) {
+		vmx->nested.pre_vmenter_fred_config = vmcs_read64(GUEST_IA32_FRED_CONFIG);
+		vmx->nested.pre_vmenter_fred_rsp1 = vmcs_read64(GUEST_IA32_FRED_RSP1);
+		vmx->nested.pre_vmenter_fred_rsp2 = vmcs_read64(GUEST_IA32_FRED_RSP2);
+		vmx->nested.pre_vmenter_fred_rsp3 = vmcs_read64(GUEST_IA32_FRED_RSP3);
+		vmx->nested.pre_vmenter_fred_stklvls = vmcs_read64(GUEST_IA32_FRED_STKLVLS);
+		vmx->nested.pre_vmenter_fred_ssp1 = vmcs_read64(GUEST_IA32_FRED_SSP1);
+		vmx->nested.pre_vmenter_fred_ssp2 = vmcs_read64(GUEST_IA32_FRED_SSP2);
+		vmx->nested.pre_vmenter_fred_ssp3 = vmcs_read64(GUEST_IA32_FRED_SSP3);
+	}
+
 	/*
 	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
 	 * nested early checks are disabled.  In the event of a "late" VM-Fail,
@@ -3986,6 +4028,8 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 	u32 idt_vectoring;
 	unsigned int nr;
 
+	vmcs12->original_event_data = 0;
+
 	/*
 	 * Per the SDM, VM-Exits due to double and triple faults are never
 	 * considered to occur during event delivery, even if the double/triple
@@ -4024,6 +4068,13 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 				vcpu->arch.exception.error_code;
 		}
 
+		if ((vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) &&
+		    (vmcs12->guest_cr4 & X86_CR4_FRED) &&
+		    (vcpu->arch.exception.nested))
+			idt_vectoring |= VECTORING_INFO_NESTED_EXCEPTION_MASK;
+
+		vmcs12->original_event_data = vcpu->arch.exception.event_data;
+
 		vmcs12->idt_vectoring_info_field = idt_vectoring;
 	} else if (vcpu->arch.nmi_injected) {
 		vmcs12->idt_vectoring_info_field =
@@ -4766,6 +4817,26 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmcs_read_cet_state(&vmx->vcpu, &vmcs12->guest_s_cet,
 			    &vmcs12->guest_ssp,
 			    &vmcs12->guest_ssp_tbl);
+
+	vmx->nested.fred_msr_at_vmexit.fred_config = vmcs_read64(GUEST_IA32_FRED_CONFIG);
+	vmx->nested.fred_msr_at_vmexit.fred_rsp1 = vmcs_read64(GUEST_IA32_FRED_RSP1);
+	vmx->nested.fred_msr_at_vmexit.fred_rsp2 = vmcs_read64(GUEST_IA32_FRED_RSP2);
+	vmx->nested.fred_msr_at_vmexit.fred_rsp3 = vmcs_read64(GUEST_IA32_FRED_RSP3);
+	vmx->nested.fred_msr_at_vmexit.fred_stklvls = vmcs_read64(GUEST_IA32_FRED_STKLVLS);
+	vmx->nested.fred_msr_at_vmexit.fred_ssp1 = vmcs_read64(GUEST_IA32_FRED_SSP1);
+	vmx->nested.fred_msr_at_vmexit.fred_ssp2 = vmcs_read64(GUEST_IA32_FRED_SSP2);
+	vmx->nested.fred_msr_at_vmexit.fred_ssp3 = vmcs_read64(GUEST_IA32_FRED_SSP3);
+
+	if (nested_cpu_save_guest_fred_state(vmcs12)) {
+		vmcs12->guest_ia32_fred_config = vmx->nested.fred_msr_at_vmexit.fred_config;
+		vmcs12->guest_ia32_fred_rsp1 = vmx->nested.fred_msr_at_vmexit.fred_rsp1;
+		vmcs12->guest_ia32_fred_rsp2 = vmx->nested.fred_msr_at_vmexit.fred_rsp2;
+		vmcs12->guest_ia32_fred_rsp3 = vmx->nested.fred_msr_at_vmexit.fred_rsp3;
+		vmcs12->guest_ia32_fred_stklvls = vmx->nested.fred_msr_at_vmexit.fred_stklvls;
+		vmcs12->guest_ia32_fred_ssp1 = vmx->nested.fred_msr_at_vmexit.fred_ssp1;
+		vmcs12->guest_ia32_fred_ssp2 = vmx->nested.fred_msr_at_vmexit.fred_ssp2;
+		vmcs12->guest_ia32_fred_ssp3 = vmx->nested.fred_msr_at_vmexit.fred_ssp3;
+	}
 }
 
 /*
@@ -4810,6 +4881,21 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 
 		vmcs12->vm_exit_intr_info = exit_intr_info;
 		vmcs12->vm_exit_instruction_len = exit_insn_len;
+
+		/*
+		 * When there is a valid original event, the exiting event is a nested
+		 * event during delivery of the earlier original event.
+		 *
+		 * FRED event delivery reflects this relationship by setting the value
+		 * of the nested exception bit of VM-exit interruption information
+		 * (aka exiting-event identification) to that of the valid bit of the
+		 * IDT-vectoring information (aka original-event identification).
+		 */
+		if ((vmcs12->idt_vectoring_info_field & VECTORING_INFO_VALID_MASK) &&
+		    (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) &&
+		    (vmcs12->guest_cr4 & X86_CR4_FRED))
+			vmcs12->vm_exit_intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
+
 		vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 
 		/*
@@ -4838,6 +4924,7 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 				   struct vmcs12 *vmcs12)
 {
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	enum vm_entry_failure_code ignored;
 	struct kvm_segment seg;
 
@@ -4912,6 +4999,26 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 		WARN_ON_ONCE(__kvm_emulate_msr_write(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
 						     vmcs12->host_ia32_perf_global_ctrl));
 
+	if (nested_cpu_load_host_fred_state(vmcs12)) {
+		vmcs_write64(GUEST_IA32_FRED_CONFIG, vmcs12->host_ia32_fred_config);
+		vmcs_write64(GUEST_IA32_FRED_RSP1, vmcs12->host_ia32_fred_rsp1);
+		vmcs_write64(GUEST_IA32_FRED_RSP2, vmcs12->host_ia32_fred_rsp2);
+		vmcs_write64(GUEST_IA32_FRED_RSP3, vmcs12->host_ia32_fred_rsp3);
+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmcs12->host_ia32_fred_stklvls);
+		vmcs_write64(GUEST_IA32_FRED_SSP1, vmcs12->host_ia32_fred_ssp1);
+		vmcs_write64(GUEST_IA32_FRED_SSP2, vmcs12->host_ia32_fred_ssp2);
+		vmcs_write64(GUEST_IA32_FRED_SSP3, vmcs12->host_ia32_fred_ssp3);
+	} else {
+		vmcs_write64(GUEST_IA32_FRED_CONFIG, vmx->nested.fred_msr_at_vmexit.fred_config);
+		vmcs_write64(GUEST_IA32_FRED_RSP1, vmx->nested.fred_msr_at_vmexit.fred_rsp1);
+		vmcs_write64(GUEST_IA32_FRED_RSP2, vmx->nested.fred_msr_at_vmexit.fred_rsp2);
+		vmcs_write64(GUEST_IA32_FRED_RSP3, vmx->nested.fred_msr_at_vmexit.fred_rsp3);
+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmx->nested.fred_msr_at_vmexit.fred_stklvls);
+		vmcs_write64(GUEST_IA32_FRED_SSP1, vmx->nested.fred_msr_at_vmexit.fred_ssp1);
+		vmcs_write64(GUEST_IA32_FRED_SSP2, vmx->nested.fred_msr_at_vmexit.fred_ssp2);
+		vmcs_write64(GUEST_IA32_FRED_SSP3, vmx->nested.fred_msr_at_vmexit.fred_ssp3);
+	}
+
 	/* Set L1 segment info according to Intel SDM
 	    27.5.2 Loading Host Segment and Descriptor-Table Registers */
 	seg = (struct kvm_segment) {
@@ -7379,6 +7486,8 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
 		msrs->basic |= VMX_BASIC_INOUT;
 	if (cpu_has_vmx_basic_no_hw_errcode())
 		msrs->basic |= VMX_BASIC_NO_HW_ERROR_CODE_CC;
+	if (cpu_has_vmx_nested_exception())
+		msrs->basic |= VMX_BASIC_NESTED_EXCEPTION;
 }
 
 static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index 983484d42ebf..a99d3d83d58e 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -249,6 +249,11 @@ static inline bool nested_cpu_has_save_preemption_timer(struct vmcs12 *vmcs12)
 	    VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
 }
 
+static inline bool nested_cpu_has_secondary_vm_exit_controls(struct vmcs12 *vmcs12)
+{
+	return vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
+}
+
 static inline bool nested_exit_on_nmi(struct kvm_vcpu *vcpu)
 {
 	return nested_cpu_has_nmi_exiting(get_vmcs12(vcpu));
@@ -269,6 +274,23 @@ static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
 	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
 }
 
+static inline bool nested_cpu_load_guest_fred_state(struct vmcs12 *vmcs12)
+{
+	return vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED;
+}
+
+static inline bool nested_cpu_save_guest_fred_state(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has_secondary_vm_exit_controls(vmcs12) &&
+	       vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_SAVE_IA32_FRED;
+}
+
+static inline bool nested_cpu_load_host_fred_state(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has_secondary_vm_exit_controls(vmcs12) &&
+	       vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED;
+}
+
 /*
  * if fixed0[i] == 1: val[i] must be 1
  * if fixed1[i] == 0: val[i] must be 0
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 3b01175f392a..9691e709061f 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -67,6 +67,24 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD64(HOST_IA32_EFER, host_ia32_efer),
 	FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
 	FIELD64(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls),
+	FIELD64(INJECTED_EVENT_DATA, injected_event_data),
+	FIELD64(ORIGINAL_EVENT_DATA, original_event_data),
+	FIELD64(GUEST_IA32_FRED_CONFIG, guest_ia32_fred_config),
+	FIELD64(GUEST_IA32_FRED_RSP1, guest_ia32_fred_rsp1),
+	FIELD64(GUEST_IA32_FRED_RSP2, guest_ia32_fred_rsp2),
+	FIELD64(GUEST_IA32_FRED_RSP3, guest_ia32_fred_rsp3),
+	FIELD64(GUEST_IA32_FRED_STKLVLS, guest_ia32_fred_stklvls),
+	FIELD64(GUEST_IA32_FRED_SSP1, guest_ia32_fred_ssp1),
+	FIELD64(GUEST_IA32_FRED_SSP2, guest_ia32_fred_ssp2),
+	FIELD64(GUEST_IA32_FRED_SSP3, guest_ia32_fred_ssp3),
+	FIELD64(HOST_IA32_FRED_CONFIG, host_ia32_fred_config),
+	FIELD64(HOST_IA32_FRED_RSP1, host_ia32_fred_rsp1),
+	FIELD64(HOST_IA32_FRED_RSP2, host_ia32_fred_rsp2),
+	FIELD64(HOST_IA32_FRED_RSP3, host_ia32_fred_rsp3),
+	FIELD64(HOST_IA32_FRED_STKLVLS, host_ia32_fred_stklvls),
+	FIELD64(HOST_IA32_FRED_SSP1, host_ia32_fred_ssp1),
+	FIELD64(HOST_IA32_FRED_SSP2, host_ia32_fred_ssp2),
+	FIELD64(HOST_IA32_FRED_SSP3, host_ia32_fred_ssp3),
 	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
 	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
 	FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 7866fdce7a23..a3853536a575 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -192,6 +192,24 @@ struct __packed vmcs12 {
 	u16 host_tr_selector;
 	u16 guest_pml_index;
 	u64 secondary_vm_exit_controls;
+	u64 guest_ia32_fred_config;
+	u64 guest_ia32_fred_rsp1;
+	u64 guest_ia32_fred_rsp2;
+	u64 guest_ia32_fred_rsp3;
+	u64 guest_ia32_fred_stklvls;
+	u64 guest_ia32_fred_ssp1;
+	u64 guest_ia32_fred_ssp2;
+	u64 guest_ia32_fred_ssp3;
+	u64 host_ia32_fred_config;
+	u64 host_ia32_fred_rsp1;
+	u64 host_ia32_fred_rsp2;
+	u64 host_ia32_fred_rsp3;
+	u64 host_ia32_fred_stklvls;
+	u64 host_ia32_fred_ssp1;
+	u64 host_ia32_fred_ssp2;
+	u64 host_ia32_fred_ssp3;
+	u64 injected_event_data;
+	u64 original_event_data;
 };
 
 /*
@@ -374,6 +392,24 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_tr_selector, 994);
 	CHECK_OFFSET(guest_pml_index, 996);
 	CHECK_OFFSET(secondary_vm_exit_controls, 998);
+	CHECK_OFFSET(guest_ia32_fred_config, 1006);
+	CHECK_OFFSET(guest_ia32_fred_rsp1, 1014);
+	CHECK_OFFSET(guest_ia32_fred_rsp2, 1022);
+	CHECK_OFFSET(guest_ia32_fred_rsp3, 1030);
+	CHECK_OFFSET(guest_ia32_fred_stklvls, 1038);
+	CHECK_OFFSET(guest_ia32_fred_ssp1, 1046);
+	CHECK_OFFSET(guest_ia32_fred_ssp2, 1054);
+	CHECK_OFFSET(guest_ia32_fred_ssp3, 1062);
+	CHECK_OFFSET(host_ia32_fred_config, 1070);
+	CHECK_OFFSET(host_ia32_fred_rsp1, 1078);
+	CHECK_OFFSET(host_ia32_fred_rsp2, 1086);
+	CHECK_OFFSET(host_ia32_fred_rsp3, 1094);
+	CHECK_OFFSET(host_ia32_fred_stklvls, 1102);
+	CHECK_OFFSET(host_ia32_fred_ssp1, 1110);
+	CHECK_OFFSET(host_ia32_fred_ssp2, 1118);
+	CHECK_OFFSET(host_ia32_fred_ssp3, 1126);
+	CHECK_OFFSET(injected_event_data, 1134);
+	CHECK_OFFSET(original_event_data, 1142);
 }
 
 extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/vmx/vmcs_shadow_fields.h b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
index cad128d1657b..da338327c2b3 100644
--- a/arch/x86/kvm/vmx/vmcs_shadow_fields.h
+++ b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
@@ -74,6 +74,10 @@ SHADOW_FIELD_RW(HOST_GS_BASE, host_gs_base)
 /* 64-bit */
 SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, guest_physical_address)
 SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address)
+SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA, original_event_data)
+SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA_HIGH, original_event_data)
+SHADOW_FIELD_RW(INJECTED_EVENT_DATA, injected_event_data)
+SHADOW_FIELD_RW(INJECTED_EVENT_DATA_HIGH, injected_event_data)
 
 #undef SHADOW_FIELD_RO
 #undef SHADOW_FIELD_RW
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 733fa2ef4bea..825e68acd5e9 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -67,6 +67,37 @@ struct pt_desc {
 	struct pt_ctx guest;
 };
 
+/*
+ * Used to snapshot FRED MSRs that may NOT be saved to vmcs12 as specified
+ * in the VM-Exit controls of vmcs12 configured by L1 VMM.
+ *
+ * FRED MSRs are *always* saved into vmcs02 because KVM always sets
+ * SECONDARY_VM_EXIT_SAVE_IA32_FRED.  However an L1 VMM may choose to clear
+ * this bit, resulting in FRED MSRs not being propagated to vmcs12 from
+ * vmcs02.  When the L1 VMM sets SECONDARY_VM_EXIT_LOAD_IA32_FRED, this is
+ * not a problem, since KVM then immediately loads the host FRED MSRs of
+ * vmcs12 to the guest FRED MSRs of vmcs01.
+ *
+ * But if the L1 VMM clears SECONDARY_VM_EXIT_LOAD_IA32_FRED, KVM should
+ * retain the FRED MSRs, i.e., propagate the guest FRED MSRs of vmcs02 to
+ * the guest FRED MSRs of vmcs01.
+ *
+ * This structure stores guest FRED MSRs that an L1 VMM opts not to save
+ * during VM-Exits from L2 to L1.  These MSRs may still be retained for
+ * running the L1 VMM if SECONDARY_VM_EXIT_LOAD_IA32_FRED is cleared in
+ * vmcs12.
+ */
+struct fred_msr_at_vmexit {
+	u64 fred_config;
+	u64 fred_rsp1;
+	u64 fred_rsp2;
+	u64 fred_rsp3;
+	u64 fred_stklvls;
+	u64 fred_ssp1;
+	u64 fred_ssp2;
+	u64 fred_ssp3;
+};
+
 /*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
@@ -184,6 +215,16 @@ struct nested_vmx {
 	u64 pre_vmenter_s_cet;
 	u64 pre_vmenter_ssp;
 	u64 pre_vmenter_ssp_tbl;
+	u64 pre_vmenter_fred_config;
+	u64 pre_vmenter_fred_rsp1;
+	u64 pre_vmenter_fred_rsp2;
+	u64 pre_vmenter_fred_rsp3;
+	u64 pre_vmenter_fred_stklvls;
+	u64 pre_vmenter_fred_ssp1;
+	u64 pre_vmenter_fred_ssp2;
+	u64 pre_vmenter_fred_ssp3;
+
+	struct fred_msr_at_vmexit fred_msr_at_vmexit;
 
 	/* to migrate it to L1 if L2 writes to L1's CR8 directly */
 	int l1_tpr_threshold;
-- 
2.50.1



* [PATCH v6 18/20] KVM: nVMX: Add FRED-related VMCS field checks
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (16 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 17/20] KVM: nVMX: Add FRED VMCS fields to nested VMX context handling Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 19/20] KVM: nVMX: Add prerequisites to SHADOW_FIELD_R[OW] macros Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 20/20] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

As with real hardware, nested VMX validates various VMCS fields, including
control and guest/host state fields.  This patch adds checks for FRED-related
VMCS fields to support nested FRED functionality.
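
As a rough illustration of the reserved-bit part of these checks, the
sketch below reproduces the masks used in the diff (bits 11, 5:4 and 2 of
the FRED config field, the low 6 bits of each RSP field, the low 3 bits
of each SSP field) in standalone C.  The DEMO_* macros and demo_*
helpers are hypothetical names written for this example, and the
canonical-address checks that the patch also performs are deliberately
left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local equivalents of the kernel's BIT_ULL() and GENMASK_ULL(). */
#define DEMO_BIT_ULL(n)		(1ULL << (n))
#define DEMO_GENMASK_ULL(h, l)	(((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))

/* Bits that must be zero, mirroring the masks in the diff. */
#define DEMO_FRED_CONFIG_RSVD \
	(DEMO_BIT_ULL(11) | DEMO_GENMASK_ULL(5, 4) | DEMO_BIT_ULL(2))
#define DEMO_FRED_RSP_RSVD	DEMO_GENMASK_ULL(5, 0)	/* 64-byte aligned */
#define DEMO_FRED_SSP_RSVD	DEMO_GENMASK_ULL(2, 0)	/* 8-byte aligned */

static bool demo_fred_config_valid(uint64_t config)
{
	return !(config & DEMO_FRED_CONFIG_RSVD);
}

static bool demo_fred_rsp_valid(uint64_t rsp)
{
	return !(rsp & DEMO_FRED_RSP_RSVD);
}

static bool demo_fred_ssp_valid(uint64_t ssp)
{
	return !(ssp & DEMO_FRED_SSP_RSVD);
}
```

A field that fails any of these tests causes the nested VM-Entry
consistency check to fail with -EINVAL in the patch; the same masks are
applied to both the host and the guest copies of the fields.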

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.
---
 arch/x86/kvm/vmx/nested.c | 117 +++++++++++++++++++++++++++++++++-----
 1 file changed, 104 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0cb9a2e43ad2..b56bbac36749 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3031,6 +3031,8 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 					  struct vmcs12 *vmcs12)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	bool fred_enabled = (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) &&
+			    (vmcs12->guest_cr4 & X86_CR4_FRED);
 
 	if (CC(!vmx_control_verify(vmcs12->vm_entry_controls,
 				    vmx->nested.msrs.entry_ctls_low,
@@ -3048,22 +3050,11 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 		u8 vector = intr_info & INTR_INFO_VECTOR_MASK;
 		u32 intr_type = intr_info & INTR_INFO_INTR_TYPE_MASK;
 		bool has_error_code = intr_info & INTR_INFO_DELIVER_CODE_MASK;
+		bool has_nested_exception = vmx->nested.msrs.basic & VMX_BASIC_NESTED_EXCEPTION;
 		bool urg = nested_cpu_has2(vmcs12,
 					   SECONDARY_EXEC_UNRESTRICTED_GUEST);
 		bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
 
-		/* VM-entry interruption-info field: interruption type */
-		if (CC(intr_type == INTR_TYPE_RESERVED) ||
-		    CC(intr_type == INTR_TYPE_OTHER_EVENT &&
-		       !nested_cpu_supports_monitor_trap_flag(vcpu)))
-			return -EINVAL;
-
-		/* VM-entry interruption-info field: vector */
-		if (CC(intr_type == INTR_TYPE_NMI_INTR && vector != NMI_VECTOR) ||
-		    CC(intr_type == INTR_TYPE_HARD_EXCEPTION && vector > 31) ||
-		    CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
-			return -EINVAL;
-
 		/*
 		 * Cannot deliver error code in real mode or if the interrupt
 		 * type is not hardware exception. For other cases, do the
@@ -3088,8 +3079,28 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 		if (CC(intr_info & INTR_INFO_RESVD_BITS_MASK))
 			return -EINVAL;
 
-		/* VM-entry instruction length */
+		/*
+		 * When the CPU enumerates VMX nested-exception support, bit 13
+		 * (set to indicate a nested exception) of the intr info field
+		 * may have value 1.  Otherwise bit 13 is reserved.
+		 */
+		if (CC(!(has_nested_exception && intr_type == INTR_TYPE_HARD_EXCEPTION) &&
+		       intr_info & INTR_INFO_NESTED_EXCEPTION_MASK))
+			return -EINVAL;
+
 		switch (intr_type) {
+		case INTR_TYPE_EXT_INTR:
+			break;
+		case INTR_TYPE_RESERVED:
+			return -EINVAL;
+		case INTR_TYPE_NMI_INTR:
+			if (CC(vector != NMI_VECTOR))
+				return -EINVAL;
+			break;
+		case INTR_TYPE_HARD_EXCEPTION:
+			if (CC(vector > 31))
+				return -EINVAL;
+			break;
 		case INTR_TYPE_SOFT_EXCEPTION:
 		case INTR_TYPE_SOFT_INTR:
 		case INTR_TYPE_PRIV_SW_EXCEPTION:
@@ -3097,6 +3108,24 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 			    CC(vmcs12->vm_entry_instruction_len == 0 &&
 			    CC(!nested_cpu_has_zero_length_injection(vcpu))))
 				return -EINVAL;
+			break;
+		case INTR_TYPE_OTHER_EVENT:
+			switch (vector) {
+			case 0:
+				if (CC(!nested_cpu_supports_monitor_trap_flag(vcpu)))
+					return -EINVAL;
+				break;
+			case 1:
+			case 2:
+				if (CC(!fred_enabled))
+					return -EINVAL;
+				if (CC(vmcs12->vm_entry_instruction_len > X86_MAX_INSTRUCTION_LENGTH))
+					return -EINVAL;
+				break;
+			default:
+				return -EINVAL;
+			}
+			break;
 		}
 	}
 
@@ -3184,9 +3213,29 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 	if (ia32e) {
 		if (CC(!(vmcs12->host_cr4 & X86_CR4_PAE)))
 			return -EINVAL;
+		if (vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
+		    vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED) {
+			if (CC(vmcs12->host_ia32_fred_config &
+			       (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2))) ||
+			    CC(vmcs12->host_ia32_fred_rsp1 & GENMASK_ULL(5, 0)) ||
+			    CC(vmcs12->host_ia32_fred_rsp2 & GENMASK_ULL(5, 0)) ||
+			    CC(vmcs12->host_ia32_fred_rsp3 & GENMASK_ULL(5, 0)) ||
+			    CC(vmcs12->host_ia32_fred_ssp1 & GENMASK_ULL(2, 0)) ||
+			    CC(vmcs12->host_ia32_fred_ssp2 & GENMASK_ULL(2, 0)) ||
+			    CC(vmcs12->host_ia32_fred_ssp3 & GENMASK_ULL(2, 0)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_config & PAGE_MASK, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_rsp1, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_rsp2, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_rsp3, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_ssp1, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_ssp2, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_ssp3, vcpu)))
+				return -EINVAL;
+		}
 	} else {
 		if (CC(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) ||
 		    CC(vmcs12->host_cr4 & X86_CR4_PCIDE) ||
+		    CC(vmcs12->host_cr4 & X86_CR4_FRED) ||
 		    CC((vmcs12->host_rip) >> 32))
 			return -EINVAL;
 	}
@@ -3354,6 +3403,48 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
 	     CC((vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD))))
 		return -EINVAL;
 
+	if (ia32e) {
+		if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED) {
+			if (CC(vmcs12->guest_ia32_fred_config &
+			       (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2))) ||
+			    CC(vmcs12->guest_ia32_fred_rsp1 & GENMASK_ULL(5, 0)) ||
+			    CC(vmcs12->guest_ia32_fred_rsp2 & GENMASK_ULL(5, 0)) ||
+			    CC(vmcs12->guest_ia32_fred_rsp3 & GENMASK_ULL(5, 0)) ||
+			    CC(vmcs12->guest_ia32_fred_ssp1 & GENMASK_ULL(2, 0)) ||
+			    CC(vmcs12->guest_ia32_fred_ssp2 & GENMASK_ULL(2, 0)) ||
+			    CC(vmcs12->guest_ia32_fred_ssp3 & GENMASK_ULL(2, 0)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_config & PAGE_MASK, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_rsp1, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_rsp2, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_rsp3, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_ssp1, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_ssp2, vcpu)) ||
+			    CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_ssp3, vcpu)))
+				return -EINVAL;
+		}
+		if (vmcs12->guest_cr4 & X86_CR4_FRED) {
+			unsigned int ss_dpl = VMX_AR_DPL(vmcs12->guest_ss_ar_bytes);
+			switch (ss_dpl) {
+			case 0:
+				if (CC(!(vmcs12->guest_cs_ar_bytes & VMX_AR_L_MASK)))
+					return -EINVAL;
+				break;
+			case 1:
+			case 2:
+				return -EINVAL;
+			case 3:
+				if (CC(vmcs12->guest_rflags & X86_EFLAGS_IOPL))
+					return -EINVAL;
+				if (CC(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_STI))
+					return -EINVAL;
+				break;
+			}
+		}
+	} else {
+		if (CC(vmcs12->guest_cr4 & X86_CR4_FRED))
+			return -EINVAL;
+	}
+
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
 		if (CC(!is_valid_cet_state(vcpu, vmcs12->guest_s_cet, vmcs12->guest_ssp,
 					   vmcs12->guest_ssp_tbl)))
-- 
2.50.1
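
The reserved-bit masks used by the host-state and guest-state checks in this
patch can be tried out in isolation.  What follows is a minimal user-space
sketch, not kernel code: BIT_ULL() and GENMASK_ULL() are re-implemented
locally to mirror the kernel macros, and the helper names
fred_config_rsvd_ok(), fred_rsp_ok() and fred_ssp_ok() are hypothetical.
The canonical-address checks are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the kernel's BIT_ULL()/GENMASK_ULL() helpers. */
#define BIT_ULL(n)        (1ULL << (n))
#define GENMASK_ULL(h, l) (((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))

/* Bits the patch rejects in the FRED config field: 11, 5:4 and 2. */
#define FRED_CONFIG_RSVD  (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2))

/* Hypothetical helper: true when no rejected config bit is set. */
static bool fred_config_rsvd_ok(uint64_t config)
{
	return !(config & FRED_CONFIG_RSVD);
}

/* Hypothetical helper: the patch requires bits 5:0 of each RSP field clear. */
static bool fred_rsp_ok(uint64_t rsp)
{
	return !(rsp & GENMASK_ULL(5, 0));
}

/* Hypothetical helper: the patch requires bits 2:0 of each SSP field clear. */
static bool fred_ssp_ok(uint64_t ssp)
{
	return !(ssp & GENMASK_ULL(2, 0));
}
```

With these definitions the combined config mask works out to 0x834, so a
value that sets only bit 2 fails the check while one that sets only bits 1:0
passes.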



* [PATCH v6 19/20] KVM: nVMX: Add prerequisites to SHADOW_FIELD_R[OW] macros
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (17 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 18/20] KVM: nVMX: Add FRED-related VMCS field checks Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  2025-08-21 22:36 ` [PATCH v6 20/20] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Add VMX feature checks before accessing VMCS fields via SHADOW_FIELD_R[OW]
macros, as some fields may not be supported on all CPUs.

Functions like copy_shadow_to_vmcs12() and copy_vmcs12_to_shadow() access
VMCS fields that may not exist on certain hardware, such as
INJECTED_EVENT_DATA.  To avoid VMREAD/VMWRITE warnings, skip syncing fields
tied to unsupported VMX features.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.

Change since v2:
* Add __SHADOW_FIELD_R[OW] for better readability and maintainability (Sean).
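
For readers unfamiliar with the pattern, the idea of attaching a prerequisite
condition to each field-list entry can be sketched in a single file.  This is
only an illustration under stated assumptions: the field encodings, the
has_pml/has_fred flags and the FIELD_LIST() macro are hypothetical stand-ins,
whereas the patch itself re-includes vmcs_shadow_fields.h and uses the
cpu_has_vmx_*() predicates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature flags standing in for cpu_has_vmx_*() predicates. */
static bool has_pml = false;
static bool has_fred = true;

/* Hypothetical field encodings, for illustration only. */
enum { FIELD_A = 0x10, FIELD_PML = 0x20, FIELD_FRED = 0x30 };

/* One list, expanded several ways: encoding plus prerequisite condition. */
#define FIELD_LIST(F)			\
	F(FIELD_A,    true)		\
	F(FIELD_PML,  has_pml)		\
	F(FIELD_FRED, has_fred)

/* Expand the list into switch cases that return each entry's condition. */
static bool field_is_shadowed(unsigned long field)
{
	switch (field) {
#define AS_CASE(enc, cond) case enc: return (cond);
	FIELD_LIST(AS_CASE)
#undef AS_CASE
	default:
		return false;
	}
}

/* Walk the same list as an array and skip entries whose condition fails. */
static size_t count_synced_fields(void)
{
	static const unsigned long fields[] = {
#define AS_ENTRY(enc, cond) enc,
		FIELD_LIST(AS_ENTRY)
#undef AS_ENTRY
	};
	size_t i, n = 0;

	for (i = 0; i < sizeof(fields) / sizeof(fields[0]); i++) {
		if (!field_is_shadowed(fields[i]))
			continue;	/* field absent on this "CPU" */
		n++;
	}
	return n;
}
```

Keeping the condition next to the field definition means a new conditional
field only has to be described once, instead of in every switch statement
that walks the list.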
---
 arch/x86/kvm/vmx/nested.c             | 79 +++++++++++++++++++--------
 arch/x86/kvm/vmx/vmcs_shadow_fields.h | 41 +++++++++-----
 2 files changed, 83 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b56bbac36749..266115525b9e 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -55,14 +55,14 @@ struct shadow_vmcs_field {
 	u16	offset;
 };
 static struct shadow_vmcs_field shadow_read_only_fields[] = {
-#define SHADOW_FIELD_RO(x, y) { x, offsetof(struct vmcs12, y) },
+#define __SHADOW_FIELD_RO(x, y, c) { x, offsetof(struct vmcs12, y) },
 #include "vmcs_shadow_fields.h"
 };
 static int max_shadow_read_only_fields =
 	ARRAY_SIZE(shadow_read_only_fields);
 
 static struct shadow_vmcs_field shadow_read_write_fields[] = {
-#define SHADOW_FIELD_RW(x, y) { x, offsetof(struct vmcs12, y) },
+#define __SHADOW_FIELD_RW(x, y, c) { x, offsetof(struct vmcs12, y) },
 #include "vmcs_shadow_fields.h"
 };
 static int max_shadow_read_write_fields =
@@ -85,6 +85,17 @@ static void init_vmcs_shadow_fields(void)
 			pr_err("Missing field from shadow_read_only_field %x\n",
 			       field + 1);
 
+		switch (field) {
+#define __SHADOW_FIELD_RO(x, y, c)		\
+		case x:				\
+			if (!(c))		\
+				continue;	\
+			break;
+#include "vmcs_shadow_fields.h"
+		default:
+			break;
+		}
+
 		clear_bit(field, vmx_vmread_bitmap);
 		if (field & 1)
 #ifdef CONFIG_X86_64
@@ -110,24 +121,13 @@ static void init_vmcs_shadow_fields(void)
 			  field <= GUEST_TR_AR_BYTES,
 			  "Update vmcs12_write_any() to drop reserved bits from AR_BYTES");
 
-		/*
-		 * PML and the preemption timer can be emulated, but the
-		 * processor cannot vmwrite to fields that don't exist
-		 * on bare metal.
-		 */
 		switch (field) {
-		case GUEST_PML_INDEX:
-			if (!cpu_has_vmx_pml())
-				continue;
-			break;
-		case VMX_PREEMPTION_TIMER_VALUE:
-			if (!cpu_has_vmx_preemption_timer())
-				continue;
-			break;
-		case GUEST_INTR_STATUS:
-			if (!cpu_has_vmx_apicv())
-				continue;
+#define __SHADOW_FIELD_RW(x, y, c)		\
+		case x:				\
+			if (!(c))		\
+				continue;	\
 			break;
+#include "vmcs_shadow_fields.h"
 		default:
 			break;
 		}
@@ -1633,8 +1633,8 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
 /*
  * Copy the writable VMCS shadow fields back to the VMCS12, in case they have
  * been modified by the L1 guest.  Note, "writable" in this context means
- * "writable by the guest", i.e. tagged SHADOW_FIELD_RW; the set of
- * fields tagged SHADOW_FIELD_RO may or may not align with the "read-only"
+ * "writable by the guest", i.e. tagged __SHADOW_FIELD_RW; the set of
+ * fields tagged __SHADOW_FIELD_RO may or may not align with the "read-only"
  * VM-exit information fields (which are actually writable if the vCPU is
  * configured to support "VMWRITE to any supported field in the VMCS").
  */
@@ -1655,6 +1655,18 @@ static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
 
 	for (i = 0; i < max_shadow_read_write_fields; i++) {
 		field = shadow_read_write_fields[i];
+
+		switch (field.encoding) {
+#define __SHADOW_FIELD_RW(x, y, c)		\
+		case x:				\
+			if (!(c))		\
+				continue;	\
+			break;
+#include "vmcs_shadow_fields.h"
+		default:
+			break;
+		}
+
 		val = __vmcs_readl(field.encoding);
 		vmcs12_write_any(vmcs12, field.encoding, field.offset, val);
 	}
@@ -1689,6 +1701,23 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx)
 	for (q = 0; q < ARRAY_SIZE(fields); q++) {
 		for (i = 0; i < max_fields[q]; i++) {
 			field = fields[q][i];
+
+			switch (field.encoding) {
+#define __SHADOW_FIELD_RO(x, y, c)			\
+			case x:				\
+				if (!(c))		\
+					continue;	\
+				break;
+#define __SHADOW_FIELD_RW(x, y, c)			\
+			case x:				\
+				if (!(c))		\
+					continue;	\
+				break;
+#include "vmcs_shadow_fields.h"
+			default:
+				break;
+			}
+
 			val = vmcs12_read_any(vmcs12, field.encoding,
 					      field.offset);
 			__vmcs_writel(field.encoding, val);
@@ -5997,9 +6026,10 @@ static int handle_vmread(struct kvm_vcpu *vcpu)
 static bool is_shadow_field_rw(unsigned long field)
 {
 	switch (field) {
-#define SHADOW_FIELD_RW(x, y) case x:
+#define __SHADOW_FIELD_RW(x, y, c)	\
+	case x:				\
+		return c;
 #include "vmcs_shadow_fields.h"
-		return true;
 	default:
 		break;
 	}
@@ -6009,9 +6039,10 @@ static bool is_shadow_field_rw(unsigned long field)
 static bool is_shadow_field_ro(unsigned long field)
 {
 	switch (field) {
-#define SHADOW_FIELD_RO(x, y) case x:
+#define __SHADOW_FIELD_RO(x, y, c)	\
+	case x:				\
+		return c;
 #include "vmcs_shadow_fields.h"
-		return true;
 	default:
 		break;
 	}
diff --git a/arch/x86/kvm/vmx/vmcs_shadow_fields.h b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
index da338327c2b3..607945ada35f 100644
--- a/arch/x86/kvm/vmx/vmcs_shadow_fields.h
+++ b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
@@ -1,14 +1,17 @@
-#if !defined(SHADOW_FIELD_RO) && !defined(SHADOW_FIELD_RW)
+#if !defined(__SHADOW_FIELD_RO) && !defined(__SHADOW_FIELD_RW)
 BUILD_BUG_ON(1)
 #endif
 
-#ifndef SHADOW_FIELD_RO
-#define SHADOW_FIELD_RO(x, y)
+#ifndef __SHADOW_FIELD_RO
+#define __SHADOW_FIELD_RO(x, y, c)
 #endif
-#ifndef SHADOW_FIELD_RW
-#define SHADOW_FIELD_RW(x, y)
+#ifndef __SHADOW_FIELD_RW
+#define __SHADOW_FIELD_RW(x, y, c)
 #endif
 
+#define SHADOW_FIELD_RO(x, y) __SHADOW_FIELD_RO(x, y, true)
+#define SHADOW_FIELD_RW(x, y) __SHADOW_FIELD_RW(x, y, true)
+
 /*
  * We do NOT shadow fields that are modified when L0
  * traps and emulates any vmx instruction (e.g. VMPTRLD,
@@ -32,8 +35,12 @@ BUILD_BUG_ON(1)
  */
 
 /* 16-bits */
-SHADOW_FIELD_RW(GUEST_INTR_STATUS, guest_intr_status)
-SHADOW_FIELD_RW(GUEST_PML_INDEX, guest_pml_index)
+__SHADOW_FIELD_RW(GUEST_INTR_STATUS, guest_intr_status, cpu_has_vmx_apicv())
+/*
+ * PML can be emulated, but the processor cannot vmwrite to the VMCS field
+ * GUEST_PML_INDEX that doesn't exist on bare metal.
+ */
+__SHADOW_FIELD_RW(GUEST_PML_INDEX, guest_pml_index, cpu_has_vmx_pml())
 SHADOW_FIELD_RW(HOST_FS_SELECTOR, host_fs_selector)
 SHADOW_FIELD_RW(HOST_GS_SELECTOR, host_gs_selector)
 
@@ -41,9 +48,9 @@ SHADOW_FIELD_RW(HOST_GS_SELECTOR, host_gs_selector)
 SHADOW_FIELD_RO(VM_EXIT_REASON, vm_exit_reason)
 SHADOW_FIELD_RO(VM_EXIT_INTR_INFO, vm_exit_intr_info)
 SHADOW_FIELD_RO(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len)
+SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code)
 SHADOW_FIELD_RO(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field)
 SHADOW_FIELD_RO(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code)
-SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code)
 SHADOW_FIELD_RO(GUEST_CS_AR_BYTES, guest_cs_ar_bytes)
 SHADOW_FIELD_RO(GUEST_SS_AR_BYTES, guest_ss_ar_bytes)
 SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control)
@@ -54,7 +61,12 @@ SHADOW_FIELD_RW(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field)
 SHADOW_FIELD_RW(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len)
 SHADOW_FIELD_RW(TPR_THRESHOLD, tpr_threshold)
 SHADOW_FIELD_RW(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info)
-SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE, vmx_preemption_timer_value)
+/*
+ * The preemption timer can be emulated, but the processor cannot vmwrite to
+ * the VMCS field VMX_PREEMPTION_TIMER_VALUE that doesn't exist on bare metal.
+ */
+__SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE, vmx_preemption_timer_value,
+		  cpu_has_vmx_preemption_timer())
 
 /* Natural width */
 SHADOW_FIELD_RO(EXIT_QUALIFICATION, exit_qualification)
@@ -74,10 +86,13 @@ SHADOW_FIELD_RW(HOST_GS_BASE, host_gs_base)
 /* 64-bit */
 SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, guest_physical_address)
 SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address)
-SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA, original_event_data)
-SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA_HIGH, original_event_data)
-SHADOW_FIELD_RW(INJECTED_EVENT_DATA, injected_event_data)
-SHADOW_FIELD_RW(INJECTED_EVENT_DATA_HIGH, injected_event_data)
+__SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA, original_event_data, cpu_has_vmx_fred())
+__SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA_HIGH, original_event_data, cpu_has_vmx_fred())
+__SHADOW_FIELD_RW(INJECTED_EVENT_DATA, injected_event_data, cpu_has_vmx_fred())
+__SHADOW_FIELD_RW(INJECTED_EVENT_DATA_HIGH, injected_event_data, cpu_has_vmx_fred())
 
 #undef SHADOW_FIELD_RO
 #undef SHADOW_FIELD_RW
+
+#undef __SHADOW_FIELD_RO
+#undef __SHADOW_FIELD_RW
-- 
2.50.1



* [PATCH v6 20/20] KVM: nVMX: Allow VMX FRED controls
  2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
                   ` (18 preceding siblings ...)
  2025-08-21 22:36 ` [PATCH v6 19/20] KVM: nVMX: Add prerequisites to SHADOW_FIELD_R[OW] macros Xin Li (Intel)
@ 2025-08-21 22:36 ` Xin Li (Intel)
  19 siblings, 0 replies; 33+ messages in thread
From: Xin Li (Intel) @ 2025-08-21 22:36 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	xin, luto, peterz, andrew.cooper3, chao.gao, hch

From: Xin Li <xin3.li@intel.com>

Allow nVMX FRED controls as nested FRED support is in place.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---

Change in v5:
* Add TB from Xuelian Guo.
---
 arch/x86/kvm/vmx/nested.c | 5 +++--
 arch/x86/kvm/vmx/vmx.c    | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 266115525b9e..0b266e95db60 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -7436,7 +7436,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
 		 * advertise any feature in it to nVMX until its nVMX support
 		 * is ready.
 		 */
-		msrs->secondary_exit_ctls &= 0;
+		msrs->secondary_exit_ctls &= SECONDARY_VM_EXIT_SAVE_IA32_FRED |
+					     SECONDARY_VM_EXIT_LOAD_IA32_FRED;
 	}
 }
 
@@ -7452,7 +7453,7 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
 		VM_ENTRY_IA32E_MODE |
 #endif
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
-		VM_ENTRY_LOAD_CET_STATE;
+		VM_ENTRY_LOAD_CET_STATE | VM_ENTRY_LOAD_IA32_FRED;
 	msrs->entry_ctls_high |=
 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ac76cb33f3de..99106750b1e3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7957,6 +7957,7 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 
 	entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
 	cr4_fixed1_update(X86_CR4_LAM_SUP,    eax, feature_bit(LAM));
+	cr4_fixed1_update(X86_CR4_FRED,       eax, feature_bit(FRED));
 
 #undef cr4_fixed1_update
 }
-- 
2.50.1



* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-21 22:36 ` [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts Xin Li (Intel)
@ 2025-08-25  2:51   ` Xin Li
  2025-08-26 18:11     ` Sean Christopherson
  2025-08-26 18:50     ` Andrew Cooper
  0 siblings, 2 replies; 33+ messages in thread
From: Xin Li @ 2025-08-25  2:51 UTC (permalink / raw)
  To: linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	luto, peterz, andrew.cooper3, chao.gao, hch

On 8/21/2025 3:36 PM, Xin Li (Intel) wrote:
> +	/*
> +	 * MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP (aka MSR_IA32_FRED_SSP0) are
> +	 * designated for event delivery while executing in userspace.  Since
> +	 * KVM operates exclusively in kernel mode (the CPL is always 0 after
> +	 * any VM exit), KVM can safely retain and operate with the guest-defined
> +	 * values for MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP.
> +	 *
> +	 * Therefore, interception of MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP
> +	 * is not required.
> +	 *
> +	 * Note, save and restore of MSR_IA32_PL0_SSP belong to CET supervisor
> +	 * context management.  However the FRED SSP MSRs, including
> +	 * MSR_IA32_PL0_SSP, are supported by any processor that enumerates FRED.
> +	 * If such a processor does not support CET, FRED transitions will not
> +	 * use the MSRs, but the MSRs would still be accessible using MSR-access
> +	 * instructions (e.g., RDMSR, WRMSR).
> +	 */
> +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, intercept);
> +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, MSR_TYPE_RW, intercept);

Hi Sean,

I'd like to bring up an issue concerning MSR_IA32_PL0_SSP.

The FRED spec claims:

The FRED SSP MSRs are supported by any processor that enumerates
CPUID.(EAX=7,ECX=1):EAX.FRED[bit 17] as 1. If such a processor does not
support CET, FRED transitions will not use the MSRs (because shadow stacks
are not enabled), but the MSRs would still be accessible using MSR-access
instructions (e.g., RDMSR, WRMSR).

It means KVM needs to handle MSR_IA32_PL0_SSP even when FRED is supported
but CET is not.  And this can be broken down into two subtasks:

1) Allow such a guest to access MSR_IA32_PL0_SSP w/o triggering #GP.  And
this behavior is already implemented in patch 8 of this series.

2) Save and restore MSR_IA32_PL0_SSP in both KVM and QEMU for such a guest.

I have the patches for 2) but they are not included in this series, because

1) How much do we care about the value in MSR_IA32_PL0_SSP in such a guest?

Yes, Chao told me that you are the one saying that MSRs can be used as
clobber registers and KVM should preserve the value.  Does MSR_IA32_PL0_SSP
in such a guest count?

2) Saving/restoring MSR_IA32_PL0_SSP adds complexity, though it's seldom
used.  Is it worth it?

BTW I'm still working on a KVM unit test for it, using a L1 VMM that
enumerates FRED but not CET.

Thanks!
     Xin


* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-25  2:51   ` Xin Li
@ 2025-08-26 18:11     ` Sean Christopherson
  2025-08-26 21:59       ` Xin Li
  2025-08-26 18:50     ` Andrew Cooper
  1 sibling, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-08-26 18:11 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, kvm, linux-doc, pbonzini, corbet, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, andrew.cooper3, chao.gao,
	hch

On Sun, Aug 24, 2025, Xin Li wrote:
> On 8/21/2025 3:36 PM, Xin Li (Intel) wrote:
> > +	/*
> > +	 * MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP (aka MSR_IA32_FRED_SSP0) are
> > +	 * designated for event delivery while executing in userspace.  Since
> > +	 * KVM operates exclusively in kernel mode (the CPL is always 0 after
> > +	 * any VM exit), KVM can safely retain and operate with the guest-defined
> > +	 * values for MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP.
> > +	 *
> > +	 * Therefore, interception of MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP
> > +	 * is not required.
> > +	 *
> > +	 * Note, save and restore of MSR_IA32_PL0_SSP belong to CET supervisor
> > +	 * context management.  However the FRED SSP MSRs, including
> > +	 * MSR_IA32_PL0_SSP, are supported by any processor that enumerates FRED.
> > +	 * If such a processor does not support CET, FRED transitions will not
> > +	 * use the MSRs, but the MSRs would still be accessible using MSR-access
> > +	 * instructions (e.g., RDMSR, WRMSR).
> > +	 */
> > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, intercept);
> > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, MSR_TYPE_RW, intercept);
> 
> Hi Sean,
> 
> I'd like to bring up an issue concerning MSR_IA32_PL0_SSP.
> 
> The FRED spec claims:
> 
> The FRED SSP MSRs are supported by any processor that enumerates
> CPUID.(EAX=7,ECX=1):EAX.FRED[bit 17] as 1. If such a processor does not
> support CET, FRED transitions will not use the MSRs (because shadow stacks
> are not enabled), but the MSRs would still be accessible using MSR-access
> instructions (e.g., RDMSR, WRMSR).
> 
> It means KVM needs to handle MSR_IA32_PL0_SSP even when FRED is supported
> but CET is not.  And this can be broken down into two subtasks:
> 
> 1) Allow such a guest to access MSR_IA32_PL0_SSP w/o triggering #GP.  And
> this behavior is already implemented in patch 8 of this series.
> 
> 2) Save and restore MSR_IA32_PL0_SSP in both KVM and Qemu for such a guest.

What novel work needs to be done in KVM?  For QEMU, I assume it's just adding an
"or FRED" somewhere.  For KVM, I'm missing what additional work would be required
that wouldn't be naturally covered by patch 8 (assuming patch 8 is bug-free).

> I have the patches for 2) but they are not included in this series, because
> 
> 1) how much do we care the value in MSR_IA32_PL0_SSP in such a guest?
> 
> Yes, Chao told me that you are the one saying that MSRs can be used as
> clobber registers and KVM should preserve the value.  Does MSR_IA32_PL0_SSP
> in such a guest count?

If the architecture says that MSR_IA32_PL0_SSP exists and is accessible, then
KVM needs to honor that.

> 2) Saving/restoring MSR_IA32_PL0_SSP adds complexity, though it's seldom
> used.  Is it worth it?

Honoring the architecture is generally not optional.  There are extreme cases
where KVM violates that rule and takes an (often undocumented) erratum, e.g. APIC
base relocation would require an absurd amount of complexity for no real world
benefit.  But I would be very surprised if the complexity in KVM or QEMU to support
this scenario is at all meaningful, let alone enough to justify diverging from
the architectural spec.

> BTW I'm still working on a KVM unit test for it, using a L1 VMM that
> enumerates FRED but not CET.
> 
> Thanks!
>     Xin


* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-25  2:51   ` Xin Li
  2025-08-26 18:11     ` Sean Christopherson
@ 2025-08-26 18:50     ` Andrew Cooper
  2025-08-26 22:03       ` Xin Li
  1 sibling, 1 reply; 33+ messages in thread
From: Andrew Cooper @ 2025-08-26 18:50 UTC (permalink / raw)
  To: Xin Li, linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	luto, peterz, chao.gao, hch

On 25/08/2025 3:51 am, Xin Li wrote:
> On 8/21/2025 3:36 PM, Xin Li (Intel) wrote:
>> +    /*
>> +     * MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP (aka MSR_IA32_FRED_SSP0) are
>> +     * designated for event delivery while executing in userspace.  Since
>> +     * KVM operates exclusively in kernel mode (the CPL is always 0 after
>> +     * any VM exit), KVM can safely retain and operate with the guest-defined
>> +     * values for MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP.
>> +     *
>> +     * Therefore, interception of MSR_IA32_FRED_RSP0 and MSR_IA32_PL0_SSP
>> +     * is not required.
>> +     *
>> +     * Note, save and restore of MSR_IA32_PL0_SSP belong to CET supervisor
>> +     * context management.  However the FRED SSP MSRs, including
>> +     * MSR_IA32_PL0_SSP, are supported by any processor that enumerates FRED.
>> +     * If such a processor does not support CET, FRED transitions will not
>> +     * use the MSRs, but the MSRs would still be accessible using MSR-access
>> +     * instructions (e.g., RDMSR, WRMSR).
>> +     */
>> +    vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, intercept);
>> +    vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, MSR_TYPE_RW, intercept);
>
> Hi Sean,
>
> I'd like to bring up an issue concerning MSR_IA32_PL0_SSP.
>
> The FRED spec claims:
>
> The FRED SSP MSRs are supported by any processor that enumerates
> CPUID.(EAX=7,ECX=1):EAX.FRED[bit 17] as 1. If such a processor does not
> support CET, FRED transitions will not use the MSRs (because shadow stacks
> are not enabled), but the MSRs would still be accessible using MSR-access
> instructions (e.g., RDMSR, WRMSR).

This is silly.  AIUI, all CPUs that have FRED also have CET-SS, so in
practice they all have these MSRs.

But from an architectural point of view, if CET-SS isn't available,
these MSRs shouldn't be either.  A guest which can't use CET-SS has no
reason to touch these MSRs at all.

MSR_PL0_SSP (== MSR_FRED_SSP_SL0) is gated on CET-SS alone (it already
exists in CPUs), while MSR_FRED_SSP_SL{1..3} should be gated on CET-SS
&& FRED, and should be reserved[1] otherwise.

This distinction only matters for guests, and adding the CET-SS
precondition makes things simpler overall for both VMMs and guests.  So
can't this just be fixed up before being integrated into the SDM?

~Andrew

[1] I have a sneaking suspicion there's a SKU reason why the spec is
written that way, and "Reserved" is still the right behaviour to have
for !CET-SS || !FRED.
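
To make the difference between the quoted spec text and the gating suggested
here concrete, the two rules can be written as predicates.  This is a sketch
under stated assumptions: struct guest_caps and all four function names are
hypothetical, and the "spec" variants encode only what the quoted paragraph
and this mail say (the FRED SSP MSRs exist whenever FRED is enumerated, and
PL0_SSP already exists with CET-SS alone).

```c
#include <stdbool.h>

/* Hypothetical summary of what a guest enumerates. */
struct guest_caps {
	bool fred;	/* FRED enumerated */
	bool shstk;	/* CET shadow stacks (CET-SS) enumerated */
};

/* Spec as quoted: FRED alone is enough; CET-SS alone also has PL0_SSP. */
static bool pl0_ssp_exists_spec(struct guest_caps c)
{
	return c.fred || c.shstk;
}

/* Spec as quoted: the SL1..SL3 SSP MSRs exist whenever FRED is enumerated. */
static bool fred_ssp123_exists_spec(struct guest_caps c)
{
	return c.fred;
}

/* Proposal in this mail: PL0_SSP is gated on CET-SS alone. */
static bool pl0_ssp_exists_proposed(struct guest_caps c)
{
	return c.shstk;
}

/* Proposal in this mail: SL1..SL3 need both CET-SS and FRED. */
static bool fred_ssp123_exists_proposed(struct guest_caps c)
{
	return c.shstk && c.fred;
}
```

The two rule sets disagree only for a guest that enumerates FRED without
CET-SS, which is exactly the configuration under discussion.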


* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-26 18:11     ` Sean Christopherson
@ 2025-08-26 21:59       ` Xin Li
  2025-08-26 22:17         ` Sean Christopherson
  0 siblings, 1 reply; 33+ messages in thread
From: Xin Li @ 2025-08-26 21:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, linux-doc, pbonzini, corbet, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, andrew.cooper3, chao.gao,
	hch


>> Hi Sean,
>>
>> I'd like to bring up an issue concerning MSR_IA32_PL0_SSP.
>>
>> The FRED spec claims:
>>
>> The FRED SSP MSRs are supported by any processor that enumerates
>> CPUID.(EAX=7,ECX=1):EAX.FRED[bit 17] as 1. If such a processor does not
>> support CET, FRED transitions will not use the MSRs (because shadow stacks
>> are not enabled), but the MSRs would still be accessible using MSR-access
>> instructions (e.g., RDMSR, WRMSR).
>>
>> It means KVM needs to handle MSR_IA32_PL0_SSP even when FRED is supported
>> but CET is not.  And this can be broken down into two subtasks:
>>
>> 1) Allow such a guest to access MSR_IA32_PL0_SSP w/o triggering #GP.  And
>> this behavior is already implemented in patch 8 of this series.
>>
>> 2) Save and restore MSR_IA32_PL0_SSP in both KVM and Qemu for such a guest.
> 
> What novel work needs to be done in KVM?  For QEMU, I assume it's just adding an
> "or FRED" somewhere.  For KVM, I'm missing what additional work would be required
> that wouldn't be naturally covered by patch 8 (assuming patch 8 is bug-free).

Extra patches:

1) A patch to save/restore the guest MSR_IA32_PL0_SSP (i.e., FRED SSP0), as
we already do for RSP0.  The following applies on top of the patch that
saves/restores RSP0:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 449a5e02c7de..0bf684342a71 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1294,8 +1294,13 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)

  	wrmsrq(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);

-	if (guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_FRED)) {
  		wrmsrns(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
+
+		/* XSAVES/XRSTORS do not cover SSP MSRs */
+		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+			wrmsrns(MSR_IA32_FRED_SSP0, vmx->msr_guest_fred_ssp0);
+	}
  #else
  	savesegment(fs, fs_sel);
  	savesegment(gs, gs_sel);
@@ -1349,6 +1354,10 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
  		 * CPU exits to userspace (RSP0 is a per-task value).
  		 */
  		fred_sync_rsp0(vmx->msr_guest_fred_rsp0);
+
+		/* XSAVES/XRSTORS do not cover SSP MSRs */
+		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+			vmx->msr_guest_fred_ssp0 = read_msr(MSR_IA32_FRED_SSP0);
  	}
  #endif
  	load_fixmap_gdt(raw_smp_processor_id());
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 733fa2ef4bea..12c1cf827cb7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -228,6 +228,7 @@ struct vcpu_vmx {
  #ifdef CONFIG_X86_64
  	u64		      msr_guest_kernel_gs_base;
  	u64		      msr_guest_fred_rsp0;
+	u64		      msr_guest_fred_ssp0;
  #endif

  	u64		      spec_ctrl;

We might also want to zero the host MSR_IA32_PL0_SSP when switching to the host.


2) Add vmx_read_guest_fred_ssp0()/vmx_write_guest_fred_ssp0(), and use
them to read/write MSR_IA32_PL0_SSP in patch 8:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 99106750b1e3..cbdc67682d27 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1400,9 +1408,23 @@ static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
  	vmx_write_guest_host_msr(vmx, MSR_IA32_FRED_RSP0, data,
  				 &vmx->msr_guest_fred_rsp0);
  }
+
+static u64 vmx_read_guest_fred_ssp0(struct vcpu_vmx *vmx)
+{
+	return vmx_read_guest_host_msr(vmx, MSR_IA32_FRED_SSP0,
+				       &vmx->msr_guest_fred_ssp0);
+}
+
+static void vmx_write_guest_fred_ssp0(struct vcpu_vmx *vmx, u64 data)
+{
+	vmx_write_guest_host_msr(vmx, MSR_IA32_FRED_SSP0, data,
+				 &vmx->msr_guest_fred_ssp0);
+}
  #endif

  static void grow_ple_window(struct kvm_vcpu *vcpu)
@@ -2189,6 +2211,18 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
  	case MSR_IA32_DEBUGCTLMSR:
  		msr_info->data = vmx_guest_debugctl_read();
  		break;
+	case MSR_IA32_PL0_SSP:
+		/*
+		 * If kvm_cpu_cap_has(X86_FEATURE_SHSTK) but
+		 * !guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK), XSAVES/XRSTORS
+		 * cover SSP MSRs.
+		 */
+		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+		    guest_cpu_cap_has(vcpu, X86_FEATURE_FRED)) {
+			msr_info->data = vmx_read_guest_fred_ssp0(vmx);
+			break;
+		}
+		fallthrough;
  	default:
  	find_uret_msr:
  		msr = vmx_find_uret_msr(vmx, msr_info->index);
@@ -2540,7 +2574,18 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
  		}
  		ret = kvm_set_msr_common(vcpu, msr_info);
  		break;
-
+	case MSR_IA32_PL0_SSP:
+		/*
+		 * If kvm_cpu_cap_has(X86_FEATURE_SHSTK) but
+		 * !guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK), XSAVES/XRSTORS
+		 * cover SSP MSRs.
+		 */
+		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+		    guest_cpu_cap_has(vcpu, X86_FEATURE_FRED)) {
+			vmx_write_guest_fred_ssp0(vmx, data);
+			break;
+		}
+		fallthrough;
  	default:
  	find_uret_msr:
  		msr = vmx_find_uret_msr(vmx, msr_index);


3) Another change I was discussing with Chao:
https://lore.kernel.org/lkml/2ed04dff-e778-46c6-bd5f-51295763af06@zytor.com/

> 
>> I have the patches for 2) but they are not included in this series, because
>>
>> 1) how much do we care the value in MSR_IA32_PL0_SSP in such a guest?
>>
>> Yes, Chao told me that you are the one saying that MSRs can be used as
>> clobber registers and KVM should preserve the value.  Does MSR_IA32_PL0_SSP
>> in such a guest count?
> 
> If the architecture says that MSR_IA32_PL0_SSP exists and is accessible, then
> KVM needs to honor that.
> 
>> 2) Saving/restoring MSR_IA32_PL0_SSP adds complexity, though it's seldom
>> used.  Is it worth it?
> 
> Honoring the architecture is generally not optional.  There are extreme cases
> where KVM violates that rule and takes (often undocumented) erratum, e.g. APIC
> base relocation would require an absurd amount of complexity for no real world
> benefit.  But I would be very surprised if the complexity in KVM or QEMU to support
> this scenario is at all meaningful, let alone enough to justify diverging from
> the architectural spec.

Let me post v7 which includes all the required changes.

Thanks!
     Xin


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-26 18:50     ` Andrew Cooper
@ 2025-08-26 22:03       ` Xin Li
  2025-08-26 22:20         ` Andrew Cooper
  0 siblings, 1 reply; 33+ messages in thread
From: Xin Li @ 2025-08-26 22:03 UTC (permalink / raw)
  To: Andrew Cooper, linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	luto, peterz, chao.gao, hch

On 8/26/2025 11:50 AM, Andrew Cooper wrote:
> This distinction only matters for guests, and adding the CET-SS
> precondition makes things simpler overall for both VMMs and guests.  So
> can't this just be fixed up before being integrated into the SDM?

+1 :)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-26 21:59       ` Xin Li
@ 2025-08-26 22:17         ` Sean Christopherson
  2025-08-27 22:24           ` Xin Li
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-08-26 22:17 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, kvm, linux-doc, pbonzini, corbet, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, andrew.cooper3, chao.gao,
	hch

On Tue, Aug 26, 2025, Xin Li wrote:
> 
> > > Hi Sean,
> > > 
> > > I'd like to bring up an issue concerning MSR_IA32_PL0_SSP.
> > > 
> > > The FRED spec claims:
> > > 
> > > The FRED SSP MSRs are supported by any processor that enumerates
> > > CPUID.(EAX=7,ECX=1):EAX.FRED[bit 17] as 1. If such a processor does not
> > > support CET, FRED transitions will not use the MSRs (because shadow stacks
> > > are not enabled), but the MSRs would still be accessible using MSR-access
> > > instructions (e.g., RDMSR, WRMSR).
> > > 
> > > It means KVM needs to handle MSR_IA32_PL0_SSP even when FRED is supported
> > > but CET is not.  And this can be broken down into two subtasks:
> > > 
> > > 1) Allow such a guest to access MSR_IA32_PL0_SSP w/o triggering #GP.  And
> > > this behavior is already implemented in patch 8 of this series.
> > > 
> > > 2) Save and restore MSR_IA32_PL0_SSP in both KVM and Qemu for such a guest.
> > 
> > What novel work needs to be done in KVM?  For QEMU, I assume it's just adding an
> > "or FRED" somewhere.  For KVM, I'm missing what additional work would be required
> > that wouldn't be naturally covered by patch 8 (assuming patch 8 is bug-free).
> 
> Extra patches:
> 
> 1) A patch to save/restore guest MSR_IA32_PL0_SSP (i.e., FRED SSP0), as we
> have done for RSP0.  The following patch applies on top of the patch
> saving/restoring RSP0:
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 449a5e02c7de..0bf684342a71 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1294,8 +1294,13 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> 
>  	wrmsrq(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
> 
> -	if (guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
> +	if (guest_cpu_cap_has(vcpu, X86_FEATURE_FRED)) {
>  		wrmsrns(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
> +
> +		/* XSAVES/XRSTORS do not cover SSP MSRs */

Eww.  I'm with Andrew, fix the SDM.  This is silly.

> +		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> +			wrmsrns(MSR_IA32_FRED_SSP0, vmx->msr_guest_fred_ssp0);

FWIW, if we can't get an SDM change, don't bother with RDMSR/WRMSRNS, just
configure KVM to intercept accesses.  Then in kvm_set_msr_common(), pivot on
X86_FEATURE_SHSTK, e.g.

	case MSR_IA32_U_CET:
	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
			WARN_ON_ONCE(msr != MSR_IA32_FRED_SSP0);
			vcpu->arch.fred_rsp0_fallback = data;
			break;
		}

		kvm_set_xstate_msr(vcpu, msr_info);
		break;

and

	case MSR_IA32_U_CET:
	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
			WARN_ON_ONCE(msr_info->index != MSR_IA32_FRED_SSP0);
			msr_info->data = vcpu->arch.fred_rsp0_fallback;
			break;
		}

		kvm_get_xstate_msr(vcpu, msr_info);
		break;
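
For readers following along, the intercept-and-emulate idea above can be
modelled outside the kernel.  This is only a userspace sketch under stated
assumptions: struct vcpu_model, model_set_msr() and model_get_msr() are
hypothetical names, not KVM APIs, and the host_has_shstk flag merely stands in
for kvm_cpu_cap_has(X86_FEATURE_SHSTK).  The write path stores the guest value
in a software-only field, and the read path copies that stored value out to
the caller, so the hardware MSR is never touched when SHSTK is unsupported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index as in the kernel's msr-index.h; FRED_SSP0 aliases PL0_SSP. */
#define MSR_IA32_PL0_SSP	0x000006a4u
#define MSR_IA32_FRED_SSP0	MSR_IA32_PL0_SSP

static_assert(MSR_IA32_FRED_SSP0 == MSR_IA32_PL0_SSP,
	      "FRED SSP0 and PL0_SSP are the same MSR");

/* Hypothetical stand-in for per-vCPU state; not the real struct kvm_vcpu. */
struct vcpu_model {
	bool host_has_shstk;		/* models kvm_cpu_cap_has(X86_FEATURE_SHSTK) */
	uint64_t ssp0_fallback;		/* software-only copy used when !SHSTK */
	uint64_t xstate_pl0_ssp;	/* models the XSAVES-managed value */
};

/* Write path: returns 0 on success, 1 for an MSR this model does not handle. */
int model_set_msr(struct vcpu_model *v, uint32_t msr, uint64_t data)
{
	if (msr != MSR_IA32_FRED_SSP0)
		return 1;

	if (!v->host_has_shstk)
		v->ssp0_fallback = data;	/* purely virtual, no hardware access */
	else
		v->xstate_pl0_ssp = data;
	return 0;
}

/* Read path: copies the stored value out to the caller. */
int model_get_msr(const struct vcpu_model *v, uint32_t msr, uint64_t *data)
{
	if (msr != MSR_IA32_FRED_SSP0)
		return 1;

	*data = v->host_has_shstk ? v->xstate_pl0_ssp : v->ssp0_fallback;
	return 0;
}
```

A value written through model_set_msr() on a model without SHSTK reads back
unchanged through model_get_msr(), while the xstate-backed field stays
untouched.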

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-26 22:03       ` Xin Li
@ 2025-08-26 22:20         ` Andrew Cooper
  0 siblings, 0 replies; 33+ messages in thread
From: Andrew Cooper @ 2025-08-26 22:20 UTC (permalink / raw)
  To: Xin Li, linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	luto, peterz, chao.gao, hch

On 26/08/2025 11:03 pm, Xin Li wrote:
> On 8/26/2025 11:50 AM, Andrew Cooper wrote:
>> This distinction only matters for guests, and adding the CET-SS
>> precondition makes things simpler overall for both VMMs and guests.  So
>> can't this just be fixed up before being integrated into the SDM?
>
> +1 :)

I've just realised why these MSRs are tied together in this way.

As written, the VMX Entry/Exit Load/Save FRED controls do not allow for
a logical configuration of FRED && !CET-SS.  Both sets of stack pointers
are treated the same.

This is horrible.  I'm less certain if this can simply be fixed by
changing the SDM.

~Andrew

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 04/20] x86/cea: Export an API to get per CPU exception stacks for KVM to use
  2025-08-21 22:36 ` [PATCH v6 04/20] x86/cea: Export an API to get per CPU exception stacks for KVM to use Xin Li (Intel)
@ 2025-08-27 17:33   ` Dave Hansen
  2025-08-27 22:18     ` Xin Li
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2025-08-27 17:33 UTC (permalink / raw)
  To: Xin Li (Intel), linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	luto, peterz, andrew.cooper3, chao.gao, hch

On 8/21/25 15:36, Xin Li (Intel) wrote:
> FRED introduced new fields in the host-state area of the VMCS for
> stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
> corresponding to per CPU exception stacks for #DB, NMI and #DF.
> KVM must populate these each time a vCPU is loaded onto a CPU.
> 
> Convert the __this_cpu_ist_{top,bottom}_va() macros into real
> functions and export __this_cpu_ist_top_va().
> 
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>

Nit: I wouldn't use Suggested-by unless the person basically asked for
the *entire* patch. Christoph and I were asking for specific bits of
this, but neither of us asked for this patch as a whole.

> diff --git a/arch/x86/coco/sev/sev-nmi.c b/arch/x86/coco/sev/sev-nmi.c
> index d8dfaddfb367..73e34ad7a1a9 100644
> --- a/arch/x86/coco/sev/sev-nmi.c
> +++ b/arch/x86/coco/sev/sev-nmi.c
> @@ -30,7 +30,7 @@ static __always_inline bool on_vc_stack(struct pt_regs *regs)
>  	if (ip_within_syscall_gap(regs))
>  		return false;
>  
> -	return ((sp >= __this_cpu_ist_bottom_va(VC)) && (sp < __this_cpu_ist_top_va(VC)));
> +	return ((sp >= __this_cpu_ist_bottom_va(ESTACK_VC)) && (sp < __this_cpu_ist_top_va(ESTACK_VC)));
>  }

This rename is one of those things that had me scratching my head for a
minute. It wasn't obvious at _all_ why the VC=>ESTACK_VC "rename" is
necessary.

This needs to have been mentioned in the changelog.

Better yet would have been to do this in a separate patch because a big
chunk of this patch is just rename noise.

>  /*
> @@ -82,7 +82,7 @@ void noinstr __sev_es_ist_exit(void)
>  	/* Read IST entry */
>  	ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
>  
> -	if (WARN_ON(ist == __this_cpu_ist_top_va(VC)))
> +	if (WARN_ON(ist == __this_cpu_ist_top_va(ESTACK_VC)))
>  		return;
>  
>  	/* Read back old IST entry and write it to the TSS */
> diff --git a/arch/x86/coco/sev/vc-handle.c b/arch/x86/coco/sev/vc-handle.c
> index c3b4acbde0d8..88b6bc518a5a 100644
> --- a/arch/x86/coco/sev/vc-handle.c
> +++ b/arch/x86/coco/sev/vc-handle.c
> @@ -859,7 +859,7 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
>  
>  static __always_inline bool is_vc2_stack(unsigned long sp)
>  {
> -	return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
> +	return (sp >= __this_cpu_ist_bottom_va(ESTACK_VC2) && sp < __this_cpu_ist_top_va(ESTACK_VC2));
>  }
>  
>  static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
> diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
> index 462fc34f1317..8e17f0ca74e6 100644
> --- a/arch/x86/include/asm/cpu_entry_area.h
> +++ b/arch/x86/include/asm/cpu_entry_area.h
> @@ -46,7 +46,7 @@ struct cea_exception_stacks {
>   * The exception stack ordering in [cea_]exception_stacks
>   */
>  enum exception_stack_ordering {
> -	ESTACK_DF,
> +	ESTACK_DF = 0,
>  	ESTACK_NMI,
>  	ESTACK_DB,
>  	ESTACK_MCE,

Is this really required? I thought the first enum was always 0? Is this
just trying to ensure that ESTACKS_MEMBERS() defines a matching number
of N_EXCEPTION_STACKS stacks?

If that's the case, shouldn't this be represented with a BUILD_BUG_ON()?

> @@ -58,18 +58,15 @@ enum exception_stack_ordering {
>  #define CEA_ESTACK_SIZE(st)					\
>  	sizeof(((struct cea_exception_stacks *)0)->st## _stack)
>  
> -#define CEA_ESTACK_BOT(ceastp, st)				\
> -	((unsigned long)&(ceastp)->st## _stack)
> -
> -#define CEA_ESTACK_TOP(ceastp, st)				\
> -	(CEA_ESTACK_BOT(ceastp, st) + CEA_ESTACK_SIZE(st))
> -
>  #define CEA_ESTACK_OFFS(st)					\
>  	offsetof(struct cea_exception_stacks, st## _stack)
>  
>  #define CEA_ESTACK_PAGES					\
>  	(sizeof(struct cea_exception_stacks) / PAGE_SIZE)
>  
> +extern unsigned long __this_cpu_ist_top_va(enum exception_stack_ordering stack);
> +extern unsigned long __this_cpu_ist_bottom_va(enum exception_stack_ordering stack);
> +
>  #endif
>  
>  #ifdef CONFIG_X86_32
> @@ -144,10 +141,4 @@ static __always_inline struct entry_stack *cpu_entry_stack(int cpu)
>  	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
>  }
>  
> -#define __this_cpu_ist_top_va(name)					\
> -	CEA_ESTACK_TOP(__this_cpu_read(cea_exception_stacks), name)
> -
> -#define __this_cpu_ist_bottom_va(name)					\
> -	CEA_ESTACK_BOT(__this_cpu_read(cea_exception_stacks), name)
> -
>  #endif
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 34a054181c4d..cb14919f92da 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -2307,12 +2307,12 @@ static inline void setup_getcpu(int cpu)
>  static inline void tss_setup_ist(struct tss_struct *tss)
>  {
>  	/* Set up the per-CPU TSS IST stacks */
> -	tss->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(DF);
> -	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
> -	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
> -	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
> +	tss->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(ESTACK_DF);
> +	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(ESTACK_NMI);
> +	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(ESTACK_DB);
> +	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(ESTACK_MCE);

If you respin this, please vertically align these.

> +/*
> + * FRED introduced new fields in the host-state area of the VMCS for
> + * stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
> + * corresponding to per CPU stacks for #DB, NMI and #DF.  KVM must
> + * populate these each time a vCPU is loaded onto a CPU.
> + *
> + * Called from entry code, so must be noinstr.
> + */
> +noinstr unsigned long __this_cpu_ist_top_va(enum exception_stack_ordering stack)
> +{
> +	unsigned long base = (unsigned long)&(__this_cpu_read(cea_exception_stacks)->DF_stack);
> +	return base + EXCEPTION_STKSZ + stack * (EXCEPTION_STKSZ + PAGE_SIZE);
> +}
> +EXPORT_SYMBOL(__this_cpu_ist_top_va);
> +
> +noinstr unsigned long __this_cpu_ist_bottom_va(enum exception_stack_ordering stack)
> +{
> +	unsigned long base = (unsigned long)&(__this_cpu_read(cea_exception_stacks)->DF_stack);
> +	return base + stack * (EXCEPTION_STKSZ + PAGE_SIZE);
> +}

These are basically treating 'struct exception_stacks' like an array.
There's no type safety or anything here. It's just an open-coded array
access.

Also, starting with ->DF_stack is a bit goofy looking. It's not obvious
(or enforced) that it is stack #0 or at the beginning of the structure.

Shouldn't we be _trying_ to make this look like:

	struct cea_exception_stacks *s;
	s = __this_cpu_read(cea_exception_stacks);

	return &s[stack_nr].stack;

?

Where 'cea_exception_stacks' is an actual array:

	struct cea_exception_stacks[N_EXCEPTION_STACKS];

which might need to be embedded in a larger structure to get the
'IST_top_guard' without wasting space on an extra full stack.
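
To make the shape of that suggestion concrete, here is a userspace sketch.  It
is not the kernel's layout: the arr_ names, the sizes, and the choice to put
the guard in front of each stack are all assumptions for illustration.  The
point is only that indexing a real array gives type-checked access, and that a
wrapper struct can carry the trailing guard without a full extra stack.

```c
#include <assert.h>
#include <stdint.h>

#define ARR_STKSZ	8192	/* made-up stack size */
#define ARR_GUARD	4096	/* made-up guard page size */

enum arr_stack_nr { ARR_DF, ARR_NMI, ARR_DB, ARR_MCE, ARR_N_STACKS };

/* One guarded stack; the array element type. */
struct arr_estack {
	char guard[ARR_GUARD];
	char stack[ARR_STKSZ];
};

/* Wrapper that adds only the top guard after the last stack. */
struct arr_estacks {
	struct arr_estack stacks[ARR_N_STACKS];
	char top_guard[ARR_GUARD];
};

static_assert(sizeof(struct arr_estack) == ARR_GUARD + ARR_STKSZ,
	      "no padding expected between guard and stack");

/* Lowest address of stack 'nr'. */
uintptr_t arr_ist_bottom_va(const struct arr_estacks *s, enum arr_stack_nr nr)
{
	return (uintptr_t)s->stacks[nr].stack;
}

/* One past the highest address of stack 'nr'. */
uintptr_t arr_ist_top_va(const struct arr_estacks *s, enum arr_stack_nr nr)
{
	return arr_ist_bottom_va(s, nr) + sizeof(s->stacks[nr].stack);
}
```

With this layout the accessors no longer need to name ->DF_stack or repeat the
stride arithmetic; the compiler derives both from the array element type.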

>  static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, _cea_offset);
>  
>  static __always_inline unsigned int cea_offset(unsigned int cpu)
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 998bd807fc7b..1804eb86cc14 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -671,7 +671,7 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
>  		 * and then double-fault, though, because we're likely to
>  		 * break the console driver and lose most of the stack dump.
>  		 */
> -		call_on_stack(__this_cpu_ist_top_va(DF) - sizeof(void*),
> +		call_on_stack(__this_cpu_ist_top_va(ESTACK_DF) - sizeof(void*),
>  			      handle_stack_overflow,
>  			      ASM_CALL_ARG3,
>  			      , [arg1] "r" (regs), [arg2] "r" (address), [arg3] "r" (&info));


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 04/20] x86/cea: Export an API to get per CPU exception stacks for KVM to use
  2025-08-27 17:33   ` Dave Hansen
@ 2025-08-27 22:18     ` Xin Li
  0 siblings, 0 replies; 33+ messages in thread
From: Xin Li @ 2025-08-27 22:18 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm, linux-doc
  Cc: pbonzini, seanjc, corbet, tglx, mingo, bp, dave.hansen, x86, hpa,
	luto, peterz, andrew.cooper3, chao.gao, hch

>> Suggested-by: Christoph Hellwig <hch@infradead.org>
>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> 
> Nit: I wouldn't use Suggested-by unless the person basically asked for
> the *entire* patch. Christoph and I were asking for specific bits of
> this, but neither of us asked for this patch as a whole.

I added them because the patch was almost entirely rewritten to export
accessors instead of raw data; IOW, the approach changed completely.

But I will remove Suggested-by.

> 
>> diff --git a/arch/x86/coco/sev/sev-nmi.c b/arch/x86/coco/sev/sev-nmi.c
>> index d8dfaddfb367..73e34ad7a1a9 100644
>> --- a/arch/x86/coco/sev/sev-nmi.c
>> +++ b/arch/x86/coco/sev/sev-nmi.c
>> @@ -30,7 +30,7 @@ static __always_inline bool on_vc_stack(struct pt_regs *regs)
>>   	if (ip_within_syscall_gap(regs))
>>   		return false;
>>   
>> -	return ((sp >= __this_cpu_ist_bottom_va(VC)) && (sp < __this_cpu_ist_top_va(VC)));
>> +	return ((sp >= __this_cpu_ist_bottom_va(ESTACK_VC)) && (sp < __this_cpu_ist_top_va(ESTACK_VC)));
>>   }
> 
> This rename is one of those things that had me scratching my head for a
> minute. It wasn't obvious at _all_ why the VC=>ESTACK_VC "rename" is
> necessary.
> 
> This needs to have been mentioned in the changelog.
> 
> Better yet would have been to do this in a separate patch because a big
> chunk of this patch is just rename noise.

Sure, will do.

>> diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
>> index 462fc34f1317..8e17f0ca74e6 100644
>> --- a/arch/x86/include/asm/cpu_entry_area.h
>> +++ b/arch/x86/include/asm/cpu_entry_area.h
>> @@ -46,7 +46,7 @@ struct cea_exception_stacks {
>>    * The exception stack ordering in [cea_]exception_stacks
>>    */
>>   enum exception_stack_ordering {
>> -	ESTACK_DF,
>> +	ESTACK_DF = 0,
>>   	ESTACK_NMI,
>>   	ESTACK_DB,
>>   	ESTACK_MCE,
> 
> Is this really required? I thought the first enum was always 0? Is this
> just trying to ensure that ESTACKS_MEMBERS() defines a matching number
> of N_EXCEPTION_STACKS stacks?
> 
> If that's the case, shouldn't this be represented with a BUILD_BUG_ON()?

Will do BUILD_BUG_ON().

> 
>> @@ -58,18 +58,15 @@ enum exception_stack_ordering {
>>   #define CEA_ESTACK_SIZE(st)					\
>>   	sizeof(((struct cea_exception_stacks *)0)->st## _stack)
>>   
>> -#define CEA_ESTACK_BOT(ceastp, st)				\
>> -	((unsigned long)&(ceastp)->st## _stack)
>> -
>> -#define CEA_ESTACK_TOP(ceastp, st)				\
>> -	(CEA_ESTACK_BOT(ceastp, st) + CEA_ESTACK_SIZE(st))
>> -
>>   #define CEA_ESTACK_OFFS(st)					\
>>   	offsetof(struct cea_exception_stacks, st## _stack)
>>   
>>   #define CEA_ESTACK_PAGES					\
>>   	(sizeof(struct cea_exception_stacks) / PAGE_SIZE)
>>   
>> +extern unsigned long __this_cpu_ist_top_va(enum exception_stack_ordering stack);
>> +extern unsigned long __this_cpu_ist_bottom_va(enum exception_stack_ordering stack);
>> +
>>   #endif
>>   
>>   #ifdef CONFIG_X86_32
>> @@ -144,10 +141,4 @@ static __always_inline struct entry_stack *cpu_entry_stack(int cpu)
>>   	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
>>   }
>>   
>> -#define __this_cpu_ist_top_va(name)					\
>> -	CEA_ESTACK_TOP(__this_cpu_read(cea_exception_stacks), name)
>> -
>> -#define __this_cpu_ist_bottom_va(name)					\
>> -	CEA_ESTACK_BOT(__this_cpu_read(cea_exception_stacks), name)
>> -
>>   #endif
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 34a054181c4d..cb14919f92da 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -2307,12 +2307,12 @@ static inline void setup_getcpu(int cpu)
>>   static inline void tss_setup_ist(struct tss_struct *tss)
>>   {
>>   	/* Set up the per-CPU TSS IST stacks */
>> -	tss->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(DF);
>> -	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
>> -	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
>> -	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
>> +	tss->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(ESTACK_DF);
>> +	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(ESTACK_NMI);
>> +	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(ESTACK_DB);
>> +	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(ESTACK_MCE);
> 
> If you respin this, please vertically align these.

NP.

> 
>> +/*
>> + * FRED introduced new fields in the host-state area of the VMCS for
>> + * stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
>> + * corresponding to per CPU stacks for #DB, NMI and #DF.  KVM must
>> + * populate these each time a vCPU is loaded onto a CPU.
>> + *
>> + * Called from entry code, so must be noinstr.
>> + */
>> +noinstr unsigned long __this_cpu_ist_top_va(enum exception_stack_ordering stack)
>> +{
>> +	unsigned long base = (unsigned long)&(__this_cpu_read(cea_exception_stacks)->DF_stack);
>> +	return base + EXCEPTION_STKSZ + stack * (EXCEPTION_STKSZ + PAGE_SIZE);
>> +}
>> +EXPORT_SYMBOL(__this_cpu_ist_top_va);
>> +
>> +noinstr unsigned long __this_cpu_ist_bottom_va(enum exception_stack_ordering stack)
>> +{
>> +	unsigned long base = (unsigned long)&(__this_cpu_read(cea_exception_stacks)->DF_stack);
>> +	return base + stack * (EXCEPTION_STKSZ + PAGE_SIZE);
>> +}
> 
> These are basically treating 'struct exception_stacks' like an array.
> There's no type safety or anything here. It's just an open-coded array
> access.
> 
> Also, starting with ->DF_stack is a bit goofy looking. It's not obvious
> (or enforced) that it is stack #0 or at the beginning of the structure.
> 
> Shouldn't we be _trying_ to make this look like:
> 
> 	struct cea_exception_stacks *s;
> 	s = __this_cpu_read(cea_exception_stacks);
> 
> 	return &s[stack_nr].stack;
> 
> ?
> 
> Where 'cea_exception_stacks' is an actual array:
> 
> 	struct cea_exception_stacks[N_EXCEPTION_STACKS];
> 
> which might need to be embedded in a larger structure to get the
> 'IST_top_guard' without wasting space on an extra full stack.
> 

Good suggestion!

Thanks!
     Xin

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-26 22:17         ` Sean Christopherson
@ 2025-08-27 22:24           ` Xin Li
  2025-08-27 22:43             ` Xin Li
  0 siblings, 1 reply; 33+ messages in thread
From: Xin Li @ 2025-08-27 22:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, linux-doc, pbonzini, corbet, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, andrew.cooper3, chao.gao,
	hch

On 8/26/2025 3:17 PM, Sean Christopherson wrote:
>> +		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
>> +			wrmsrns(MSR_IA32_FRED_SSP0, vmx->msr_guest_fred_ssp0);
> FWIW, if we can't get an SDM change, don't bother with RDMSR/WRMSRNS, just
> configure KVM to intercept accesses.  Then in kvm_set_msr_common(), pivot on
> X86_FEATURE_SHSTK, e.g.


Intercepting is a solid approach: it ensures the guest value is fully
virtual and does not affect the hardware FRED SSP0 MSR.  Of course the code
is also simplified.


> 
> 	case MSR_IA32_U_CET:
> 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> 		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> 			WARN_ON_ONCE(msr != MSR_IA32_FRED_SSP0);
> 			vcpu->arch.fred_rsp0_fallback = data;
> 			break;
> 		}
> 
> 		kvm_set_xstate_msr(vcpu, msr_info);
> 		break;
> 
> and
> 
> 	case MSR_IA32_U_CET:
> 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> 		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> 			WARN_ON_ONCE(msr_info->index != MSR_IA32_FRED_SSP0);
> 			msr_info->data = vcpu->arch.fred_rsp0_fallback;
> 			break;
> 		}
> 
> 		kvm_get_xstate_msr(vcpu, msr_info);
> 		break;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-27 22:24           ` Xin Li
@ 2025-08-27 22:43             ` Xin Li
  2025-08-28 23:32               ` Sean Christopherson
  0 siblings, 1 reply; 33+ messages in thread
From: Xin Li @ 2025-08-27 22:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, linux-doc, pbonzini, corbet, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, andrew.cooper3, chao.gao,
	hch

On 8/27/2025 3:24 PM, Xin Li wrote:
> On 8/26/2025 3:17 PM, Sean Christopherson wrote:
>>> +        if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
>>> +            wrmsrns(MSR_IA32_FRED_SSP0, vmx->msr_guest_fred_ssp0);
>> FWIW, if we can't get an SDM change, don't bother with RDMSR/WRMSRNS, just
>> configure KVM to intercept accesses.  Then in kvm_set_msr_common(), pivot on
>> X86_FEATURE_SHSTK, e.g.
> 
> 
> Intercepting is a solid approach: it ensures the guest value is fully
> virtual and does not affect the hardware FRED SSP0 MSR.  Of course the code
> is also simplified.
> 
> 
>>
>>     case MSR_IA32_U_CET:
>>     case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
>>         if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
>>             WARN_ON_ONCE(msr != MSR_IA32_FRED_SSP0);
>>             vcpu->arch.fred_rsp0_fallback = data;

Putting fred_rsp0_fallback in struct kvm_vcpu_arch reminds me of one thing:

We know AMD will do FRED and follow the FRED spec for bare metal, but
regarding virtualization of FRED, I have no idea how it will be done on
AMD, so I keep the KVM FRED code in VMX files, e.g., msr_guest_fred_rsp0 is
defined in struct vcpu_vmx, and saved/restored in vmx.c.

It is a future task to make common KVM FRED code for Intel and AMD.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts
  2025-08-27 22:43             ` Xin Li
@ 2025-08-28 23:32               ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2025-08-28 23:32 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, kvm, linux-doc, pbonzini, corbet, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, andrew.cooper3, chao.gao,
	hch

On Wed, Aug 27, 2025, Xin Li wrote:
> On 8/27/2025 3:24 PM, Xin Li wrote:
> > On 8/26/2025 3:17 PM, Sean Christopherson wrote:
> > > > +        if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > > +            wrmsrns(MSR_IA32_FRED_SSP0, vmx->msr_guest_fred_ssp0);
> > > FWIW, if we can't get an SDM change, don't bother with RDMSR/WRMSRNS, just
> > > configure KVM to intercept accesses.  Then in kvm_set_msr_common(), pivot on
> > > X86_FEATURE_SHSTK, e.g.
> > 
> > 
> > Intercepting is a solid approach: it ensures the guest value is fully
> > virtual and does not affect the hardware FRED SSP0 MSR.  Of course the code
> > is also simplified.
> > 
> > 
> > > 
> > >     case MSR_IA32_U_CET:
> > >     case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > >         if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> > >             WARN_ON_ONCE(msr != MSR_IA32_FRED_SSP0);
> > >             vcpu->arch.fred_rsp0_fallback = data;
> 
> Putting fred_rsp0_fallback in struct kvm_vcpu_arch reminds me of one thing:
> 
> We know AMD will do FRED and follow the FRED spec for bare metal, but
> regarding virtualization of FRED, I have no idea how it will be done on
> AMD, so I keep the KVM FRED code in VMX files, e.g., msr_guest_fred_rsp0 is
> defined in struct vcpu_vmx, and saved/restored in vmx.c.

The problem is that if you do that, then the handling of MSR_IA32_PL0_SSP takes
completely different paths depending on vendor, theoretically on hardware, and
on guest CPUID model.  That makes it _really_ difficult to understand how PL0_SSP
is emulated by KVM.

And I actually think that's moot anyways.  KVM _always_ needs to emulate MSR
accesses in software, and the whole goofy PL0_SSP behavior is a bare metal quirk,
not a virtualization quirk.  So unless AMD defines a different architecture (which
is certainly possible), AMD will also need arch.fred_rsp0_fallback.

> It is a future task to make common KVM FRED code for Intel and AMD.

No, this is not how I want to approach hardware enabling.  KVM needs to guard
against false advertising, e.g. ensure likely-to-be-common CPUID features are
explicitly cleared in the other vendor.  But deliberately burying code that's
vendor agnostic in whatever vendor support happens to come along first isn't
necessary by any means, and is usually a net negative in the grand scheme, and
often in a big way.

E.g. in this case, if arch.fred_rsp0_fallback ends up being unnecessary for AMD,
we probably don't even need to do anything, KVM will just have a field that's
only used on Intel because the quirky scenario can't be reached on AMD.

But if we bury the code in VMX, then the _best_ case scenario is that KVM carries
a weird split of responsibility in perpetuity (happy path handled in x86.c, rare
sad path handled in vmx.c).  And the worst case scenario is that we carry the
weird split for some time, and then have to undo all of it when AMD support comes
along.  Actually, the worst case scenario is that we forget about the VMX code
and re-implement the same thing in svm.c.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-08-28 23:32 UTC | newest]

Thread overview: 33+ messages
2025-08-21 22:36 [PATCH v6 00/20] Enable FRED with KVM VMX Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 01/20] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 02/20] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 03/20] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 04/20] x86/cea: Export an API to get per CPU exception stacks for KVM to use Xin Li (Intel)
2025-08-27 17:33   ` Dave Hansen
2025-08-27 22:18     ` Xin Li
2025-08-21 22:36 ` [PATCH v6 05/20] KVM: VMX: Initialize VMCS FRED fields Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 06/20] KVM: VMX: Set FRED MSR intercepts Xin Li (Intel)
2025-08-25  2:51   ` Xin Li
2025-08-26 18:11     ` Sean Christopherson
2025-08-26 21:59       ` Xin Li
2025-08-26 22:17         ` Sean Christopherson
2025-08-27 22:24           ` Xin Li
2025-08-27 22:43             ` Xin Li
2025-08-28 23:32               ` Sean Christopherson
2025-08-26 18:50     ` Andrew Cooper
2025-08-26 22:03       ` Xin Li
2025-08-26 22:20         ` Andrew Cooper
2025-08-21 22:36 ` [PATCH v6 07/20] KVM: VMX: Save/restore guest FRED RSP0 Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 08/20] KVM: VMX: Add support for FRED context save/restore Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 09/20] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 10/20] KVM: VMX: Virtualize FRED event_data Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 11/20] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 12/20] KVM: x86: Save/restore the nested flag of an exception Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 13/20] KVM: x86: Mark CR4.FRED as not reserved Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 14/20] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 15/20] KVM: x86: Advertise support for FRED Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 16/20] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 17/20] KVM: nVMX: Add FRED VMCS fields to nested VMX context handling Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 18/20] KVM: nVMX: Add FRED-related VMCS field checks Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 19/20] KVM: nVMX: Add prerequisites to SHADOW_FIELD_R[OW] macros Xin Li (Intel)
2025-08-21 22:36 ` [PATCH v6 20/20] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
