* [PATCH v4 00/19] Enable FRED with KVM VMX
@ 2025-03-28 17:11 Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 01/19] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
` (19 more replies)
0 siblings, 20 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
This patch set enables the Intel flexible return and event delivery
(FRED) architecture with KVM VMX to allow guests to utilize FRED.
The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:
1) Improve overall performance and response time by replacing event
delivery through the interrupt descriptor table (IDT event
delivery) and event return by the IRET instruction with lower
latency transitions.
2) Improve software robustness by ensuring that event delivery
establishes the full supervisor context and that event return
establishes the full user context.
The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is also used to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.
The Intel VMX architecture is extended to run FRED guests, and the
major changes are:
1) New VMCS fields for FRED context management, which includes two new
event data VMCS fields, eight new guest FRED context VMCS fields and
eight new host FRED context VMCS fields.
2) VMX nested-exception support for proper virtualization of stack
levels introduced with FRED architecture.
Search for the latest FRED spec in most search engines with this search
pattern:
site:intel.com FRED (flexible return and event delivery) specification
Following is the link to the v3 of this patch set:
https://lore.kernel.org/lkml/20241001050110.3643764-1-xin@zytor.com/
Since several preparatory patches in v3 have been merged, and Sean
reiterated that it's NOT worth precisely tracking which fields are/
aren't supported [1], the v4 patch count is reduced to 19.
Although FRED and CET supervisor shadow stacks are independent CPU
features, FRED unconditionally includes the FRED shadow stack pointer
MSRs IA32_FRED_SSP[0123], and IA32_FRED_SSP0 is just an alias of the
CET MSR IA32_PL0_SSP. IOW, the state management of MSR IA32_PL0_SSP
becomes an overlap area, and Sean requested that FRED virtualization
land after CET virtualization [2].
[1]: https://lore.kernel.org/lkml/Z73uK5IzVoBej3mi@google.com/
[2]: https://lore.kernel.org/kvm/ZvQaNRhrsSJTYji3@google.com/
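The IA32_FRED_SSP0/IA32_PL0_SSP aliasing mentioned above means the two
MSR names refer to the same underlying register, so a write through
either name is visible through the other. A minimal illustrative sketch
(the MSR index 0x6a4 matches the kernel's msr-index.h definitions; the
tiny simulated "MSR file" and the rdmsr_sim/wrmsr_sim helpers are
hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Conceptual sketch of the IA32_FRED_SSP0 / IA32_PL0_SSP alias: both
 * names resolve to the same MSR index, hence the same register. The
 * simulated MSR accessors below are purely illustrative.
 */
#define MSR_IA32_PL0_SSP	0x000006a4
#define MSR_IA32_FRED_SSP0	MSR_IA32_PL0_SSP	/* alias: same register */

static uint64_t pl0_ssp;	/* single backing register */

static void wrmsr_sim(uint32_t idx, uint64_t val)
{
	if (idx == MSR_IA32_PL0_SSP)
		pl0_ssp = val;
}

static uint64_t rdmsr_sim(uint32_t idx)
{
	return idx == MSR_IA32_PL0_SSP ? pl0_ssp : 0;
}
```

This is why the state management of the one shared register becomes an
overlap area between FRED and CET virtualization.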
Xin Li (17):
KVM: VMX: Add support for the secondary VM exit controls
KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config
KVM: VMX: Disable FRED if FRED consistency checks fail
KVM: VMX: Initialize VMCS FRED fields
KVM: VMX: Set FRED MSR interception
KVM: VMX: Save/restore guest FRED RSP0
KVM: VMX: Add support for FRED context save/restore
KVM: x86: Add a helper to detect if FRED is enabled for a vCPU
KVM: VMX: Virtualize FRED event_data
KVM: VMX: Virtualize FRED nested exception tracking
KVM: x86: Mark CR4.FRED as not reserved
KVM: VMX: Dump FRED context in dump_vmcs()
KVM: x86: Allow FRED/LKGS to be advertised to guests
KVM: nVMX: Add support for the secondary VM exit controls
KVM: nVMX: Add FRED VMCS fields to nested VMX context management
KVM: nVMX: Add VMCS FRED states checking
KVM: nVMX: Allow VMX FRED controls
Xin Li (Intel) (2):
x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use
KVM: x86: Save/restore the nested flag of an exception
Documentation/virt/kvm/api.rst | 19 ++
Documentation/virt/kvm/x86/nested-vmx.rst | 19 ++
arch/x86/include/asm/kvm_host.h | 8 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 48 ++++-
arch/x86/include/uapi/asm/kvm.h | 4 +-
arch/x86/kvm/cpuid.c | 2 +
arch/x86/kvm/kvm_cache_regs.h | 15 ++
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/capabilities.h | 26 ++-
arch/x86/kvm/vmx/nested.c | 188 ++++++++++++++++-
arch/x86/kvm/vmx/nested.h | 22 ++
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmcs12.c | 19 ++
arch/x86/kvm/vmx/vmcs12.h | 38 ++++
arch/x86/kvm/vmx/vmcs_shadow_fields.h | 4 +
arch/x86/kvm/vmx/vmx.c | 237 ++++++++++++++++++++--
arch/x86/kvm/vmx/vmx.h | 15 +-
arch/x86/kvm/x86.c | 74 ++++++-
arch/x86/kvm/x86.h | 8 +-
arch/x86/mm/cpu_entry_area.c | 7 +
include/uapi/linux/kvm.h | 1 +
22 files changed, 727 insertions(+), 31 deletions(-)
base-commit: acb4f33713b9f6cadb6143f211714c343465411c
--
2.48.1
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v4 01/19] KVM: VMX: Add support for the secondary VM exit controls
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 02/19] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config Xin Li (Intel)
` (18 subsequent siblings)
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Always load the secondary VM exit controls to prepare for FRED enabling.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---
Changes in v4:
* Fix clearing VM_EXIT_ACTIVATE_SECONDARY_CONTROLS (Chao Gao).
* Check VM exit/entry consistency based on the new macro from Sean
Christopherson.
Change in v3:
* Do FRED controls consistency checks in the VM exit/entry consistency
check framework (Sean Christopherson).
Change in v2:
* Always load the secondary VM exit controls (Sean Christopherson).
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 3 +++
arch/x86/kvm/vmx/capabilities.h | 9 ++++++++-
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmx.c | 29 +++++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmx.h | 7 ++++++-
6 files changed, 46 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e6134ef2263d..9e97ac6a823a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1187,6 +1187,7 @@
#define MSR_IA32_VMX_TRUE_ENTRY_CTLS 0x00000490
#define MSR_IA32_VMX_VMFUNC 0x00000491
#define MSR_IA32_VMX_PROCBASED_CTLS3 0x00000492
+#define MSR_IA32_VMX_EXIT_CTLS2 0x00000493
/* Resctrl MSRs: */
/* - Intel: */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8707361b24da..47626773a9e1 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -106,6 +106,7 @@
#define VM_EXIT_CLEAR_BNDCFGS 0x00800000
#define VM_EXIT_PT_CONCEAL_PIP 0x01000000
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
+#define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS 0x80000000
#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff
@@ -258,6 +259,8 @@ enum vmcs_field {
TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
PID_POINTER_TABLE = 0x00002042,
PID_POINTER_TABLE_HIGH = 0x00002043,
+ SECONDARY_VM_EXIT_CONTROLS = 0x00002044,
+ SECONDARY_VM_EXIT_CONTROLS_HIGH = 0x00002045,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x00002401,
VMCS_LINK_POINTER = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index cb6588238f46..b2aefee59395 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -59,8 +59,9 @@ struct vmcs_config {
u32 cpu_based_exec_ctrl;
u32 cpu_based_2nd_exec_ctrl;
u64 cpu_based_3rd_exec_ctrl;
- u32 vmexit_ctrl;
u32 vmentry_ctrl;
+ u32 vmexit_ctrl;
+ u64 vmexit_2nd_ctrl;
u64 misc;
struct nested_vmx_msrs nested;
};
@@ -136,6 +137,12 @@ static inline bool cpu_has_tertiary_exec_ctrls(void)
CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
}
+static inline bool cpu_has_secondary_vmexit_ctrls(void)
+{
+ return vmcs_config.vmexit_ctrl &
+ VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
+}
+
static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
{
return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index b25625314658..ae152a9d1963 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -47,6 +47,7 @@ struct vmcs_host_state {
struct vmcs_controls_shadow {
u32 vm_entry;
u32 vm_exit;
+ u64 secondary_vm_exit;
u32 pin;
u32 exec;
u32 secondary_exec;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5c5766467a61..f1348b140e7c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2614,8 +2614,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
u32 _cpu_based_exec_control = 0;
u32 _cpu_based_2nd_exec_control = 0;
u64 _cpu_based_3rd_exec_control = 0;
- u32 _vmexit_control = 0;
u32 _vmentry_control = 0;
+ u32 _vmexit_control = 0;
+ u64 _vmexit2_control = 0;
u64 basic_msr;
u64 misc_msr;
@@ -2635,6 +2636,12 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
{ VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
};
+ struct {
+ u32 entry_control;
+ u64 exit_control;
+ } const vmcs_entry_exit2_pairs[] = {
+ };
+
memset(vmcs_conf, 0, sizeof(*vmcs_conf));
if (adjust_vmx_controls(KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL,
@@ -2721,10 +2728,19 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
&_vmentry_control))
return -EIO;
+ if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+ _vmexit2_control =
+ adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
+ MSR_IA32_VMX_EXIT_CTLS2);
+
if (vmx_check_entry_exit_pairs(vmcs_entry_exit_pairs,
_vmentry_control, _vmexit_control))
return -EIO;
+ if (vmx_check_entry_exit_pairs(vmcs_entry_exit2_pairs,
+ _vmentry_control, _vmexit2_control))
+ return -EIO;
+
/*
* Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they
* can't be used due to an errata where VM Exit may incorrectly clear
@@ -2773,8 +2789,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
- vmcs_conf->vmexit_ctrl = _vmexit_control;
vmcs_conf->vmentry_ctrl = _vmentry_control;
+ vmcs_conf->vmexit_ctrl = _vmexit_control;
+ vmcs_conf->vmexit_2nd_ctrl = _vmexit2_control;
vmcs_conf->misc = misc_msr;
#if IS_ENABLED(CONFIG_HYPERV)
@@ -4471,6 +4488,11 @@ static u32 vmx_vmexit_ctrl(void)
~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
}
+static u64 vmx_secondary_vmexit_ctrl(void)
+{
+ return vmcs_config.vmexit_2nd_ctrl;
+}
+
void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4819,6 +4841,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vm_exit_controls_set(vmx, vmx_vmexit_ctrl());
+ if (cpu_has_secondary_vmexit_ctrls())
+ secondary_vm_exit_controls_set(vmx, vmx_secondary_vmexit_ctrl());
+
/* 22.2.1, 20.8.1 */
vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 951e44dc9d0e..d0e026390d40 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -508,7 +508,11 @@ static inline u8 vmx_get_rvi(void)
VM_EXIT_LOAD_IA32_EFER | \
VM_EXIT_CLEAR_BNDCFGS | \
VM_EXIT_PT_CONCEAL_PIP | \
- VM_EXIT_CLEAR_IA32_RTIT_CTL)
+ VM_EXIT_CLEAR_IA32_RTIT_CTL | \
+ VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+
+#define KVM_REQUIRED_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
(PIN_BASED_EXT_INTR_MASK | \
@@ -613,6 +617,7 @@ static __always_inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##b
}
BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(secondary_vm_exit, SECONDARY_VM_EXIT_CONTROLS, 64)
BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
--
2.48.1
* [PATCH v4 02/19] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 01/19] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-04-14 7:41 ` Chao Gao
2025-03-28 17:11 ` [PATCH v4 03/19] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
` (17 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Set up VM entry/exit FRED controls in the global vmcs_config for proper
FRED VMCS fields management:
1) load guest FRED state upon VM entry.
2) save guest FRED state during VM exit.
3) load host FRED state during VM exit.
Also add FRED control consistency checks to the existing VM entry/exit
consistency check framework.
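The consistency-check framework mentioned above pairs each VM-entry
control with the VM-exit control(s) that must accompany it: either both
sides of a pair are present, or neither may be used. A simplified
sketch of that idea (hypothetical names, not KVM's actual
vmx_check_entry_exit_pairs() implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified sketch of the VM entry/exit consistency-pair idea: for
 * each pair, either the VM-entry control and all of its partner
 * VM-exit controls are present, or neither side may be used.
 */
struct entry_exit_pair {
	uint32_t entry_control;
	uint64_t exit_control;
};

/* Returns 0 if every pair is consistent, -1 otherwise. */
static int check_entry_exit_pairs(const struct entry_exit_pair *pairs,
				  size_t nr, uint32_t entry, uint64_t exit)
{
	for (size_t i = 0; i < nr; i++) {
		/* Entry bit set must imply all partner exit bits set. */
		int has_entry = (entry & pairs[i].entry_control) != 0;
		int has_exit = (exit & pairs[i].exit_control) ==
			       pairs[i].exit_control;

		if (has_entry != has_exit)
			return -1;
	}
	return 0;
}
```

For FRED, the pair is VM_ENTRY_LOAD_IA32_FRED on the entry side and
SECONDARY_VM_EXIT_SAVE_IA32_FRED | SECONDARY_VM_EXIT_LOAD_IA32_FRED on
the exit side, as added to the pair tables in the diff below.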
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
---
Change in v4:
* Do VM exit/entry consistency checks using the new macro from Sean
Christopherson.
Changes in v3:
* Add FRED control consistency checks to the existing VM entry/exit
consistency check framework (Sean Christopherson).
* Just do the unnecessary FRED state load/store on every VM entry/exit
(Sean Christopherson).
---
arch/x86/include/asm/vmx.h | 4 ++++
arch/x86/kvm/vmx/vmx.c | 3 +++
arch/x86/kvm/vmx/vmx.h | 7 +++++--
3 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 47626773a9e1..5598517617a5 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -108,6 +108,9 @@
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
#define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS 0x80000000
+#define SECONDARY_VM_EXIT_SAVE_IA32_FRED BIT_ULL(0)
+#define SECONDARY_VM_EXIT_LOAD_IA32_FRED BIT_ULL(1)
+
#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff
#define VM_ENTRY_LOAD_DEBUG_CONTROLS 0x00000004
@@ -120,6 +123,7 @@
#define VM_ENTRY_LOAD_BNDCFGS 0x00010000
#define VM_ENTRY_PT_CONCEAL_PIP 0x00020000
#define VM_ENTRY_LOAD_IA32_RTIT_CTL 0x00040000
+#define VM_ENTRY_LOAD_IA32_FRED 0x00800000
#define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x000011ff
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f1348b140e7c..e38545d0dd17 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2634,12 +2634,15 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
{ VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
{ VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
{ VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
+ { VM_ENTRY_LOAD_IA32_FRED, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS },
};
struct {
u32 entry_control;
u64 exit_control;
} const vmcs_entry_exit2_pairs[] = {
+ { VM_ENTRY_LOAD_IA32_FRED,
+ SECONDARY_VM_EXIT_SAVE_IA32_FRED | SECONDARY_VM_EXIT_LOAD_IA32_FRED },
};
memset(vmcs_conf, 0, sizeof(*vmcs_conf));
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d0e026390d40..d53904db5d1a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -486,7 +486,8 @@ static inline u8 vmx_get_rvi(void)
VM_ENTRY_LOAD_IA32_EFER | \
VM_ENTRY_LOAD_BNDCFGS | \
VM_ENTRY_PT_CONCEAL_PIP | \
- VM_ENTRY_LOAD_IA32_RTIT_CTL)
+ VM_ENTRY_LOAD_IA32_RTIT_CTL | \
+ VM_ENTRY_LOAD_IA32_FRED)
#define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
(VM_EXIT_SAVE_DEBUG_CONTROLS | \
@@ -512,7 +513,9 @@ static inline u8 vmx_get_rvi(void)
VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
#define KVM_REQUIRED_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
-#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS \
+ (SECONDARY_VM_EXIT_SAVE_IA32_FRED | \
+ SECONDARY_VM_EXIT_LOAD_IA32_FRED)
#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
(PIN_BASED_EXT_INTR_MASK | \
--
2.48.1
* [PATCH v4 03/19] KVM: VMX: Disable FRED if FRED consistency checks fail
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 01/19] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 02/19] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-06-24 15:20 ` Sean Christopherson
2025-03-28 17:11 ` [PATCH v4 04/19] x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use Xin Li (Intel)
` (16 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Do not virtualize FRED if the FRED consistency checks fail.
This can happen on broken hardware, or when KVM runs on top of another
hypervisor that does not yet implement nested FRED correctly.
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
---
Change in v4:
* Call out the reason why not check FRED VM-exit controls in
cpu_has_vmx_fred() (Chao Gao).
---
arch/x86/kvm/vmx/capabilities.h | 11 +++++++++++
arch/x86/kvm/vmx/vmx.c | 3 +++
2 files changed, 14 insertions(+)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index b2aefee59395..b4f49a4690ca 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -400,6 +400,17 @@ static inline bool vmx_pebs_supported(void)
return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
}
+static inline bool cpu_has_vmx_fred(void)
+{
+ /*
+ * setup_vmcs_config() guarantees FRED VM-entry/exit controls
+ * are either all set or none. So, no need to check FRED VM-exit
+ * controls.
+ */
+ return cpu_feature_enabled(X86_FEATURE_FRED) &&
+ (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED);
+}
+
static inline bool cpu_has_notify_vmexit(void)
{
return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e38545d0dd17..ab84939ace96 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8052,6 +8052,9 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
}
+ if (!cpu_has_vmx_fred())
+ kvm_cpu_cap_clear(X86_FEATURE_FRED);
+
if (!enable_pmu)
kvm_cpu_cap_clear(X86_FEATURE_PDCM);
kvm_caps.supported_perf_cap = vmx_get_perf_capabilities();
--
2.48.1
* [PATCH v4 04/19] x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (2 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 03/19] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-04-10 8:53 ` Christoph Hellwig
2025-03-28 17:11 ` [PATCH v4 05/19] KVM: VMX: Initialize VMCS FRED fields Xin Li (Intel)
` (15 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
The per CPU array 'cea_exception_stacks' points to per CPU stacks
for #DB, NMI and #DF. It is normally referenced via the macro
__this_cpu_ist_top_va().
FRED introduced new fields in the host-state area of the VMCS for
stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
corresponding to per CPU stacks for #DB, NMI and #DF. KVM must
populate these each time a vCPU is loaded onto a CPU.
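The stack-level-to-IST mapping described above can be sketched as
follows (a conceptual sketch with hypothetical structure and helper
names, not the actual KVM code; the real implementation writes the
per-CPU stack tops into the HOST_IA32_FRED_RSP[123] VMCS fields on
vCPU load):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Conceptual sketch: FRED host stack levels 1..3 correspond to the
 * kernel's per-CPU IST stacks for #DB, NMI and #DF respectively, so
 * each CPU's stack tops must be (re)loaded when a vCPU migrates to it.
 * All names below are hypothetical.
 */
enum ist_stack { IST_DB, IST_NMI, IST_DF, IST_NR };

struct cpu_stacks {
	uint64_t ist_top[IST_NR];	/* stand-in for cea_exception_stacks */
};

struct vmcs_host {
	uint64_t fred_rsp[4];		/* indices 1..3 used, like RSP[123] */
};

static void load_fred_host_rsps(struct vmcs_host *vmcs,
				const struct cpu_stacks *cpu)
{
	vmcs->fred_rsp[1] = cpu->ist_top[IST_DB];	/* stack level 1: #DB */
	vmcs->fred_rsp[2] = cpu->ist_top[IST_NMI];	/* stack level 2: NMI */
	vmcs->fred_rsp[3] = cpu->ist_top[IST_DF];	/* stack level 3: #DF */
}
```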
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change in v4:
* Rewrite the change log and add comments to the export (Dave Hansen).
---
arch/x86/mm/cpu_entry_area.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 575f863f3c75..bc0d687de376 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -17,6 +17,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
#ifdef CONFIG_X86_64
static DEFINE_PER_CPU_PAGE_ALIGNED(struct exception_stacks, exception_stacks);
DEFINE_PER_CPU(struct cea_exception_stacks*, cea_exception_stacks);
+/*
+ * FRED introduced new fields in the host-state area of the VMCS for
+ * stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
+ * corresponding to per CPU stacks for #DB, NMI and #DF. KVM must
+ * populate these each time a vCPU is loaded onto a CPU.
+ */
+EXPORT_PER_CPU_SYMBOL(cea_exception_stacks);
static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, _cea_offset);
--
2.48.1
* [PATCH v4 05/19] KVM: VMX: Initialize VMCS FRED fields
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (3 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 04/19] x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 06/19] KVM: VMX: Set FRED MSR interception Xin Li (Intel)
` (14 subsequent siblings)
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Initialize the host VMCS FRED fields with the host FRED MSRs' values
and the guest VMCS FRED fields to 0.
FRED CPU state is managed in 9 new FRED MSRs:
IA32_FRED_CONFIG,
IA32_FRED_STKLVLS,
IA32_FRED_RSP0,
IA32_FRED_RSP1,
IA32_FRED_RSP2,
IA32_FRED_RSP3,
IA32_FRED_SSP1,
IA32_FRED_SSP2,
IA32_FRED_SSP3,
as well as a few existing CPU registers and MSRs:
CR4.FRED,
IA32_STAR,
IA32_KERNEL_GS_BASE,
IA32_PL0_SSP (also known as IA32_FRED_SSP0).
CR4, IA32_KERNEL_GS_BASE and IA32_STAR are already well managed.
Except for IA32_FRED_RSP0 and IA32_FRED_SSP0, all other FRED CPU state
MSRs have corresponding VMCS fields in both the host-state and
guest-state areas. So KVM just needs to initialize them, and with
proper VM entry/exit FRED controls, a FRED CPU will automatically keep
tracking host and guest FRED CPU state in the VMCS.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change in v4:
* Initialize host SSP[1-3] to 0s in vmx_set_constant_host_state()
because Linux doesn't support kernel shadow stacks (Chao Gao).
Change in v3:
* Use structure kvm_host_values to keep host fred config & stack levels
(Sean Christopherson).
Changes in v2:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() to decouple
KVM's capability to virtualize a feature and host's enabling of a
feature (Chao Gao).
* Move guest FRED state init into __vmx_vcpu_reset() (Chao Gao).
---
arch/x86/include/asm/vmx.h | 32 ++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 36 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.h | 3 +++
3 files changed, 71 insertions(+)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 5598517617a5..8a2b097aadf2 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -289,12 +289,44 @@ enum vmcs_field {
GUEST_BNDCFGS_HIGH = 0x00002813,
GUEST_IA32_RTIT_CTL = 0x00002814,
GUEST_IA32_RTIT_CTL_HIGH = 0x00002815,
+ GUEST_IA32_FRED_CONFIG = 0x0000281a,
+ GUEST_IA32_FRED_CONFIG_HIGH = 0x0000281b,
+ GUEST_IA32_FRED_RSP1 = 0x0000281c,
+ GUEST_IA32_FRED_RSP1_HIGH = 0x0000281d,
+ GUEST_IA32_FRED_RSP2 = 0x0000281e,
+ GUEST_IA32_FRED_RSP2_HIGH = 0x0000281f,
+ GUEST_IA32_FRED_RSP3 = 0x00002820,
+ GUEST_IA32_FRED_RSP3_HIGH = 0x00002821,
+ GUEST_IA32_FRED_STKLVLS = 0x00002822,
+ GUEST_IA32_FRED_STKLVLS_HIGH = 0x00002823,
+ GUEST_IA32_FRED_SSP1 = 0x00002824,
+ GUEST_IA32_FRED_SSP1_HIGH = 0x00002825,
+ GUEST_IA32_FRED_SSP2 = 0x00002826,
+ GUEST_IA32_FRED_SSP2_HIGH = 0x00002827,
+ GUEST_IA32_FRED_SSP3 = 0x00002828,
+ GUEST_IA32_FRED_SSP3_HIGH = 0x00002829,
HOST_IA32_PAT = 0x00002c00,
HOST_IA32_PAT_HIGH = 0x00002c01,
HOST_IA32_EFER = 0x00002c02,
HOST_IA32_EFER_HIGH = 0x00002c03,
HOST_IA32_PERF_GLOBAL_CTRL = 0x00002c04,
HOST_IA32_PERF_GLOBAL_CTRL_HIGH = 0x00002c05,
+ HOST_IA32_FRED_CONFIG = 0x00002c08,
+ HOST_IA32_FRED_CONFIG_HIGH = 0x00002c09,
+ HOST_IA32_FRED_RSP1 = 0x00002c0a,
+ HOST_IA32_FRED_RSP1_HIGH = 0x00002c0b,
+ HOST_IA32_FRED_RSP2 = 0x00002c0c,
+ HOST_IA32_FRED_RSP2_HIGH = 0x00002c0d,
+ HOST_IA32_FRED_RSP3 = 0x00002c0e,
+ HOST_IA32_FRED_RSP3_HIGH = 0x00002c0f,
+ HOST_IA32_FRED_STKLVLS = 0x00002c10,
+ HOST_IA32_FRED_STKLVLS_HIGH = 0x00002c11,
+ HOST_IA32_FRED_SSP1 = 0x00002c12,
+ HOST_IA32_FRED_SSP1_HIGH = 0x00002c13,
+ HOST_IA32_FRED_SSP2 = 0x00002c14,
+ HOST_IA32_FRED_SSP2_HIGH = 0x00002c15,
+ HOST_IA32_FRED_SSP3 = 0x00002c16,
+ HOST_IA32_FRED_SSP3_HIGH = 0x00002c17,
PIN_BASED_VM_EXEC_CONTROL = 0x00004000,
CPU_BASED_VM_EXEC_CONTROL = 0x00004002,
EXCEPTION_BITMAP = 0x00004004,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ab84939ace96..ac6aa2d091c3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1505,6 +1505,15 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
(unsigned long)(cpu_entry_stack(cpu) + 1));
}
+ /* Per-CPU FRED MSRs */
+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+#ifdef CONFIG_X86_64
+ vmcs_write64(HOST_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
+ vmcs_write64(HOST_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
+ vmcs_write64(HOST_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
+#endif
+ }
+
vmx->loaded_vmcs->cpu = cpu;
}
}
@@ -4388,6 +4397,17 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
*/
vmcs_write16(HOST_DS_SELECTOR, 0);
vmcs_write16(HOST_ES_SELECTOR, 0);
+
+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+ /* FRED CONFIG and STKLVLS are the same on all CPUs */
+ vmcs_write64(HOST_IA32_FRED_CONFIG, kvm_host.fred_config);
+ vmcs_write64(HOST_IA32_FRED_STKLVLS, kvm_host.fred_stklvls);
+
+ /* Linux doesn't support kernel shadow stacks, thus SSPs are 0s */
+ vmcs_write64(HOST_IA32_FRED_SSP1, 0);
+ vmcs_write64(HOST_IA32_FRED_SSP2, 0);
+ vmcs_write64(HOST_IA32_FRED_SSP3, 0);
+ }
#else
vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS); /* 22.2.4 */
@@ -4889,6 +4909,17 @@ static void init_vmcs(struct vcpu_vmx *vmx)
}
vmx_setup_uret_msrs(vmx);
+
+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+ vmcs_write64(GUEST_IA32_FRED_CONFIG, 0);
+ vmcs_write64(GUEST_IA32_FRED_RSP1, 0);
+ vmcs_write64(GUEST_IA32_FRED_RSP2, 0);
+ vmcs_write64(GUEST_IA32_FRED_RSP3, 0);
+ vmcs_write64(GUEST_IA32_FRED_STKLVLS, 0);
+ vmcs_write64(GUEST_IA32_FRED_SSP1, 0);
+ vmcs_write64(GUEST_IA32_FRED_SSP2, 0);
+ vmcs_write64(GUEST_IA32_FRED_SSP3, 0);
+ }
}
static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
@@ -8705,6 +8736,11 @@ __init int vmx_hardware_setup(void)
kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler);
+ if (kvm_cpu_cap_has(X86_FEATURE_FRED)) {
+ rdmsrl(MSR_IA32_FRED_CONFIG, kvm_host.fred_config);
+ rdmsrl(MSR_IA32_FRED_STKLVLS, kvm_host.fred_stklvls);
+ }
+
return r;
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9dc32a409076..02514f5b9c0b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -46,6 +46,9 @@ struct kvm_host_values {
u64 xcr0;
u64 xss;
u64 arch_capabilities;
+
+ u64 fred_config;
+ u64 fred_stklvls;
};
void kvm_spurious_fault(void);
--
2.48.1
* [PATCH v4 06/19] KVM: VMX: Set FRED MSR interception
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (4 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 05/19] KVM: VMX: Initialize VMCS FRED fields Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-06-24 15:27 ` Sean Christopherson
2025-03-28 17:11 ` [PATCH v4 07/19] KVM: VMX: Save/restore guest FRED RSP0 Xin Li (Intel)
` (13 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Add the FRED MSRs to the VMX passthrough MSR list and set up FRED MSR
interception.
8 FRED MSRs, i.e., MSR_IA32_FRED_RSP[123], MSR_IA32_FRED_STKLVLS,
MSR_IA32_FRED_SSP[123] and MSR_IA32_FRED_CONFIG, are all safe to
pass through, because they all have a pair of corresponding host
and guest VMCS fields.
Both MSR_IA32_FRED_RSP0 and MSR_IA32_FRED_SSP0 are dedicated to
userspace event delivery only, IOW they are NOT used in any kernel
event delivery or in the execution of ERETS. Thus KVM can run safely
with guest values in these 2 MSRs. As a result, saving and restoring
their guest values is deferred until vCPU context switch, and their
host values are restored when the host returns to userspace.
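The deferral described above can be sketched as follows (a conceptual
sketch with hypothetical helper names and a simulated MSR, not the
actual KVM code): because the CPU never consumes FRED_RSP0 at CPL 0,
the guest value may stay live in the MSR across VM exits and only needs
swapping at vCPU context switch or before returning to host userspace.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the real IA32_FRED_RSP0 MSR; purely illustrative. */
static uint64_t hw_fred_rsp0;

static uint64_t rdmsr_fred_rsp0(void)       { return hw_fred_rsp0; }
static void     wrmsr_fred_rsp0(uint64_t v) { hw_fred_rsp0 = v; }

struct vcpu_state {
	uint64_t guest_fred_rsp0;
};

/* Called when switching away from a vCPU: latch the live guest value. */
static void save_guest_fred_rsp0(struct vcpu_state *v)
{
	v->guest_fred_rsp0 = rdmsr_fred_rsp0();
}

/* Called before re-entering the guest after a context switch. */
static void restore_guest_fred_rsp0(const struct vcpu_state *v)
{
	wrmsr_fred_rsp0(v->guest_fred_rsp0);
}

/* Called before the host returns to its own userspace. */
static void restore_host_fred_rsp0(uint64_t host_val)
{
	wrmsr_fred_rsp0(host_val);
}
```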
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/vmx.c | 40 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 2 +-
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ac6aa2d091c3..236fe5428a74 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -176,6 +176,16 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_FS_BASE,
MSR_GS_BASE,
MSR_KERNEL_GS_BASE,
+ MSR_IA32_FRED_RSP0,
+ MSR_IA32_FRED_RSP1,
+ MSR_IA32_FRED_RSP2,
+ MSR_IA32_FRED_RSP3,
+ MSR_IA32_FRED_STKLVLS,
+ MSR_IA32_FRED_SSP1,
+ MSR_IA32_FRED_SSP2,
+ MSR_IA32_FRED_SSP3,
+ MSR_IA32_FRED_CONFIG,
+ MSR_IA32_FRED_SSP0, /* Should be added through CET */
MSR_IA32_XFD,
MSR_IA32_XFD_ERR,
#endif
@@ -7935,6 +7945,34 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}
+static void vmx_set_intercept_for_fred_msr(struct kvm_vcpu *vcpu)
+{
+ bool flag = !guest_cpu_cap_has(vcpu, X86_FEATURE_FRED);
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP1, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP2, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP3, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_STKLVLS, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP1, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP2, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP3, MSR_TYPE_RW, flag);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_CONFIG, MSR_TYPE_RW, flag);
+
+ /*
+	 * IA32_FRED_RSP0 and IA32_PL0_SSP (a.k.a. IA32_FRED_SSP0) are used only
+	 * for event delivery while running in userspace. Since KVM always runs
+	 * in kernel mode (the CPL is always 0 after any VM exit), KVM can run
+	 * safely with the guest's IA32_FRED_RSP0 and IA32_PL0_SSP values loaded.
+	 *
+	 * As a result, there is no need to intercept IA32_FRED_RSP0 or IA32_PL0_SSP.
+	 *
+	 * Note, save/restore of IA32_PL0_SSP belongs to CET supervisor context
+	 * management regardless of whether FRED is enabled, so leave its state
+	 * management to the CET code.
+ */
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, flag);
+}
+
void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -8007,6 +8045,8 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
+
+ vmx_set_intercept_for_fred_msr(vcpu);
}
static __init u64 vmx_get_perf_capabilities(void)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d53904db5d1a..f48791cf6aa6 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -356,7 +356,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;
/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 16
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 26
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
--
2.48.1
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v4 07/19] KVM: VMX: Save/restore guest FRED RSP0
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (5 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 06/19] KVM: VMX: Set FRED MSR interception Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-06-24 15:44 ` Sean Christopherson
2025-03-28 17:11 ` [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore Xin Li (Intel)
` (12 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Save guest FRED RSP0 in vmx_prepare_switch_to_host() and restore it
in vmx_prepare_switch_to_guest(), because MSR_IA32_FRED_RSP0 is passed
through to the guest and its value is therefore volatile/unknown to KVM.
Note, host FRED RSP0 is restored in arch_exit_to_user_mode_prepare(),
regardless of whether it is modified in KVM.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Changes in v3:
* KVM only needs to save/restore guest FRED RSP0 now as host FRED RSP0
is restored in arch_exit_to_user_mode_prepare() (Sean Christopherson).
Changes in v2:
* Don't use guest_cpuid_has() in vmx_prepare_switch_to_{host,guest}(),
which are called from IRQ-disabled context (Chao Gao).
* Reset msr_guest_fred_rsp0 in __vmx_vcpu_reset() (Chao Gao).
---
arch/x86/kvm/vmx/vmx.c | 9 +++++++++
arch/x86/kvm/vmx/vmx.h | 1 +
2 files changed, 10 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 236fe5428a74..1fd32aa255f9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1349,6 +1349,10 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
}
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+
+ if (cpu_feature_enabled(X86_FEATURE_FRED) && guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+ wrmsrns(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
+
#else
savesegment(fs, fs_sel);
savesegment(gs, gs_sel);
@@ -1393,6 +1397,11 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
invalidate_tss_limit();
#ifdef CONFIG_X86_64
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
+
+ if (cpu_feature_enabled(X86_FEATURE_FRED) && guest_cpu_cap_has(&vmx->vcpu, X86_FEATURE_FRED)) {
+ vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
+ fred_sync_rsp0(vmx->msr_guest_fred_rsp0);
+ }
#endif
load_fixmap_gdt(raw_smp_processor_id());
vmx->guest_state_loaded = false;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index f48791cf6aa6..8e27b7cc700d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -276,6 +276,7 @@ struct vcpu_vmx {
#ifdef CONFIG_X86_64
u64 msr_host_kernel_gs_base;
u64 msr_guest_kernel_gs_base;
+ u64 msr_guest_fred_rsp0;
#endif
u64 spec_ctrl;
--
2.48.1
* [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (6 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 07/19] KVM: VMX: Save/restore guest FRED RSP0 Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-06-24 16:27 ` Sean Christopherson
2025-03-28 17:11 ` [PATCH v4 09/19] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU Xin Li (Intel)
` (11 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Handle FRED MSR access requests, allowing FRED context to be read and
written from both the host and the guest.
During VM save/restore and live migration, FRED context needs to be
saved/restored, which requires FRED MSRs to be accessed from userspace,
e.g., by QEMU.
Note, handling of MSR_IA32_FRED_SSP0, i.e., MSR_IA32_PL0_SSP, is not
added yet; it is done in the KVM CET patch set.
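As a rough sketch of the userspace side described above, a VMM could read the guest FRED MSRs through the KVM_GET_MSRS vCPU ioctl. This is illustrative only: the struct below mirrors the uapi `struct kvm_msr_entry` layout rather than including `<linux/kvm.h>`, the 0x1cc base index is an assumption, and `fill_fred_msr_list()` is a made-up helper. It relies on the FRED MSR indices being contiguous from RSP0 through CONFIG, which the `MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG` range cases in the patch also assume.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the uapi struct kvm_msr_entry layout. */
struct kvm_msr_entry {
	uint32_t index;
	uint32_t reserved;
	uint64_t data;
};

/* Assumed MSR index of IA32_FRED_RSP0; the eight MSRs that follow it
 * (RSP1..RSP3, STKLVLS, SSP1..SSP3, CONFIG) are treated as contiguous. */
#define MSR_IA32_FRED_RSP0	0x1cc
#define NR_FRED_MSRS		9

/* Fill an entry array with the FRED MSR indices; a VMM would embed these
 * entries in a struct kvm_msrs buffer and issue
 * ioctl(vcpu_fd, KVM_GET_MSRS, buf) to read the guest values, then
 * KVM_SET_MSRS on the destination to restore them. */
static int fill_fred_msr_list(struct kvm_msr_entry *entries)
{
	for (int i = 0; i < NR_FRED_MSRS; i++) {
		entries[i].index = MSR_IA32_FRED_RSP0 + i;
		entries[i].reserved = 0;
		entries[i].data = 0;
	}
	return NR_FRED_MSRS;
}
```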
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Changes since v2:
* Add a helper to convert FRED MSR index to VMCS field encoding to
make the code more compact (Chao Gao).
* Get rid of the "host_initiated" check because userspace has to set
CPUID before MSRs (Chao Gao & Sean Christopherson).
* Address a few cleanup comments (Sean Christopherson).
Changes since v1:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
* Fail host requested FRED MSRs access if KVM cannot virtualize FRED
(Chao Gao).
* Handle the case FRED MSRs are valid but KVM cannot virtualize FRED
(Chao Gao).
* Add sanity checks when writing to FRED MSRs.
---
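The write-side sanity checks added in this patch can be sketched as pure logic. The bit masks are taken from the patch itself; the enum values and the `fred_msr_write_ok()` helper are made up for illustration, and the canonicality check on address-holding MSRs is omitted because it depends on vCPU state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n)		(1ULL << (n))
/* GENMASK_ULL(h, l): bits h..l set, as in the kernel. */
#define GENMASK_ULL(h, l)	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* Illustration-only ids standing in for the contiguous index range
 * MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG. */
enum fred_msr {
	FRED_RSP0, FRED_RSP1, FRED_RSP2, FRED_RSP3,
	FRED_STKLVLS,
	FRED_SSP1, FRED_SSP2, FRED_SSP3,
	FRED_CONFIG,
};

/* Checks mirrored from the patch's __kvm_set_msr() changes:
 *  - IA32_FRED_CONFIG: bit 11, bits 5:4, and bit 2 are reserved;
 *  - RSP0..RSP3: bits 5:0 must be zero (64-byte aligned);
 *  - SSP1..SSP3: bits 2:0 must be zero (8-byte aligned). */
static bool fred_msr_write_ok(enum fred_msr msr, uint64_t data)
{
	if (msr == FRED_CONFIG &&
	    (data & (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2))))
		return false;
	if (msr >= FRED_RSP0 && msr <= FRED_RSP3 && (data & GENMASK_ULL(5, 0)))
		return false;
	if (msr >= FRED_SSP1 && msr <= FRED_SSP3 && (data & GENMASK_ULL(2, 0)))
		return false;
	return true;
}
```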
arch/x86/kvm/vmx/vmx.c | 48 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 28 ++++++++++++++++++++++++
2 files changed, 76 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1fd32aa255f9..ae9712624413 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1426,6 +1426,24 @@ static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
preempt_enable();
vmx->msr_guest_kernel_gs_base = data;
}
+
+static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx)
+{
+ preempt_disable();
+ if (vmx->guest_state_loaded)
+ vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
+ preempt_enable();
+ return vmx->msr_guest_fred_rsp0;
+}
+
+static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
+{
+ preempt_disable();
+ if (vmx->guest_state_loaded)
+ wrmsrns(MSR_IA32_FRED_RSP0, data);
+ preempt_enable();
+ vmx->msr_guest_fred_rsp0 = data;
+}
#endif
static void grow_ple_window(struct kvm_vcpu *vcpu)
@@ -2039,6 +2057,24 @@ int vmx_get_feature_msr(u32 msr, u64 *data)
}
}
+#ifdef CONFIG_X86_64
+static u32 fred_msr_vmcs_fields[] = {
+ GUEST_IA32_FRED_RSP1,
+ GUEST_IA32_FRED_RSP2,
+ GUEST_IA32_FRED_RSP3,
+ GUEST_IA32_FRED_STKLVLS,
+ GUEST_IA32_FRED_SSP1,
+ GUEST_IA32_FRED_SSP2,
+ GUEST_IA32_FRED_SSP3,
+ GUEST_IA32_FRED_CONFIG,
+};
+
+static u32 fred_msr_to_vmcs(u32 msr)
+{
+ return fred_msr_vmcs_fields[msr - MSR_IA32_FRED_RSP1];
+}
+#endif
+
/*
* Reads an msr value (of 'msr_info->index') into 'msr_info->data'.
* Returns 0 on success, non-0 otherwise.
@@ -2061,6 +2097,12 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_KERNEL_GS_BASE:
msr_info->data = vmx_read_guest_kernel_gs_base(vmx);
break;
+ case MSR_IA32_FRED_RSP0:
+ msr_info->data = vmx_read_guest_fred_rsp0(vmx);
+ break;
+ case MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG:
+ msr_info->data = vmcs_read64(fred_msr_to_vmcs(msr_info->index));
+ break;
#endif
case MSR_EFER:
return kvm_get_msr_common(vcpu, msr_info);
@@ -2268,6 +2310,12 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmx_update_exception_bitmap(vcpu);
}
break;
+ case MSR_IA32_FRED_RSP0:
+ vmx_write_guest_fred_rsp0(vmx, data);
+ break;
+ case MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG:
+ vmcs_write64(fred_msr_to_vmcs(msr_index), data);
+ break;
#endif
case MSR_IA32_SYSENTER_CS:
if (is_guest_mode(vcpu))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c841817a914a..007577143337 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -318,6 +318,9 @@ static const u32 msrs_to_save_base[] = {
MSR_STAR,
#ifdef CONFIG_X86_64
MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
+ MSR_IA32_FRED_RSP0, MSR_IA32_FRED_RSP1, MSR_IA32_FRED_RSP2,
+ MSR_IA32_FRED_RSP3, MSR_IA32_FRED_STKLVLS, MSR_IA32_FRED_SSP1,
+ MSR_IA32_FRED_SSP2, MSR_IA32_FRED_SSP3, MSR_IA32_FRED_CONFIG,
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEAT_CTL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
@@ -1849,6 +1852,23 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
data = (u32)data;
break;
+ case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+ if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+ return 1;
+
+		/* Bit 11, bits 5:4, and bit 2 of IA32_FRED_CONFIG must be zero */
+ if (index == MSR_IA32_FRED_CONFIG && data & (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2)))
+ return 1;
+ if (index != MSR_IA32_FRED_STKLVLS && is_noncanonical_msr_address(data, vcpu))
+ return 1;
+ if ((index >= MSR_IA32_FRED_RSP0 && index <= MSR_IA32_FRED_RSP3) &&
+ (data & GENMASK_ULL(5, 0)))
+ return 1;
+ if ((index >= MSR_IA32_FRED_SSP1 && index <= MSR_IA32_FRED_SSP3) &&
+ (data & GENMASK_ULL(2, 0)))
+ return 1;
+
+ break;
}
msr.data = data;
@@ -1893,6 +1913,10 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
!guest_cpu_cap_has(vcpu, X86_FEATURE_RDPID))
return 1;
break;
+ case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+ if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
+ return 1;
+ break;
}
msr.index = index;
@@ -7455,6 +7479,10 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
return;
break;
+ case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+ if (!kvm_cpu_cap_has(X86_FEATURE_FRED))
+ return;
+ break;
default:
break;
}
--
2.48.1
* [PATCH v4 09/19] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (7 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 10/19] KVM: VMX: Virtualize FRED event_data Xin Li (Intel)
` (10 subsequent siblings)
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: removed the "kvm_" prefix from the function name ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/kvm_cache_regs.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 36a8786db291..31b446b6cbd7 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -204,6 +204,21 @@ static __always_inline bool kvm_is_cr4_bit_set(struct kvm_vcpu *vcpu,
return !!kvm_read_cr4_bits(vcpu, cr4_bit);
}
+/*
+ * It's enough to check just CR4.FRED (X86_CR4_FRED) to tell if
+ * a vCPU is running with FRED enabled, because:
+ * 1) CR4.FRED can be set to 1 only _after_ IA32_EFER.LMA = 1.
+ * 2) To leave IA-32e mode, CR4.FRED must be cleared first.
+ */
+static inline bool is_fred_enabled(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_X86_64
+ return kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED);
+#else
+ return false;
+#endif
+}
+
static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
{
if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
--
2.48.1
* [PATCH v4 10/19] KVM: VMX: Virtualize FRED event_data
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (8 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 09/19] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 11/19] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
` (9 subsequent siblings)
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Set injected-event data when injecting a #PF, #DB, or #NM caused
by extended feature disable using FRED event delivery, and save
original-event data so that it can be reused as injected-event data.
Unlike IDT event delivery, which uses extra CPU registers as part of
an event context (e.g., %cr2 for #PF), FRED saves a complete event
context in its stack frame; e.g., FRED saves the faulting linear
address of a #PF into the event data field defined in its stack frame.
Thus a new VMX control field called injected-event data is added
to provide the event data that will be pushed into a FRED stack
frame for VM entries that inject an event using FRED event delivery.
In addition, a new VM exit information field called original-event
data is added to store the event data that would have been saved into
a FRED stack frame for VM exits that occur during FRED event delivery.
After such a VM exit is handled to allow the original-event to be
delivered, the data in the original-event data VMCS field needs to
be set into the injected-event data VMCS field for the injection of
the original event.
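The per-vector event-data selection that this patch adds to kvm_deliver_exception_payload() can be sketched as a standalone helper. The vector numbers and the BIT(12) masking follow the patch; the free-standing `fred_event_data()` function itself is illustrative, since the real code writes into the queued-exception struct instead of returning a value.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n)		(1ULL << (n))

#define DB_VECTOR	1
#define NM_VECTOR	7
#define PF_VECTOR	14

/* FRED event data derived from an exception payload:
 *  - #DB: matches DR6 but follows the polarity of VMX's pending debug
 *    exceptions, so bit 12 is cleared;
 *  - #NM: the IA32_XFD_ERR value;
 *  - #PF: the faulting linear address (also mirrored into CR2);
 *  - any other vector carries no event data. */
static uint64_t fred_event_data(unsigned int vector, uint64_t payload)
{
	switch (vector) {
	case DB_VECTOR:
		return payload & ~BIT(12);
	case NM_VECTOR:
	case PF_VECTOR:
		return payload;
	default:
		return 0;
	}
}
```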
Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: reworked event data injection for nested ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change in v3:
* Rework event data injection for nested (Chao Gao & Sean Christopherson).
Changes in v2:
* Document event data should be equal to CR2/DR6/IA32_XFD_ERR instead
of using WARN_ON() (Chao Gao).
* Zero event data if a #NM was not caused by extended feature disable
(Chao Gao).
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/include/asm/vmx.h | 4 ++++
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 22 ++++++++++++++++++----
arch/x86/kvm/x86.c | 16 +++++++++++++++-
5 files changed, 40 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a884ab544335..85b6713702d2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -741,6 +741,7 @@ struct kvm_queued_exception {
u32 error_code;
unsigned long payload;
bool has_payload;
+ u64 event_data;
};
/*
@@ -2168,7 +2169,7 @@ void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
- bool has_error_code, u32 error_code);
+ bool has_error_code, u32 error_code, u64 event_data);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
struct x86_exception *fault);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8a2b097aadf2..1f20a28c9262 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -265,8 +265,12 @@ enum vmcs_field {
PID_POINTER_TABLE_HIGH = 0x00002043,
SECONDARY_VM_EXIT_CONTROLS = 0x00002044,
SECONDARY_VM_EXIT_CONTROLS_HIGH = 0x00002045,
+ INJECTED_EVENT_DATA = 0x00002052,
+ INJECTED_EVENT_DATA_HIGH = 0x00002053,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x00002401,
+ ORIGINAL_EVENT_DATA = 0x00002404,
+ ORIGINAL_EVENT_DATA_HIGH = 0x00002405,
VMCS_LINK_POINTER = 0x00002800,
VMCS_LINK_POINTER_HIGH = 0x00002801,
GUEST_IA32_DEBUGCTL = 0x00002802,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d5d0c5c3300b..73bde84ca9a4 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4172,7 +4172,7 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
kvm_requeue_exception(vcpu, vector,
exitintinfo & SVM_EXITINTINFO_VALID_ERR,
- error_code);
+ error_code, 0);
break;
}
case SVM_EXITINTINFO_TYPE_INTR:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ae9712624413..ae6d275aab6a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1918,6 +1918,9 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
+ if (is_fred_enabled(vcpu))
+ vmcs_write64(INJECTED_EVENT_DATA, ex->event_data);
+
vmx_clear_hlt(vcpu);
}
@@ -7295,7 +7298,8 @@ static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
u32 idt_vectoring_info,
int instr_len_field,
- int error_code_field)
+ int error_code_field,
+ int event_data_field)
{
u8 vector;
int type;
@@ -7330,13 +7334,17 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
fallthrough;
case INTR_TYPE_HARD_EXCEPTION: {
u32 error_code = 0;
+ u64 event_data = 0;
if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
error_code = vmcs_read32(error_code_field);
+ if (is_fred_enabled(vcpu))
+ event_data = vmcs_read64(event_data_field);
kvm_requeue_exception(vcpu, vector,
idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
- error_code);
+ error_code,
+ event_data);
break;
}
case INTR_TYPE_SOFT_INTR:
@@ -7354,7 +7362,8 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
{
__vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
VM_EXIT_INSTRUCTION_LEN,
- IDT_VECTORING_ERROR_CODE);
+ IDT_VECTORING_ERROR_CODE,
+ ORIGINAL_EVENT_DATA);
}
void vmx_cancel_injection(struct kvm_vcpu *vcpu)
@@ -7362,7 +7371,8 @@ void vmx_cancel_injection(struct kvm_vcpu *vcpu)
__vmx_complete_interrupts(vcpu,
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
VM_ENTRY_INSTRUCTION_LEN,
- VM_ENTRY_EXCEPTION_ERROR_CODE);
+ VM_ENTRY_EXCEPTION_ERROR_CODE,
+ INJECTED_EVENT_DATA);
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
}
@@ -7493,6 +7503,10 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
vmx_disable_fb_clear(vmx);
+ /*
+ * Note, even though FRED delivers the faulting linear address via the
+ * event data field on the stack, CR2 is still updated.
+ */
if (vcpu->arch.cr2 != native_read_cr2())
native_write_cr2(vcpu->arch.cr2);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 007577143337..d1d42926ac67 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -777,9 +777,22 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
* breakpoint), it is reserved and must be zero in DR6.
*/
vcpu->arch.dr6 &= ~BIT(12);
+
+ /*
+ * FRED #DB event data matches DR6, but follows the polarity of
+ * VMX's pending debug exceptions, not DR6.
+ */
+ ex->event_data = ex->payload & ~BIT(12);
+ break;
+ case NM_VECTOR:
+ ex->event_data = ex->payload;
break;
case PF_VECTOR:
vcpu->arch.cr2 = ex->payload;
+ ex->event_data = ex->payload;
+ break;
+ default:
+ ex->event_data = 0;
break;
}
@@ -887,7 +900,7 @@ static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
}
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
- bool has_error_code, u32 error_code)
+ bool has_error_code, u32 error_code, u64 event_data)
{
/*
@@ -912,6 +925,7 @@ void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.error_code = error_code;
vcpu->arch.exception.has_payload = false;
vcpu->arch.exception.payload = 0;
+ vcpu->arch.exception.event_data = event_data;
}
EXPORT_SYMBOL_GPL(kvm_requeue_exception);
--
2.48.1
* [PATCH v4 11/19] KVM: VMX: Virtualize FRED nested exception tracking
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (9 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 10/19] KVM: VMX: Virtualize FRED event_data Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 12/19] KVM: x86: Save/restore the nested flag of an exception Xin Li (Intel)
` (8 subsequent siblings)
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Set the VMX nested exception bit in VM-entry interruption information
field when injecting a nested exception using FRED event delivery to
ensure:
1) A nested exception is injected on a correct stack level.
2) The nested bit defined in FRED stack frame is set.
The event stack level used by FRED event delivery depends on whether
the event was a nested exception encountered during delivery of an
earlier event, because a nested exception is "regarded" as happening
on ring 0. E.g., when #PF is configured to use stack level 1 in
IA32_FRED_STKLVLS MSR:
- nested #PF will be delivered on the stack pointed by IA32_FRED_RSP1
MSR when encountered in ring 3 and ring 0.
- normal #PF will be delivered on the stack pointed by IA32_FRED_RSP0
MSR when encountered in ring 3.
The VMX nested-exception support ensures a correct event stack level is
chosen when a VM entry injects a nested exception.
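The way vmx_inject_exception() composes the VM-entry interruption-information field for a hardware exception can be sketched as follows. The field encodings are taken from the patch and asm/vmx.h; the `build_entry_intr_info()` helper and its flat parameter list are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field encodings from asm/vmx.h and this patch. */
#define INTR_INFO_VECTOR_MASK		0xff
#define INTR_TYPE_HARD_EXCEPTION	(3 << 8)
#define INTR_INFO_DELIVER_CODE_MASK	0x800
#define INTR_INFO_NESTED_EXCEPTION_MASK	0x2000	/* bit 13, new with FRED */
#define INTR_INFO_VALID_MASK		0x80000000

/* Sketch of how VM_ENTRY_INTR_INFO_FIELD is built for a hardware
 * exception: the nested-exception bit is set only when the queued
 * exception is nested AND the vCPU is running with FRED enabled
 * (CR4.FRED = 1), matching the is_fred_enabled() check in the patch. */
static uint32_t build_entry_intr_info(unsigned int vector, int has_error_code,
				      int nested, int fred_enabled)
{
	uint32_t info = (vector & INTR_INFO_VECTOR_MASK) |
			INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK;

	if (has_error_code)
		info |= INTR_INFO_DELIVER_CODE_MASK;
	if (nested && fred_enabled)
		info |= INTR_INFO_NESTED_EXCEPTION_MASK;
	return info;
}
```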
Signed-off-by: Xin Li <xin3.li@intel.com>
[ Sean: reworked kvm_requeue_exception() to simplify the code changes ]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change in v4:
* Move the is_fred_enabled() check from kvm_multiple_exception() to
vmx_inject_exception(), thus avoiding bleeding FRED details into
kvm_multiple_exception() (Chao Gao).
Change in v3:
* Rework kvm_requeue_exception() to simplify the code changes (Sean
Christopherson).
Change in v2:
* Set the nested flag when there is an original interrupt (Chao Gao).
---
arch/x86/include/asm/kvm_host.h | 4 +++-
arch/x86/include/asm/vmx.h | 5 ++++-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 6 +++++-
arch/x86/kvm/x86.c | 13 ++++++++++++-
arch/x86/kvm/x86.h | 1 +
6 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 85b6713702d2..c5f92a1befc0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -741,6 +741,7 @@ struct kvm_queued_exception {
u32 error_code;
unsigned long payload;
bool has_payload;
+ bool nested;
u64 event_data;
};
@@ -2169,7 +2170,8 @@ void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
- bool has_error_code, u32 error_code, u64 event_data);
+ bool has_error_code, u32 error_code, bool nested,
+ u64 event_data);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
struct x86_exception *fault);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 1f20a28c9262..a019a06d21aa 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -137,6 +137,7 @@
#define VMX_BASIC_DUAL_MONITOR_TREATMENT BIT_ULL(49)
#define VMX_BASIC_INOUT BIT_ULL(54)
#define VMX_BASIC_TRUE_CTLS BIT_ULL(55)
+#define VMX_BASIC_NESTED_EXCEPTION BIT_ULL(58)
static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
{
@@ -432,13 +433,15 @@ enum vmcs_field {
#define INTR_INFO_INTR_TYPE_MASK 0x700 /* 10:8 */
#define INTR_INFO_DELIVER_CODE_MASK 0x800 /* 11 */
#define INTR_INFO_UNBLOCK_NMI 0x1000 /* 12 */
+#define INTR_INFO_NESTED_EXCEPTION_MASK 0x2000 /* 13 */
#define INTR_INFO_VALID_MASK 0x80000000 /* 31 */
-#define INTR_INFO_RESVD_BITS_MASK 0x7ffff000
+#define INTR_INFO_RESVD_BITS_MASK 0x7fffd000
#define VECTORING_INFO_VECTOR_MASK INTR_INFO_VECTOR_MASK
#define VECTORING_INFO_TYPE_MASK INTR_INFO_INTR_TYPE_MASK
#define VECTORING_INFO_DELIVER_CODE_MASK INTR_INFO_DELIVER_CODE_MASK
#define VECTORING_INFO_VALID_MASK INTR_INFO_VALID_MASK
+#define VECTORING_INFO_NESTED_EXCEPTION_MASK INTR_INFO_NESTED_EXCEPTION_MASK
#define INTR_TYPE_EXT_INTR (EVENT_TYPE_EXTINT << 8) /* external interrupt */
#define INTR_TYPE_RESERVED (EVENT_TYPE_RESERVED << 8) /* reserved */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 73bde84ca9a4..d96d6cec4a34 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4172,7 +4172,7 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
kvm_requeue_exception(vcpu, vector,
exitintinfo & SVM_EXITINTINFO_VALID_ERR,
- error_code, 0);
+ error_code, false, 0);
break;
}
case SVM_EXITINTINFO_TYPE_INTR:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ae6d275aab6a..c76015e1e3f8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1913,8 +1913,11 @@ void vmx_inject_exception(struct kvm_vcpu *vcpu)
vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
vmx->vcpu.arch.event_exit_inst_len);
intr_info |= INTR_TYPE_SOFT_EXCEPTION;
- } else
+ } else {
intr_info |= INTR_TYPE_HARD_EXCEPTION;
+ if (ex->nested && is_fred_enabled(vcpu))
+ intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
+ }
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
@@ -7344,6 +7347,7 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
kvm_requeue_exception(vcpu, vector,
idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK,
error_code,
+ idt_vectoring_info & VECTORING_INFO_NESTED_EXCEPTION_MASK,
event_data);
break;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d1d42926ac67..7f013ff97067 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -841,6 +841,10 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.pending = true;
vcpu->arch.exception.injected = false;
+ vcpu->arch.exception.nested = vcpu->arch.exception.nested ||
+ vcpu->arch.nmi_injected ||
+ vcpu->arch.interrupt.injected;
+
vcpu->arch.exception.has_error_code = has_error;
vcpu->arch.exception.vector = nr;
vcpu->arch.exception.error_code = error_code;
@@ -870,8 +874,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.injected = false;
vcpu->arch.exception.pending = false;
+ /* #DF is NOT a nested event, per its definition. */
+ vcpu->arch.exception.nested = false;
+
kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
} else {
+ vcpu->arch.exception.nested = true;
+
/* replace previous exception with a new one in a hope
that instruction re-execution will regenerate lost
exception */
@@ -900,7 +909,8 @@ static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
}
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
- bool has_error_code, u32 error_code, u64 event_data)
+ bool has_error_code, u32 error_code, bool nested,
+ u64 event_data)
{
/*
@@ -925,6 +935,7 @@ void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.error_code = error_code;
vcpu->arch.exception.has_payload = false;
vcpu->arch.exception.payload = 0;
+ vcpu->arch.exception.nested = nested;
vcpu->arch.exception.event_data = event_data;
}
EXPORT_SYMBOL_GPL(kvm_requeue_exception);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 02514f5b9c0b..13dbd87970db 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -142,6 +142,7 @@ static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
{
vcpu->arch.exception.pending = false;
vcpu->arch.exception.injected = false;
+ vcpu->arch.exception.nested = false;
vcpu->arch.exception_vmexit.pending = false;
}
--
2.48.1
* [PATCH v4 12/19] KVM: x86: Save/restore the nested flag of an exception
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (10 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 11/19] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 13/19] KVM: x86: Mark CR4.FRED as not reserved Xin Li (Intel)
` (7 subsequent siblings)
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
Save/restore the nested flag of an exception during VM save/restore
and live migration to ensure a correct event stack level is chosen
when a nested exception is injected through FRED event delivery.
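On the restore side, the new flag follows the usual kvm_vcpu_events validity pattern, which can be sketched as below. The flag values come from the uapi changes in the patch; the `exception_nested_from_events()` helper is an assumption about how a consumer would combine the capability state with the flags word, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values from the uapi header changes in this patch. */
#define KVM_VCPUEVENT_VALID_PAYLOAD	0x00000010
#define KVM_VCPUEVENT_VALID_NESTED_FLAG	0x00000040

/* The exception_is_nested byte in struct kvm_vcpu_events is meaningful
 * only when KVM_CAP_EXCEPTION_NESTED_FLAG was enabled on the VM and
 * userspace set the corresponding validity flag; otherwise the restored
 * exception is treated as not nested. */
static bool exception_nested_from_events(uint32_t flags,
					 uint8_t exception_is_nested,
					 bool cap_nested_flag_enabled)
{
	if (!cap_nested_flag_enabled ||
	    !(flags & KVM_VCPUEVENT_VALID_NESTED_FLAG))
		return false;
	return exception_is_nested != 0;
}
```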
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---
Change in v4:
* Add live migration support for exception nested flag (Chao Gao).
---
Documentation/virt/kvm/api.rst | 19 +++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 4 +++-
arch/x86/kvm/x86.c | 19 ++++++++++++++++++-
include/uapi/linux/kvm.h | 1 +
5 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1f8625b7646a..32c00b07bcf1 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1184,6 +1184,10 @@ The following bits are defined in the flags field:
fields contain a valid state. This bit will be set whenever
KVM_CAP_EXCEPTION_PAYLOAD is enabled.
+- KVM_VCPUEVENT_VALID_NESTED_FLAG may be set to signal that the
+ exception is a nested exception. This bit will be set whenever
+ KVM_CAP_EXCEPTION_NESTED_FLAG is enabled.
+
- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
triple_fault_pending field contains a valid state. This bit will
be set whenever KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled.
@@ -1283,6 +1287,10 @@ can be set in the flags field to signal that the
exception_has_payload, exception_payload, and exception.pending fields
contain a valid state and shall be written into the VCPU.
+If KVM_CAP_EXCEPTION_NESTED_FLAG is enabled, KVM_VCPUEVENT_VALID_NESTED_FLAG
+can be set in the flags field to signal that the exception is a nested
+exception and exception_is_nested shall be written into the VCPU.
+
If KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled, KVM_VCPUEVENT_VALID_TRIPLE_FAULT
can be set in flags field to signal that the triple_fault field contains
a valid state and shall be written into the VCPU.
@@ -8280,6 +8288,17 @@ aforementioned registers before the first KVM_RUN. These registers are VM
scoped, meaning that the same set of values are presented on all vCPUs in a
given VM.
+7.38 KVM_CAP_EXCEPTION_NESTED_FLAG
+----------------------------------
+
+:Architectures: x86
+:Parameters: args[0] whether feature should be enabled or not
+
+With this capability enabled, an exception is saved/restored with the
+additional information of whether it was nested or not. FRED event
+delivery uses this information to ensure a correct event stack level
+is chosen when a VM entry injects a nested exception.
+
8. Other capabilities.
======================
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c5f92a1befc0..f8b9834f2f37 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,6 +1444,7 @@ struct kvm_arch {
bool guest_can_read_msr_platform_info;
bool exception_payload_enabled;
+ bool exception_nested_flag_enabled;
bool triple_fault_event;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 460306b35a4b..6a3a39d04843 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -326,6 +326,7 @@ struct kvm_reinject_control {
#define KVM_VCPUEVENT_VALID_SMM 0x00000008
#define KVM_VCPUEVENT_VALID_PAYLOAD 0x00000010
#define KVM_VCPUEVENT_VALID_TRIPLE_FAULT 0x00000020
+#define KVM_VCPUEVENT_VALID_NESTED_FLAG 0x00000040
/* Interrupt shadow states */
#define KVM_X86_SHADOW_INT_MOV_SS 0x01
@@ -363,7 +364,8 @@ struct kvm_vcpu_events {
struct {
__u8 pending;
} triple_fault;
- __u8 reserved[26];
+ __u8 reserved[25];
+ __u8 exception_is_nested;
__u8 exception_has_payload;
__u64 exception_payload;
};
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7f013ff97067..17b5a799f65d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4710,6 +4710,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_GET_MSR_FEATURES:
case KVM_CAP_MSR_PLATFORM_INFO:
case KVM_CAP_EXCEPTION_PAYLOAD:
+ case KVM_CAP_EXCEPTION_NESTED_FLAG:
case KVM_CAP_X86_TRIPLE_FAULT_EVENT:
case KVM_CAP_SET_GUEST_DEBUG:
case KVM_CAP_LAST_CPU:
@@ -5437,6 +5438,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
events->exception.error_code = ex->error_code;
events->exception_has_payload = ex->has_payload;
events->exception_payload = ex->payload;
+ events->exception_is_nested = ex->nested;
events->interrupt.injected =
vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
@@ -5462,6 +5464,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
| KVM_VCPUEVENT_VALID_SMM);
if (vcpu->kvm->arch.exception_payload_enabled)
events->flags |= KVM_VCPUEVENT_VALID_PAYLOAD;
+ if (vcpu->kvm->arch.exception_nested_flag_enabled)
+ events->flags |= KVM_VCPUEVENT_VALID_NESTED_FLAG;
if (vcpu->kvm->arch.triple_fault_event) {
events->triple_fault.pending = kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
events->flags |= KVM_VCPUEVENT_VALID_TRIPLE_FAULT;
@@ -5476,7 +5480,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
| KVM_VCPUEVENT_VALID_SHADOW
| KVM_VCPUEVENT_VALID_SMM
| KVM_VCPUEVENT_VALID_PAYLOAD
- | KVM_VCPUEVENT_VALID_TRIPLE_FAULT))
+ | KVM_VCPUEVENT_VALID_TRIPLE_FAULT
+ | KVM_VCPUEVENT_VALID_NESTED_FLAG))
return -EINVAL;
if (events->flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
@@ -5491,6 +5496,13 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
events->exception_has_payload = 0;
}
+ if (events->flags & KVM_VCPUEVENT_VALID_NESTED_FLAG) {
+ if (!vcpu->kvm->arch.exception_nested_flag_enabled)
+ return -EINVAL;
+ } else {
+ events->exception_is_nested = 0;
+ }
+
if ((events->exception.injected || events->exception.pending) &&
(events->exception.nr > 31 || events->exception.nr == NMI_VECTOR))
return -EINVAL;
@@ -5522,6 +5534,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
vcpu->arch.exception.error_code = events->exception.error_code;
vcpu->arch.exception.has_payload = events->exception_has_payload;
vcpu->arch.exception.payload = events->exception_payload;
+ vcpu->arch.exception.nested = events->exception_is_nested;
vcpu->arch.interrupt.injected = events->interrupt.injected;
vcpu->arch.interrupt.nr = events->interrupt.nr;
@@ -6644,6 +6657,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.exception_payload_enabled = cap->args[0];
r = 0;
break;
+ case KVM_CAP_EXCEPTION_NESTED_FLAG:
+ kvm->arch.exception_nested_flag_enabled = cap->args[0];
+ r = 0;
+ break;
case KVM_CAP_X86_TRIPLE_FAULT_EVENT:
kvm->arch.triple_fault_event = cap->args[0];
r = 0;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6ae8ad8934b..5ef33256858f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -930,6 +930,7 @@ struct kvm_enable_cap {
#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
#define KVM_CAP_X86_GUEST_MODE 238
#define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
+#define KVM_CAP_EXCEPTION_NESTED_FLAG 240
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.48.1
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v4 13/19] KVM: x86: Mark CR4.FRED as not reserved
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (11 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 12/19] KVM: x86: Save/restore the nested flag of an exception Xin Li (Intel)
@ 2025-03-28 17:11 ` Xin Li (Intel)
2025-03-28 17:12 ` [PATCH v4 14/19] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
` (6 subsequent siblings)
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:11 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
The CR4.FRED bit, i.e., CR4[32], is no longer a reserved bit when the
guest CPU capabilities include FRED, i.e., when
1) All of KVM's FRED support is in place.
2) The guest enumerates FRED.
Otherwise it remains a reserved bit.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change in v4:
* Rebase on top of "guest_cpu_cap".
Change in v3:
* Don't allow CR4.FRED=1 before all of FRED KVM support is in place
(Sean Christopherson).
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/x86.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8b9834f2f37..e94924397230 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -138,7 +138,7 @@
| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
| X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
| X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
- | X86_CR4_LAM_SUP))
+ | X86_CR4_LAM_SUP | X86_CR4_FRED))
#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 13dbd87970db..24661b2ad3ad 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -617,6 +617,8 @@ static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
__reserved_bits |= X86_CR4_PCIDE; \
if (!__cpu_has(__c, X86_FEATURE_LAM)) \
__reserved_bits |= X86_CR4_LAM_SUP; \
+ if (!__cpu_has(__c, X86_FEATURE_FRED)) \
+ __reserved_bits |= X86_CR4_FRED; \
__reserved_bits; \
})
--
2.48.1
* [PATCH v4 14/19] KVM: VMX: Dump FRED context in dump_vmcs()
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (12 preceding siblings ...)
2025-03-28 17:11 ` [PATCH v4 13/19] KVM: x86: Mark CR4.FRED as not reserved Xin Li (Intel)
@ 2025-03-28 17:12 ` Xin Li (Intel)
2025-06-24 16:32 ` Sean Christopherson
2025-03-28 17:12 ` [PATCH v4 15/19] KVM: x86: Allow FRED/LKGS to be advertised to guests Xin Li (Intel)
` (5 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:12 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Add FRED-related VMCS fields to dump_vmcs() to dump the FRED context.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change in v3:
* Use (vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED) instead of is_fred_enabled()
(Chao Gao).
Changes in v2:
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
* Dump guest FRED states only if guest has FRED enabled (Nikolay Borisov).
---
arch/x86/kvm/vmx/vmx.c | 40 +++++++++++++++++++++++++++++++++-------
1 file changed, 33 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c76015e1e3f8..03855d6690b2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6462,7 +6462,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 vmentry_ctl, vmexit_ctl;
u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
- u64 tertiary_exec_control;
+ u64 tertiary_exec_control, secondary_vmexit_ctl;
unsigned long cr4;
int efer_slot;
@@ -6473,6 +6473,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmentry_ctl = vmcs_read32(VM_ENTRY_CONTROLS);
vmexit_ctl = vmcs_read32(VM_EXIT_CONTROLS);
+ secondary_vmexit_ctl = cpu_has_secondary_vmexit_ctrls() ?
+ vmcs_read64(SECONDARY_VM_EXIT_CONTROLS) : 0;
cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
cr4 = vmcs_readl(GUEST_CR4);
@@ -6519,6 +6521,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmx_dump_sel("LDTR:", GUEST_LDTR_SELECTOR);
vmx_dump_dtsel("IDTR:", GUEST_IDTR_LIMIT);
vmx_dump_sel("TR: ", GUEST_TR_SELECTOR);
+ if (vmentry_ctl & VM_ENTRY_LOAD_IA32_FRED)
+ pr_err("FRED guest: config=0x%016llx, stack_levels=0x%016llx\n"
+ "RSP0=0x%016llx, RSP1=0x%016llx\n"
+ "RSP2=0x%016llx, RSP3=0x%016llx\n",
+ vmcs_read64(GUEST_IA32_FRED_CONFIG),
+ vmcs_read64(GUEST_IA32_FRED_STKLVLS),
+ __rdmsr(MSR_IA32_FRED_RSP0),
+ vmcs_read64(GUEST_IA32_FRED_RSP1),
+ vmcs_read64(GUEST_IA32_FRED_RSP2),
+ vmcs_read64(GUEST_IA32_FRED_RSP3));
efer_slot = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest, MSR_EFER);
if (vmentry_ctl & VM_ENTRY_LOAD_IA32_EFER)
pr_err("EFER= 0x%016llx\n", vmcs_read64(GUEST_IA32_EFER));
@@ -6566,6 +6578,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmcs_readl(HOST_TR_BASE));
pr_err("GDTBase=%016lx IDTBase=%016lx\n",
vmcs_readl(HOST_GDTR_BASE), vmcs_readl(HOST_IDTR_BASE));
+ if (vmexit_ctl & SECONDARY_VM_EXIT_LOAD_IA32_FRED)
+ pr_err("FRED host: config=0x%016llx, stack_levels=0x%016llx\n"
+ "RSP0=0x%016lx, RSP1=0x%016llx\n"
+ "RSP2=0x%016llx, RSP3=0x%016llx\n",
+ vmcs_read64(HOST_IA32_FRED_CONFIG),
+ vmcs_read64(HOST_IA32_FRED_STKLVLS),
+ (unsigned long)task_stack_page(current) + THREAD_SIZE,
+ vmcs_read64(HOST_IA32_FRED_RSP1),
+ vmcs_read64(HOST_IA32_FRED_RSP2),
+ vmcs_read64(HOST_IA32_FRED_RSP3));
pr_err("CR0=%016lx CR3=%016lx CR4=%016lx\n",
vmcs_readl(HOST_CR0), vmcs_readl(HOST_CR3),
vmcs_readl(HOST_CR4));
@@ -6587,25 +6609,29 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
pr_err("*** Control State ***\n");
pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
- pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
- pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
+ pr_err("PinBased=0x%08x EntryControls=0x%08x\n",
+ pin_based_exec_ctrl, vmentry_ctl);
+ pr_err("ExitControls=0x%08x SecondaryExitControls=0x%016llx\n",
+ vmexit_ctl, secondary_vmexit_ctl);
pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
vmcs_read32(EXCEPTION_BITMAP),
vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH));
- pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x\n",
+ pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x event_data=%016llx\n",
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE),
- vmcs_read32(VM_ENTRY_INSTRUCTION_LEN));
+ vmcs_read32(VM_ENTRY_INSTRUCTION_LEN),
+ kvm_cpu_cap_has(X86_FEATURE_FRED) ? vmcs_read64(INJECTED_EVENT_DATA) : 0);
pr_err("VMExit: intr_info=%08x errcode=%08x ilen=%08x\n",
vmcs_read32(VM_EXIT_INTR_INFO),
vmcs_read32(VM_EXIT_INTR_ERROR_CODE),
vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
pr_err(" reason=%08x qualification=%016lx\n",
vmcs_read32(VM_EXIT_REASON), vmcs_readl(EXIT_QUALIFICATION));
- pr_err("IDTVectoring: info=%08x errcode=%08x\n",
+ pr_err("IDTVectoring: info=%08x errcode=%08x event_data=%016llx\n",
vmcs_read32(IDT_VECTORING_INFO_FIELD),
- vmcs_read32(IDT_VECTORING_ERROR_CODE));
+ vmcs_read32(IDT_VECTORING_ERROR_CODE),
+ kvm_cpu_cap_has(X86_FEATURE_FRED) ? vmcs_read64(ORIGINAL_EVENT_DATA) : 0);
pr_err("TSC Offset = 0x%016llx\n", vmcs_read64(TSC_OFFSET));
if (secondary_exec_control & SECONDARY_EXEC_TSC_SCALING)
pr_err("TSC Multiplier = 0x%016llx\n",
--
2.48.1
* [PATCH v4 15/19] KVM: x86: Allow FRED/LKGS to be advertised to guests
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (13 preceding siblings ...)
2025-03-28 17:12 ` [PATCH v4 14/19] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
@ 2025-03-28 17:12 ` Xin Li (Intel)
2025-06-24 16:38 ` Sean Christopherson
2025-03-28 17:12 ` [PATCH v4 16/19] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
` (4 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:12 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Allow FRED/LKGS to be advertised to guests after changes required to
enable FRED in a KVM guest are in place.
LKGS is introduced with FRED to completely eliminate the need to execute
SWAPGS explicitly, because
1) FRED transitions ensure that an operating system can always operate
with its own GS base address.
2) LKGS behaves like the MOV-to-GS instruction except that it loads
the base address into the IA32_KERNEL_GS_BASE MSR instead of the
GS segment’s descriptor cache, which is exactly what the Linux kernel
does to load a user-level GS base. Thus there is no need to SWAPGS
away from the kernel GS base, and executing SWAPGS causes a #UD
when FRED transitions are enabled.
A FRED CPU must enumerate LKGS. When LKGS is not available, FRED must
not be enabled.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/cpuid.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 5e4d4934c0d3..8f290273aee1 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -992,6 +992,8 @@ void kvm_set_cpu_caps(void)
F(FZRM),
F(FSRS),
F(FSRC),
+ F(FRED),
+ F(LKGS),
F(AMX_FP16),
F(AVX_IFMA),
F(LAM),
--
2.48.1
* [PATCH v4 16/19] KVM: nVMX: Add support for the secondary VM exit controls
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (14 preceding siblings ...)
2025-03-28 17:12 ` [PATCH v4 15/19] KVM: x86: Allow FRED/LKGS to be advertised to guests Xin Li (Intel)
@ 2025-03-28 17:12 ` Xin Li (Intel)
2025-06-24 16:54 ` Sean Christopherson
2025-03-28 17:12 ` [PATCH v4 17/19] KVM: nVMX: Add FRED VMCS fields to nested VMX context management Xin Li (Intel)
` (3 subsequent siblings)
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:12 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Enable the secondary VM exit controls to prepare for nested FRED.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Change in v3:
* Read secondary VM exit controls from vmcs_conf instead of the hardware
MSR MSR_IA32_VMX_EXIT_CTLS2 to avoid advertising features to L1 that KVM
itself doesn't support, e.g. because the expected entry+exit pairs aren't
supported. (Sean Christopherson)
---
Documentation/virt/kvm/x86/nested-vmx.rst | 1 +
arch/x86/kvm/vmx/capabilities.h | 1 +
arch/x86/kvm/vmx/nested.c | 21 ++++++++++++++++++++-
arch/x86/kvm/vmx/vmcs12.c | 1 +
arch/x86/kvm/vmx/vmcs12.h | 2 ++
arch/x86/kvm/x86.h | 2 +-
6 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index ac2095d41f02..e64ef231f310 100644
--- a/Documentation/virt/kvm/x86/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -217,6 +217,7 @@ struct shadow_vmcs is ever changed.
u16 host_fs_selector;
u16 host_gs_selector;
u16 host_tr_selector;
+ u64 secondary_vm_exit_controls;
};
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index b4f49a4690ca..d29be4e4124e 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -38,6 +38,7 @@ struct nested_vmx_msrs {
u32 pinbased_ctls_high;
u32 exit_ctls_low;
u32 exit_ctls_high;
+ u64 secondary_exit_ctls;
u32 entry_ctls_low;
u32 entry_ctls_high;
u32 misc_low;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 5504d9e9fd32..8b0c5e5f1e98 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1457,6 +1457,7 @@ int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
case MSR_IA32_VMX_PINBASED_CTLS:
case MSR_IA32_VMX_PROCBASED_CTLS:
case MSR_IA32_VMX_EXIT_CTLS:
+ case MSR_IA32_VMX_EXIT_CTLS2:
case MSR_IA32_VMX_ENTRY_CTLS:
/*
* The "non-true" VMX capability MSRs are generated from the
@@ -1535,6 +1536,9 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
if (msr_index == MSR_IA32_VMX_EXIT_CTLS)
*pdata |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
break;
+ case MSR_IA32_VMX_EXIT_CTLS2:
+ *pdata = msrs->secondary_exit_ctls;
+ break;
case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
case MSR_IA32_VMX_ENTRY_CTLS:
*pdata = vmx_control_msr(
@@ -2485,6 +2489,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
exec_control &= ~VM_EXIT_LOAD_IA32_EFER;
vm_exit_controls_set(vmx, exec_control);
+ if (exec_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
+ exec_control = __secondary_vm_exit_controls_get(vmcs01);
+ secondary_vm_exit_controls_set(vmx, exec_control);
+ }
+
/*
* Interrupt/Exception Fields
*/
@@ -7011,7 +7020,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
VM_EXIT_HOST_ADDR_SPACE_SIZE |
#endif
VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
- VM_EXIT_CLEAR_BNDCFGS;
+ VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
msrs->exit_ctls_high |=
VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -7020,6 +7029,16 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
/* We support free control of debug control saving. */
msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
+
+ if (msrs->exit_ctls_high & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
+ msrs->secondary_exit_ctls = vmcs_conf->vmexit_2nd_ctrl;
+ /*
+ * As the secondary VM exit control is always loaded, do not
+ * advertise any feature in it to nVMX until its nVMX support
+ * is ready.
+ */
+ msrs->secondary_exit_ctls &= 0;
+ }
}
static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..9fac24fd5b4b 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -66,6 +66,7 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD64(HOST_IA32_PAT, host_ia32_pat),
FIELD64(HOST_IA32_EFER, host_ia32_efer),
FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
+ FIELD64(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls),
FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 56fd150a6f24..1fe3ed9108aa 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -185,6 +185,7 @@ struct __packed vmcs12 {
u16 host_gs_selector;
u16 host_tr_selector;
u16 guest_pml_index;
+ u64 secondary_vm_exit_controls;
};
/*
@@ -360,6 +361,7 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_gs_selector, 992);
CHECK_OFFSET(host_tr_selector, 994);
CHECK_OFFSET(guest_pml_index, 996);
+ CHECK_OFFSET(secondary_vm_exit_controls, 998);
}
extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 24661b2ad3ad..75e1a0eb504c 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -67,7 +67,7 @@ void kvm_spurious_fault(void);
* associated feature that KVM supports for nested virtualization.
*/
#define KVM_FIRST_EMULATED_VMX_MSR MSR_IA32_VMX_BASIC
-#define KVM_LAST_EMULATED_VMX_MSR MSR_IA32_VMX_VMFUNC
+#define KVM_LAST_EMULATED_VMX_MSR MSR_IA32_VMX_EXIT_CTLS2
#define KVM_DEFAULT_PLE_GAP 128
#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
--
2.48.1
* [PATCH v4 17/19] KVM: nVMX: Add FRED VMCS fields to nested VMX context management
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (15 preceding siblings ...)
2025-03-28 17:12 ` [PATCH v4 16/19] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
@ 2025-03-28 17:12 ` Xin Li (Intel)
2025-03-28 17:12 ` [PATCH v4 18/19] KVM: nVMX: Add VMCS FRED states checking Xin Li (Intel)
` (2 subsequent siblings)
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:12 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
Changes in v4:
* Advertise VMX nested exception as if the CPU supports it (Chao Gao).
* Split FRED state management controls (Chao Gao).
Changes in v3:
* Add and use nested_cpu_has_fred(vmcs12) because vmcs02 should be set
from vmcs12 if and only if the field is enabled in L1's VMX config
(Sean Christopherson).
* Fix coding style issues (Sean Christopherson).
Changes in v2:
* Remove hyperv TLFS related changes (Jeremi Piotrowski).
* Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
---
Documentation/virt/kvm/x86/nested-vmx.rst | 18 +++++
arch/x86/kvm/vmx/capabilities.h | 5 ++
arch/x86/kvm/vmx/nested.c | 83 ++++++++++++++++++++++-
arch/x86/kvm/vmx/nested.h | 22 ++++++
arch/x86/kvm/vmx/vmcs12.c | 18 +++++
arch/x86/kvm/vmx/vmcs12.h | 36 ++++++++++
arch/x86/kvm/vmx/vmcs_shadow_fields.h | 4 ++
7 files changed, 184 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index e64ef231f310..87fa9f3877ab 100644
--- a/Documentation/virt/kvm/x86/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -218,6 +218,24 @@ struct shadow_vmcs is ever changed.
u16 host_gs_selector;
u16 host_tr_selector;
u64 secondary_vm_exit_controls;
+ u64 guest_ia32_fred_config;
+ u64 guest_ia32_fred_rsp1;
+ u64 guest_ia32_fred_rsp2;
+ u64 guest_ia32_fred_rsp3;
+ u64 guest_ia32_fred_stklvls;
+ u64 guest_ia32_fred_ssp1;
+ u64 guest_ia32_fred_ssp2;
+ u64 guest_ia32_fred_ssp3;
+ u64 host_ia32_fred_config;
+ u64 host_ia32_fred_rsp1;
+ u64 host_ia32_fred_rsp2;
+ u64 host_ia32_fred_rsp3;
+ u64 host_ia32_fred_stklvls;
+ u64 host_ia32_fred_ssp1;
+ u64 host_ia32_fred_ssp2;
+ u64 host_ia32_fred_ssp3;
+ u64 injected_event_data;
+ u64 original_event_data;
};
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index d29be4e4124e..b1abbdb48449 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -79,6 +79,11 @@ static inline bool cpu_has_vmx_basic_inout(void)
return vmcs_config.basic & VMX_BASIC_INOUT;
}
+static inline bool cpu_has_vmx_nested_exception(void)
+{
+ return vmcs_config.basic & VMX_BASIC_NESTED_EXCEPTION;
+}
+
static inline bool cpu_has_virtual_nmis(void)
{
return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 8b0c5e5f1e98..6ff7ae3b7a33 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -704,6 +704,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_FRED_RSP0, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_FRED_SSP0, MSR_TYPE_RW);
#endif
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
@@ -1256,9 +1262,11 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
{
const u64 feature_bits = VMX_BASIC_DUAL_MONITOR_TREATMENT |
VMX_BASIC_INOUT |
- VMX_BASIC_TRUE_CTLS;
+ VMX_BASIC_TRUE_CTLS |
+ VMX_BASIC_NESTED_EXCEPTION;
- const u64 reserved_bits = GENMASK_ULL(63, 56) |
+ const u64 reserved_bits = GENMASK_ULL(63, 59) |
+ GENMASK_ULL(57, 56) |
GENMASK_ULL(47, 45) |
BIT_ULL(31);
@@ -2506,6 +2514,8 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
vmcs12->vm_entry_instruction_len);
vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
vmcs12->guest_interruptibility_info);
+ if (cpu_has_vmx_fred())
+ vmcs_write64(INJECTED_EVENT_DATA, vmcs12->injected_event_data);
vmx->loaded_vmcs->nmi_known_unmasked =
!(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_NMI);
} else {
@@ -2558,6 +2568,17 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
vmx_segment_cache_clear(vmx);
+
+ if (nested_cpu_load_guest_fred_states(vmcs12)) {
+ vmcs_write64(GUEST_IA32_FRED_CONFIG, vmcs12->guest_ia32_fred_config);
+ vmcs_write64(GUEST_IA32_FRED_RSP1, vmcs12->guest_ia32_fred_rsp1);
+ vmcs_write64(GUEST_IA32_FRED_RSP2, vmcs12->guest_ia32_fred_rsp2);
+ vmcs_write64(GUEST_IA32_FRED_RSP3, vmcs12->guest_ia32_fred_rsp3);
+ vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmcs12->guest_ia32_fred_stklvls);
+ vmcs_write64(GUEST_IA32_FRED_SSP1, vmcs12->guest_ia32_fred_ssp1);
+ vmcs_write64(GUEST_IA32_FRED_SSP2, vmcs12->guest_ia32_fred_ssp2);
+ vmcs_write64(GUEST_IA32_FRED_SSP3, vmcs12->guest_ia32_fred_ssp3);
+ }
}
if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
@@ -3842,6 +3863,8 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
u32 idt_vectoring;
unsigned int nr;
+ vmcs12->original_event_data = 0;
+
/*
* Per the SDM, VM-Exits due to double and triple faults are never
* considered to occur during event delivery, even if the double/triple
@@ -3880,6 +3903,13 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
vcpu->arch.exception.error_code;
}
+ if ((vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) &&
+ (vmcs12->guest_cr4 & X86_CR4_FRED) &&
+ (vcpu->arch.exception.nested))
+ idt_vectoring |= VECTORING_INFO_NESTED_EXCEPTION_MASK;
+
+ vmcs12->original_event_data = vcpu->arch.exception.event_data;
+
vmcs12->idt_vectoring_info_field = idt_vectoring;
} else if (vcpu->arch.nmi_injected) {
vmcs12->idt_vectoring_info_field =
@@ -4460,6 +4490,14 @@ static bool is_vmcs12_ext_field(unsigned long field)
case GUEST_TR_BASE:
case GUEST_GDTR_BASE:
case GUEST_IDTR_BASE:
+ case GUEST_IA32_FRED_CONFIG:
+ case GUEST_IA32_FRED_RSP1:
+ case GUEST_IA32_FRED_RSP2:
+ case GUEST_IA32_FRED_RSP3:
+ case GUEST_IA32_FRED_STKLVLS:
+ case GUEST_IA32_FRED_SSP1:
+ case GUEST_IA32_FRED_SSP2:
+ case GUEST_IA32_FRED_SSP3:
case GUEST_PENDING_DBG_EXCEPTIONS:
case GUEST_BNDCFGS:
return true;
@@ -4509,6 +4547,18 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+ if (nested_cpu_save_guest_fred_states(vmcs12)) {
+ vmcs12->guest_ia32_fred_config = vmcs_read64(GUEST_IA32_FRED_CONFIG);
+ vmcs12->guest_ia32_fred_rsp1 = vmcs_read64(GUEST_IA32_FRED_RSP1);
+ vmcs12->guest_ia32_fred_rsp2 = vmcs_read64(GUEST_IA32_FRED_RSP2);
+ vmcs12->guest_ia32_fred_rsp3 = vmcs_read64(GUEST_IA32_FRED_RSP3);
+ vmcs12->guest_ia32_fred_stklvls = vmcs_read64(GUEST_IA32_FRED_STKLVLS);
+ vmcs12->guest_ia32_fred_ssp1 = vmcs_read64(GUEST_IA32_FRED_SSP1);
+ vmcs12->guest_ia32_fred_ssp2 = vmcs_read64(GUEST_IA32_FRED_SSP2);
+ vmcs12->guest_ia32_fred_ssp3 = vmcs_read64(GUEST_IA32_FRED_SSP3);
+ }
+
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
@@ -4656,6 +4706,21 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
vmcs12->vm_exit_intr_info = exit_intr_info;
vmcs12->vm_exit_instruction_len = exit_insn_len;
+
+ /*
+ * When there is a valid original event, the exiting event is a nested
+ * event during delivery of the earlier original event.
+ *
+ * FRED event delivery reflects this relationship by setting the value
+ * of the nested exception bit of VM-exit interruption information
+ * (aka exiting-event identification) to that of the valid bit of the
+ * IDT-vectoring information (aka original-event identification).
+ */
+ if ((vmcs12->idt_vectoring_info_field & VECTORING_INFO_VALID_MASK) &&
+ (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) &&
+ (vmcs12->guest_cr4 & X86_CR4_FRED))
+ vmcs12->vm_exit_intr_info |= INTR_INFO_NESTED_EXCEPTION_MASK;
+
vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
/*
@@ -4733,6 +4798,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
vmcs_write32(GUEST_IDTR_LIMIT, 0xFFFF);
vmcs_write32(GUEST_GDTR_LIMIT, 0xFFFF);
+ if (nested_cpu_load_host_fred_states(vmcs12)) {
+ vmcs_write64(GUEST_IA32_FRED_CONFIG, vmcs12->host_ia32_fred_config);
+ vmcs_write64(GUEST_IA32_FRED_RSP1, vmcs12->host_ia32_fred_rsp1);
+ vmcs_write64(GUEST_IA32_FRED_RSP2, vmcs12->host_ia32_fred_rsp2);
+ vmcs_write64(GUEST_IA32_FRED_RSP3, vmcs12->host_ia32_fred_rsp3);
+ vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmcs12->host_ia32_fred_stklvls);
+ vmcs_write64(GUEST_IA32_FRED_SSP1, vmcs12->host_ia32_fred_ssp1);
+ vmcs_write64(GUEST_IA32_FRED_SSP2, vmcs12->host_ia32_fred_ssp2);
+ vmcs_write64(GUEST_IA32_FRED_SSP3, vmcs12->host_ia32_fred_ssp3);
+ }
+
/* If not VM_EXIT_CLEAR_BNDCFGS, the L2 value propagates to L1. */
if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
vmcs_write64(GUEST_BNDCFGS, 0);
@@ -7206,6 +7282,9 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
msrs->basic |= VMX_BASIC_TRUE_CTLS;
if (cpu_has_vmx_basic_inout())
msrs->basic |= VMX_BASIC_INOUT;
+
+ if (cpu_has_vmx_nested_exception())
+ msrs->basic |= VMX_BASIC_NESTED_EXCEPTION;
}
static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index 6eedcfc91070..c6b69699e28e 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -249,6 +249,11 @@ static inline bool nested_cpu_has_save_preemption_timer(struct vmcs12 *vmcs12)
VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
}
+static inline bool nested_cpu_has_secondary_vm_exit_controls(struct vmcs12 *vmcs12)
+{
+ return vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
+}
+
static inline bool nested_exit_on_nmi(struct kvm_vcpu *vcpu)
{
return nested_cpu_has_nmi_exiting(get_vmcs12(vcpu));
@@ -269,6 +274,23 @@ static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
}
+static inline bool nested_cpu_load_guest_fred_states(struct vmcs12 *vmcs12)
+{
+ return vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED;
+}
+
+static inline bool nested_cpu_save_guest_fred_states(struct vmcs12 *vmcs12)
+{
+ return nested_cpu_has_secondary_vm_exit_controls(vmcs12) &&
+ vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_SAVE_IA32_FRED;
+}
+
+static inline bool nested_cpu_load_host_fred_states(struct vmcs12 *vmcs12)
+{
+ return nested_cpu_has_secondary_vm_exit_controls(vmcs12) &&
+ vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED;
+}
+
/*
* if fixed0[i] == 1: val[i] must be 1
* if fixed1[i] == 0: val[i] must be 0
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 9fac24fd5b4b..5fa63326deba 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -67,6 +67,24 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD64(HOST_IA32_EFER, host_ia32_efer),
FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
FIELD64(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls),
+ FIELD64(INJECTED_EVENT_DATA, injected_event_data),
+ FIELD64(ORIGINAL_EVENT_DATA, original_event_data),
+ FIELD64(GUEST_IA32_FRED_CONFIG, guest_ia32_fred_config),
+ FIELD64(GUEST_IA32_FRED_RSP1, guest_ia32_fred_rsp1),
+ FIELD64(GUEST_IA32_FRED_RSP2, guest_ia32_fred_rsp2),
+ FIELD64(GUEST_IA32_FRED_RSP3, guest_ia32_fred_rsp3),
+ FIELD64(GUEST_IA32_FRED_STKLVLS, guest_ia32_fred_stklvls),
+ FIELD64(GUEST_IA32_FRED_SSP1, guest_ia32_fred_ssp1),
+ FIELD64(GUEST_IA32_FRED_SSP2, guest_ia32_fred_ssp2),
+ FIELD64(GUEST_IA32_FRED_SSP3, guest_ia32_fred_ssp3),
+ FIELD64(HOST_IA32_FRED_CONFIG, host_ia32_fred_config),
+ FIELD64(HOST_IA32_FRED_RSP1, host_ia32_fred_rsp1),
+ FIELD64(HOST_IA32_FRED_RSP2, host_ia32_fred_rsp2),
+ FIELD64(HOST_IA32_FRED_RSP3, host_ia32_fred_rsp3),
+ FIELD64(HOST_IA32_FRED_STKLVLS, host_ia32_fred_stklvls),
+ FIELD64(HOST_IA32_FRED_SSP1, host_ia32_fred_ssp1),
+ FIELD64(HOST_IA32_FRED_SSP2, host_ia32_fred_ssp2),
+ FIELD64(HOST_IA32_FRED_SSP3, host_ia32_fred_ssp3),
FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 1fe3ed9108aa..f2a33d7007c9 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -186,6 +186,24 @@ struct __packed vmcs12 {
u16 host_tr_selector;
u16 guest_pml_index;
u64 secondary_vm_exit_controls;
+ u64 guest_ia32_fred_config;
+ u64 guest_ia32_fred_rsp1;
+ u64 guest_ia32_fred_rsp2;
+ u64 guest_ia32_fred_rsp3;
+ u64 guest_ia32_fred_stklvls;
+ u64 guest_ia32_fred_ssp1;
+ u64 guest_ia32_fred_ssp2;
+ u64 guest_ia32_fred_ssp3;
+ u64 host_ia32_fred_config;
+ u64 host_ia32_fred_rsp1;
+ u64 host_ia32_fred_rsp2;
+ u64 host_ia32_fred_rsp3;
+ u64 host_ia32_fred_stklvls;
+ u64 host_ia32_fred_ssp1;
+ u64 host_ia32_fred_ssp2;
+ u64 host_ia32_fred_ssp3;
+ u64 injected_event_data;
+ u64 original_event_data;
};
/*
@@ -362,6 +380,24 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_tr_selector, 994);
CHECK_OFFSET(guest_pml_index, 996);
CHECK_OFFSET(secondary_vm_exit_controls, 998);
+ CHECK_OFFSET(guest_ia32_fred_config, 1006);
+ CHECK_OFFSET(guest_ia32_fred_rsp1, 1014);
+ CHECK_OFFSET(guest_ia32_fred_rsp2, 1022);
+ CHECK_OFFSET(guest_ia32_fred_rsp3, 1030);
+ CHECK_OFFSET(guest_ia32_fred_stklvls, 1038);
+ CHECK_OFFSET(guest_ia32_fred_ssp1, 1046);
+ CHECK_OFFSET(guest_ia32_fred_ssp2, 1054);
+ CHECK_OFFSET(guest_ia32_fred_ssp3, 1062);
+ CHECK_OFFSET(host_ia32_fred_config, 1070);
+ CHECK_OFFSET(host_ia32_fred_rsp1, 1078);
+ CHECK_OFFSET(host_ia32_fred_rsp2, 1086);
+ CHECK_OFFSET(host_ia32_fred_rsp3, 1094);
+ CHECK_OFFSET(host_ia32_fred_stklvls, 1102);
+ CHECK_OFFSET(host_ia32_fred_ssp1, 1110);
+ CHECK_OFFSET(host_ia32_fred_ssp2, 1118);
+ CHECK_OFFSET(host_ia32_fred_ssp3, 1126);
+ CHECK_OFFSET(injected_event_data, 1134);
+ CHECK_OFFSET(original_event_data, 1142);
}
extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/vmx/vmcs_shadow_fields.h b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
index cad128d1657b..da338327c2b3 100644
--- a/arch/x86/kvm/vmx/vmcs_shadow_fields.h
+++ b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
@@ -74,6 +74,10 @@ SHADOW_FIELD_RW(HOST_GS_BASE, host_gs_base)
/* 64-bit */
SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, guest_physical_address)
SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address)
+SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA, original_event_data)
+SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA_HIGH, original_event_data)
+SHADOW_FIELD_RW(INJECTED_EVENT_DATA, injected_event_data)
+SHADOW_FIELD_RW(INJECTED_EVENT_DATA_HIGH, injected_event_data)
#undef SHADOW_FIELD_RO
#undef SHADOW_FIELD_RW
--
2.48.1
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v4 18/19] KVM: nVMX: Add VMCS FRED states checking
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (16 preceding siblings ...)
2025-03-28 17:12 ` [PATCH v4 17/19] KVM: nVMX: Add FRED VMCS fields to nested VMX context management Xin Li (Intel)
@ 2025-03-28 17:12 ` Xin Li (Intel)
2025-03-28 17:12 ` [PATCH v4 19/19] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
2025-03-28 17:25 ` [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:12 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
As on real hardware, nested VMX performs checks on various VMCS fields,
including both controls and guest/host state. Add FRED-related VMCS
field checks with the addition of nested FRED.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/nested.c | 80 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 79 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 6ff7ae3b7a33..538ab3418957 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2945,6 +2945,8 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ bool fred_enabled = (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) &&
+ (vmcs12->guest_cr4 & X86_CR4_FRED);
if (CC(!vmx_control_verify(vmcs12->vm_entry_controls,
vmx->nested.msrs.entry_ctls_low,
@@ -2963,6 +2965,7 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
u32 intr_type = intr_info & INTR_INFO_INTR_TYPE_MASK;
bool has_error_code = intr_info & INTR_INFO_DELIVER_CODE_MASK;
bool should_have_error_code;
+ bool has_nested_exception = vmx->nested.msrs.basic & VMX_BASIC_NESTED_EXCEPTION;
bool urg = nested_cpu_has2(vmcs12,
SECONDARY_EXEC_UNRESTRICTED_GUEST);
bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
@@ -2976,7 +2979,9 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
/* VM-entry interruption-info field: vector */
if (CC(intr_type == INTR_TYPE_NMI_INTR && vector != NMI_VECTOR) ||
CC(intr_type == INTR_TYPE_HARD_EXCEPTION && vector > 31) ||
- CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
+ CC(intr_type == INTR_TYPE_OTHER_EVENT &&
+ ((!fred_enabled && vector > 0) ||
+ (fred_enabled && vector > 2))))
return -EINVAL;
/* VM-entry interruption-info field: deliver error code */
@@ -2995,6 +3000,15 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
if (CC(intr_info & INTR_INFO_RESVD_BITS_MASK))
return -EINVAL;
+ /*
+ * When the CPU enumerates VMX nested-exception support, bit 13
+ * (set to indicate a nested exception) of the intr info field
+ * may have value 1. Otherwise bit 13 is reserved.
+ */
+ if (CC(!has_nested_exception &&
+ (intr_info & INTR_INFO_NESTED_EXCEPTION_MASK)))
+ return -EINVAL;
+
/* VM-entry instruction length */
switch (intr_type) {
case INTR_TYPE_SOFT_EXCEPTION:
@@ -3004,6 +3018,12 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
CC(vmcs12->vm_entry_instruction_len == 0 &&
CC(!nested_cpu_has_zero_length_injection(vcpu))))
return -EINVAL;
+ break;
+ case INTR_TYPE_OTHER_EVENT:
+ if (fred_enabled && (vector == 1 || vector == 2))
+ if (CC(vmcs12->vm_entry_instruction_len > 15))
+ return -EINVAL;
+ break;
}
}
@@ -3077,9 +3097,30 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
if (ia32e) {
if (CC(!(vmcs12->host_cr4 & X86_CR4_PAE)))
return -EINVAL;
+ if (vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
+ vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED) {
+ /* Bit 11, bits 5:4, and bit 2 of the IA32_FRED_CONFIG must be zero */
+ if (CC(vmcs12->host_ia32_fred_config &
+ (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2))) ||
+ CC(vmcs12->host_ia32_fred_rsp1 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->host_ia32_fred_rsp2 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->host_ia32_fred_rsp3 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->host_ia32_fred_ssp1 & GENMASK_ULL(2, 0)) ||
+ CC(vmcs12->host_ia32_fred_ssp2 & GENMASK_ULL(2, 0)) ||
+ CC(vmcs12->host_ia32_fred_ssp3 & GENMASK_ULL(2, 0)) ||
+ CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_config & PAGE_MASK, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_rsp1, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_rsp2, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_rsp3, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_ssp1, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_ssp2, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->host_ia32_fred_ssp3, vcpu)))
+ return -EINVAL;
+ }
} else {
if (CC(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) ||
CC(vmcs12->host_cr4 & X86_CR4_PCIDE) ||
+ CC(vmcs12->host_cr4 & X86_CR4_FRED) ||
CC((vmcs12->host_rip) >> 32))
return -EINVAL;
}
@@ -3223,6 +3264,43 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
CC((vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD))))
return -EINVAL;
+ if (ia32e) {
+ if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED) {
+ /* Bit 11, bits 5:4, and bit 2 of the IA32_FRED_CONFIG must be zero */
+ if (CC(vmcs12->guest_ia32_fred_config &
+ (BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2))) ||
+ CC(vmcs12->guest_ia32_fred_rsp1 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->guest_ia32_fred_rsp2 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->guest_ia32_fred_rsp3 & GENMASK_ULL(5, 0)) ||
+ CC(vmcs12->guest_ia32_fred_ssp1 & GENMASK_ULL(2, 0)) ||
+ CC(vmcs12->guest_ia32_fred_ssp2 & GENMASK_ULL(2, 0)) ||
+ CC(vmcs12->guest_ia32_fred_ssp3 & GENMASK_ULL(2, 0)) ||
+ CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_config & PAGE_MASK, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_rsp1, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_rsp2, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_rsp3, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_ssp1, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_ssp2, vcpu)) ||
+ CC(is_noncanonical_msr_address(vmcs12->guest_ia32_fred_ssp3, vcpu)))
+ return -EINVAL;
+ }
+ if (vmcs12->guest_cr4 & X86_CR4_FRED) {
+ unsigned int ss_dpl = VMX_AR_DPL(vmcs12->guest_ss_ar_bytes);
+ if (CC(ss_dpl == 1 || ss_dpl == 2))
+ return -EINVAL;
+ if (ss_dpl == 0 &&
+ CC(!(vmcs12->guest_cs_ar_bytes & VMX_AR_L_MASK)))
+ return -EINVAL;
+ if (ss_dpl == 3 &&
+ (CC(vmcs12->guest_rflags & X86_EFLAGS_IOPL) ||
+ CC(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_STI)))
+ return -EINVAL;
+ }
+ } else {
+ if (CC(vmcs12->guest_cr4 & X86_CR4_FRED))
+ return -EINVAL;
+ }
+
if (nested_check_guest_non_reg_state(vmcs12))
return -EINVAL;
--
2.48.1
* [PATCH v4 19/19] KVM: nVMX: Allow VMX FRED controls
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (17 preceding siblings ...)
2025-03-28 17:12 ` [PATCH v4 18/19] KVM: nVMX: Add VMCS FRED states checking Xin Li (Intel)
@ 2025-03-28 17:12 ` Xin Li (Intel)
2025-03-28 17:25 ` [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li
19 siblings, 0 replies; 44+ messages in thread
From: Xin Li (Intel) @ 2025-03-28 17:12 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
From: Xin Li <xin3.li@intel.com>
Allow the nVMX FRED controls now that nested FRED support is in place.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---
arch/x86/kvm/vmx/nested.c | 6 ++++--
arch/x86/kvm/vmx/vmx.c | 1 +
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 538ab3418957..e64ac0d1f6f2 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -7191,7 +7191,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
* advertise any feature in it to nVMX until its nVMX support
* is ready.
*/
- msrs->secondary_exit_ctls &= 0;
+ msrs->secondary_exit_ctls &= SECONDARY_VM_EXIT_SAVE_IA32_FRED |
+ SECONDARY_VM_EXIT_LOAD_IA32_FRED;
}
}
@@ -7206,7 +7207,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
#ifdef CONFIG_X86_64
VM_ENTRY_IA32E_MODE |
#endif
- VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+ VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+ VM_ENTRY_LOAD_IA32_FRED;
msrs->entry_ctls_high |=
(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 03855d6690b2..601753a90b53 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7973,6 +7973,7 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
cr4_fixed1_update(X86_CR4_LAM_SUP, eax, feature_bit(LAM));
+ cr4_fixed1_update(X86_CR4_FRED, eax, feature_bit(FRED));
#undef cr4_fixed1_update
}
--
2.48.1
* Re: [PATCH v4 00/19] Enable FRED with KVM VMX
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
` (18 preceding siblings ...)
2025-03-28 17:12 ` [PATCH v4 19/19] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
@ 2025-03-28 17:25 ` Xin Li
2025-06-24 17:06 ` Sean Christopherson
19 siblings, 1 reply; 44+ messages in thread
From: Xin Li @ 2025-03-28 17:25 UTC (permalink / raw)
To: pbonzini, seanjc, kvm, linux-doc, linux-kernel
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, andrew.cooper3,
luto, peterz, chao.gao, xin3.li
On 3/28/2025 10:11 AM, Xin Li (Intel) wrote:
> This patch set enables the Intel flexible return and event delivery
> (FRED) architecture with KVM VMX to allow guests to utilize FRED.
>
> The FRED architecture defines simple new transitions that change
> privilege level (ring transitions). The FRED architecture was
> designed with the following goals:
>
> 1) Improve overall performance and response time by replacing event
> delivery through the interrupt descriptor table (IDT event
> delivery) and event return by the IRET instruction with lower
> latency transitions.
>
> 2) Improve software robustness by ensuring that event delivery
> establishes the full supervisor context and that event return
> establishes the full user context.
>
> The new transitions defined by the FRED architecture are FRED event
> delivery and, for returning from events, two FRED return instructions.
> FRED event delivery can effect a transition from ring 3 to ring 0, but
> it is used also to deliver events incident to ring 0. One FRED
> instruction (ERETU) effects a return from ring 0 to ring 3, while the
> other (ERETS) returns while remaining in ring 0. Collectively, FRED
> event delivery and the FRED return instructions are FRED transitions.
>
> Intel VMX architecture is extended to run FRED guests, and the major
> changes are:
>
> 1) New VMCS fields for FRED context management, which includes two new
> event data VMCS fields, eight new guest FRED context VMCS fields and
> eight new host FRED context VMCS fields.
>
> 2) VMX nested-exception support for proper virtualization of stack
> levels introduced with FRED architecture.
>
> Search for the latest FRED spec in most search engines with this search
> pattern:
>
> site:intel.com FRED (flexible return and event delivery) specification
>
> Following is the link to the v3 of this patch set:
> https://lore.kernel.org/lkml/20241001050110.3643764-1-xin@zytor.com/
>
> Since several preparatory patches in v3 have been merged, and Sean
> reiterated that it's NOT worth it to precisely track which fields are/
> aren't supported [1], the v4 patch count is reduced to 19.
>
> Although FRED and CET supervisor shadow stacks are independent CPU
> features, FRED unconditionally includes FRED shadow stack pointer
> MSRs IA32_FRED_SSP[0123], and IA32_FRED_SSP0 is just an alias of the
> CET MSR IA32_PL0_SSP. IOW, the state management of MSR IA32_PL0_SSP
> becomes an overlap area, and Sean requested that FRED virtualization
> land after CET virtualization [2].
Hi Sean,
Any chance we could merge FRED ahead of CET?
Ofc with proper changes to FRED code.
Thanks!
Xin
>
> [1]: https://lore.kernel.org/lkml/Z73uK5IzVoBej3mi@google.com/
> [2]: https://lore.kernel.org/kvm/ZvQaNRhrsSJTYji3@google.com/
* Re: [PATCH v4 04/19] x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use
2025-03-28 17:11 ` [PATCH v4 04/19] x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use Xin Li (Intel)
@ 2025-04-10 8:53 ` Christoph Hellwig
2025-04-10 14:18 ` Dave Hansen
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2025-04-10 8:53 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, seanjc, kvm, linux-doc, linux-kernel, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, andrew.cooper3, luto, peterz,
chao.gao, xin3.li
On Fri, Mar 28, 2025 at 10:11:50AM -0700, Xin Li (Intel) wrote:
> The per CPU array 'cea_exception_stacks' points to per CPU stacks
> +/*
> + * FRED introduced new fields in the host-state area of the VMCS for
> + * stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
> + * corresponding to per CPU stacks for #DB, NMI and #DF. KVM must
> + * populate these each time a vCPU is loaded onto a CPU.
> + */
> +EXPORT_PER_CPU_SYMBOL(cea_exception_stacks);
Exporting data vs accessors for it is usually a bad idea. Doing a
non-_GPL export for such a very low-level data structure is even worse.
* Re: [PATCH v4 04/19] x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use
2025-04-10 8:53 ` Christoph Hellwig
@ 2025-04-10 14:18 ` Dave Hansen
2025-04-11 16:16 ` Xin Li
0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2025-04-10 14:18 UTC (permalink / raw)
To: Christoph Hellwig, Xin Li (Intel)
Cc: pbonzini, seanjc, kvm, linux-doc, linux-kernel, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, andrew.cooper3, luto, peterz,
chao.gao, xin3.li
On 4/10/25 01:53, Christoph Hellwig wrote:
> On Fri, Mar 28, 2025 at 10:11:50AM -0700, Xin Li (Intel) wrote:
>> The per CPU array 'cea_exception_stacks' points to per CPU stacks
>> +/*
>> + * FRED introduced new fields in the host-state area of the VMCS for
>> + * stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
>> + * corresponding to per CPU stacks for #DB, NMI and #DF. KVM must
>> + * populate these each time a vCPU is loaded onto a CPU.
>> + */
>> +EXPORT_PER_CPU_SYMBOL(cea_exception_stacks);
> Exporting data vs accessors for it is usually a bad idea. Doing a
> non-_GPl for such a very low level data struture is even worse.
Big ack on this.
I don't even see a single caller of __this_cpu_ist_top_va() that's
remotely performance sensitive or that needs to be inline.
Just make the __this_cpu_ist_top/bottom_va() macros into real functions
and export __this_cpu_ist_top_va(). It's going to be a pretty tiny
function but I think that's tolerable.
* Re: [PATCH v4 04/19] x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use
2025-04-10 14:18 ` Dave Hansen
@ 2025-04-11 16:16 ` Xin Li
0 siblings, 0 replies; 44+ messages in thread
From: Xin Li @ 2025-04-11 16:16 UTC (permalink / raw)
To: Dave Hansen, Christoph Hellwig
Cc: pbonzini, seanjc, kvm, linux-doc, linux-kernel, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, andrew.cooper3, luto, peterz,
chao.gao, xin3.li
On 4/10/2025 7:18 AM, Dave Hansen wrote:
> On 4/10/25 01:53, Christoph Hellwig wrote:
>> On Fri, Mar 28, 2025 at 10:11:50AM -0700, Xin Li (Intel) wrote:
>>> The per CPU array 'cea_exception_stacks' points to per CPU stacks
>>> +/*
>>> + * FRED introduced new fields in the host-state area of the VMCS for
>>> + * stack levels 1->3 (HOST_IA32_FRED_RSP[123]), each respectively
>>> + * corresponding to per CPU stacks for #DB, NMI and #DF. KVM must
>>> + * populate these each time a vCPU is loaded onto a CPU.
>>> + */
>>> +EXPORT_PER_CPU_SYMBOL(cea_exception_stacks);
>> Exporting data vs accessors for it is usually a bad idea. Doing a
>> non-_GPL export for such a very low-level data structure is even worse.
>
> Big ack on this.
>
> I don't even see a single caller of __this_cpu_ist_top_va() that's
> remotely performance sensitive or that needs to be inline.
>
> Just make the __this_cpu_ist_top/bottom_va() macros into real functions
> and export __this_cpu_ist_top_va(). It's going to be a pretty tiny
> function but I think that's tolerable.
>
Right, that does make sense to me.
* Re: [PATCH v4 02/19] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config
2025-03-28 17:11 ` [PATCH v4 02/19] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config Xin Li (Intel)
@ 2025-04-14 7:41 ` Chao Gao
2025-04-14 16:53 ` Xin Li
0 siblings, 1 reply; 44+ messages in thread
From: Chao Gao @ 2025-04-14 7:41 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, seanjc, kvm, linux-doc, linux-kernel, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, andrew.cooper3, luto, peterz,
xin3.li
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index f1348b140e7c..e38545d0dd17 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -2634,12 +2634,15 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> { VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
> { VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
> { VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
>+ { VM_ENTRY_LOAD_IA32_FRED, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS },
This line should be removed. It enforces that "Activate secondary controls"
is supported iff FRED is supported, which isn't true.
Bit 3 of 2nd VM-exit controls is "Prematurely busy shadow stack". Some CPUs
support it, but not FRED.
* Re: [PATCH v4 02/19] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config
2025-04-14 7:41 ` Chao Gao
@ 2025-04-14 16:53 ` Xin Li
0 siblings, 0 replies; 44+ messages in thread
From: Xin Li @ 2025-04-14 16:53 UTC (permalink / raw)
To: Chao Gao
Cc: pbonzini, seanjc, kvm, linux-doc, linux-kernel, corbet, tglx,
mingo, bp, dave.hansen, x86, hpa, andrew.cooper3, luto, peterz,
xin3.li
On 4/14/2025 12:41 AM, Chao Gao wrote:
>> + { VM_ENTRY_LOAD_IA32_FRED, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS },
> This line should be removed. It enforces that "Activate secondary controls"
> is supported iff FRED is supported, which isn't true.
>
> Bit 3 of 2nd VM-exit controls is "Prematurely busy shadow stack". Some CPUs
> support it, but not FRED.
Sigh, 2nd time on the same shit.
* Re: [PATCH v4 03/19] KVM: VMX: Disable FRED if FRED consistency checks fail
2025-03-28 17:11 ` [PATCH v4 03/19] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
@ 2025-06-24 15:20 ` Sean Christopherson
0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2025-06-24 15:20 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Fri, Mar 28, 2025, Xin Li (Intel) wrote:
> From: Xin Li <xin3.li@intel.com>
>
> Do not virtualize FRED if FRED consistency checks fail.
>
> Either on broken hardware, or when running KVM on top of another hypervisor
> before the underlying hypervisor implements nested FRED correctly.
>
> Suggested-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Reviewed-by: Chao Gao <chao.gao@intel.com>
> ---
>
> Change in v4:
> * Call out the reason why not check FRED VM-exit controls in
> cpu_has_vmx_fred() (Chao Gao).
> ---
> arch/x86/kvm/vmx/capabilities.h | 11 +++++++++++
> arch/x86/kvm/vmx/vmx.c | 3 +++
> 2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index b2aefee59395..b4f49a4690ca 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -400,6 +400,17 @@ static inline bool vmx_pebs_supported(void)
> return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
> }
>
> +static inline bool cpu_has_vmx_fred(void)
> +{
> + /*
> + * setup_vmcs_config() guarantees FRED VM-entry/exit controls
> + * are either all set or none. So, no need to check FRED VM-exit
> + * controls.
> + */
> + return cpu_feature_enabled(X86_FEATURE_FRED) &&
Drop the cpu_feature_enabled(). These helpers are all about checking raw CPU
support; whether or not the kernel is configured to support FRED is irrelevant.
[For these helpers; KVM obviously needs to account for FRED support in other
paths, but that should be automagically handled by kvm_set_cpu_caps()]
> + (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED);
> +}
> +
> static inline bool cpu_has_notify_vmexit(void)
> {
> return vmcs_config.cpu_based_2nd_exec_ctrl &
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index e38545d0dd17..ab84939ace96 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8052,6 +8052,9 @@ static __init void vmx_set_cpu_caps(void)
> kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
> }
>
> + if (!cpu_has_vmx_fred())
> + kvm_cpu_cap_clear(X86_FEATURE_FRED);
> +
> if (!enable_pmu)
> kvm_cpu_cap_clear(X86_FEATURE_PDCM);
> kvm_caps.supported_perf_cap = vmx_get_perf_capabilities();
> --
> 2.48.1
>
* Re: [PATCH v4 06/19] KVM: VMX: Set FRED MSR interception
2025-03-28 17:11 ` [PATCH v4 06/19] KVM: VMX: Set FRED MSR interception Xin Li (Intel)
@ 2025-06-24 15:27 ` Sean Christopherson
0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2025-06-24 15:27 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Fri, Mar 28, 2025, Xin Li (Intel) wrote:
> @@ -7935,6 +7945,34 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> }
>
> +static void vmx_set_intercept_for_fred_msr(struct kvm_vcpu *vcpu)
> +{
This function should short-circuit on
if (!kvm_cpu_cap_has(X86_FEATURE_FRED))
return;
Functionally, it shouldn't matter. It's mostly for documentation purposes, and
to avoid doing unnecessary work.
> + bool flag = !guest_cpu_cap_has(vcpu, X86_FEATURE_FRED);
"flag" is unnecessarily ambiguous (eww, I see that the exiting PT code does that).
I like "set", as it has (hopefully) obvious polarity, and aligns with the function
being called.
> +
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP1, MSR_TYPE_RW, flag);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP2, MSR_TYPE_RW, flag);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP3, MSR_TYPE_RW, flag);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_STKLVLS, MSR_TYPE_RW, flag);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP1, MSR_TYPE_RW, flag);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP2, MSR_TYPE_RW, flag);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP3, MSR_TYPE_RW, flag);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_CONFIG, MSR_TYPE_RW, flag);
> +
> + /*
> + * IA32_FRED_RSP0 and IA32_PL0_SSP (a.k.a. IA32_FRED_SSP0) are only used
> + * for delivering events when running userspace, while KVM always runs in
> + * kernel mode (the CPL is always 0 after any VM exit), thus KVM can run
> + * safely with guest IA32_FRED_RSP0 and IA32_PL0_SSP.
> + *
> + * As a result, no need to intercept IA32_FRED_RSP0 and IA32_PL0_SSP.
> + *
> + * Note, save and restore of IA32_PL0_SSP belong to CET supervisor context
> + * management no matter whether FRED is enabled or not. So leave its
> + * state management to CET code.
> + */
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, flag);
> +}
* Re: [PATCH v4 07/19] KVM: VMX: Save/restore guest FRED RSP0
2025-03-28 17:11 ` [PATCH v4 07/19] KVM: VMX: Save/restore guest FRED RSP0 Xin Li (Intel)
@ 2025-06-24 15:44 ` Sean Christopherson
0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2025-06-24 15:44 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Fri, Mar 28, 2025, Xin Li (Intel) wrote:
> From: Xin Li <xin3.li@intel.com>
>
> Save guest FRED RSP0 in vmx_prepare_switch_to_host() and restore it
> in vmx_prepare_switch_to_guest() because MSR_IA32_FRED_RSP0 is passed
> through to the guest, thus is volatile/unknown.
>
> Note, host FRED RSP0 is restored in arch_exit_to_user_mode_prepare(),
> regardless of whether it is modified in KVM.
>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> ---
>
> Changes in v3:
> * KVM only needs to save/restore guest FRED RSP0 now as host FRED RSP0
> is restored in arch_exit_to_user_mode_prepare() (Sean Christopherson).
>
> Changes in v2:
> * Don't use guest_cpuid_has() in vmx_prepare_switch_to_{host,guest}(),
> which are called from IRQ-disabled context (Chao Gao).
> * Reset msr_guest_fred_rsp0 in __vmx_vcpu_reset() (Chao Gao).
> ---
> arch/x86/kvm/vmx/vmx.c | 9 +++++++++
> arch/x86/kvm/vmx/vmx.h | 1 +
> 2 files changed, 10 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 236fe5428a74..1fd32aa255f9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1349,6 +1349,10 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> }
>
> wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
> +
> + if (cpu_feature_enabled(X86_FEATURE_FRED) && guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
For these paths, I'm leaning towards omitting the cpu_feature_enabled() check.
The guest_cpu_cap_has() check should suffice, this isn't a super hot path, and
the cost of the runtime check will likely be a single, well-predicted uop when
FRED is unsupported (e.g. a fused BT+Jcc).
Unlike the MSR interception toggling, the "extra" work is negligible (and it's
somewhat confusing to check cpu_feature_enabled() instead of kvm_cpu_cap_has()).
> + wrmsrns(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
> +
> #else
> savesegment(fs, fs_sel);
> savesegment(gs, gs_sel);
> @@ -1393,6 +1397,11 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
> invalidate_tss_limit();
> #ifdef CONFIG_X86_64
> wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
> +
> + if (cpu_feature_enabled(X86_FEATURE_FRED) && guest_cpu_cap_has(&vmx->vcpu, X86_FEATURE_FRED)) {
> + vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
> + fred_sync_rsp0(vmx->msr_guest_fred_rsp0);
Can you add a comment here? Passing the guest value to fred_sync_rsp0() surprised
me a bit. The code and naming make sense after looking at everything, but it's
quite different from the surrounding code, e.g. the MSR_KERNEL_GS_BASE handling.
Something like this?
/*
* Synchronize the current value in hardware to the kernel's
* local cache. The desired host RSP0 will be set if/when the
* CPU exits to userspace (RSP0 is a per-task value).
*/
> + }
> #endif
> load_fixmap_gdt(raw_smp_processor_id());
> vmx->guest_state_loaded = false;
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index f48791cf6aa6..8e27b7cc700d 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -276,6 +276,7 @@ struct vcpu_vmx {
> #ifdef CONFIG_X86_64
> u64 msr_host_kernel_gs_base;
> u64 msr_guest_kernel_gs_base;
> + u64 msr_guest_fred_rsp0;
> #endif
>
> u64 spec_ctrl;
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore
2025-03-28 17:11 ` [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore Xin Li (Intel)
@ 2025-06-24 16:27 ` Sean Christopherson
2025-06-25 17:18 ` Xin Li
0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2025-06-24 16:27 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Fri, Mar 28, 2025, Xin Li (Intel) wrote:
> From: Xin Li <xin3.li@intel.com>
>
> Handle FRED MSR access requests, allowing FRED context to be set/get
> from both host and guest.
>
> During VM save/restore and live migration, FRED context needs to be
> saved/restored, which requires FRED MSRs to be accessed from userspace,
> e.g., Qemu.
>
> Note, handling of MSR_IA32_FRED_SSP0, i.e., MSR_IA32_PL0_SSP, is not
> added yet, which is done in the KVM CET patch set.
>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> ---
>
> Changes since v2:
> * Add a helper to convert FRED MSR index to VMCS field encoding to
> make the code more compact (Chao Gao).
> * Get rid of the "host_initiated" check because userspace has to set
> CPUID before MSRs (Chao Gao & Sean Christopherson).
> * Address a few cleanup comments (Sean Christopherson).
>
> Changes since v1:
> * Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
> * Fail host requested FRED MSRs access if KVM cannot virtualize FRED
> (Chao Gao).
> * Handle the case FRED MSRs are valid but KVM cannot virtualize FRED
> (Chao Gao).
> * Add sanity checks when writing to FRED MSRs.
> ---
> arch/x86/kvm/vmx/vmx.c | 48 ++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 28 ++++++++++++++++++++++++
> 2 files changed, 76 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 1fd32aa255f9..ae9712624413 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1426,6 +1426,24 @@ static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
> preempt_enable();
> vmx->msr_guest_kernel_gs_base = data;
> }
> +
> +static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx)
> +{
> + preempt_disable();
> + if (vmx->guest_state_loaded)
> + vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
> + preempt_enable();
> + return vmx->msr_guest_fred_rsp0;
> +}
> +
> +static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
> +{
> + preempt_disable();
> + if (vmx->guest_state_loaded)
> + wrmsrns(MSR_IA32_FRED_RSP0, data);
> + preempt_enable();
> + vmx->msr_guest_fred_rsp0 = data;
> +}
> #endif
Maybe add helpers to deal with the preemption stuff? Oh, never mind, FRED
uses WRMSRNS. Hmm, actually, can't these all be non-serializing? KVM is
propagating *guest* values to hardware, so a VM-Enter is guaranteed before the
CPU value can be consumed.
#ifdef CONFIG_X86_64
static u64 vmx_read_guest_host_msr(struct vcpu_vmx *vmx, u32 msr, u64 *cache)
{
preempt_disable();
if (vmx->guest_state_loaded)
*cache = read_msr(msr);
preempt_enable();
return *cache;
}
static void vmx_write_guest_host_msr(struct vcpu_vmx *vmx, u32 msr, u64 data,
u64 *cache)
{
preempt_disable();
if (vmx->guest_state_loaded)
wrmsrns(msr, data);
preempt_enable();
*cache = data;
}
static u64 vmx_read_guest_kernel_gs_base(struct vcpu_vmx *vmx)
{
return vmx_read_guest_host_msr(vmx, MSR_KERNEL_GS_BASE,
&vmx->msr_guest_kernel_gs_base);
}
static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
{
vmx_write_guest_host_msr(vmx, MSR_KERNEL_GS_BASE, data,
&vmx->msr_guest_kernel_gs_base);
}
static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx)
{
return vmx_read_guest_host_msr(vmx, MSR_IA32_FRED_RSP0,
&vmx->msr_guest_fred_rsp0);
}
static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
{
vmx_write_guest_host_msr(vmx, MSR_IA32_FRED_RSP0, data,
&vmx->msr_guest_fred_rsp0);
}
#endif
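The cache-or-hardware pattern in the suggested helpers can be modeled in plain C.
In this userspace sketch, the preempt_disable()/guest_state_loaded machinery is
reduced to a simple flag, and the MSR itself is a mock variable; none of the names
below (hw_msr, vcpu_state, read_guest_msr, write_guest_msr) come from the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Mock of a single hardware MSR (stand-in for read_msr()/wrmsrns()). */
static uint64_t hw_msr;

struct vcpu_state {
	int guest_state_loaded;	/* is the guest value currently live in hardware? */
	uint64_t msr_cache;	/* software copy of the guest value */
};

/* Read: refresh the cache from hardware only if the guest value is live. */
static uint64_t read_guest_msr(struct vcpu_state *v)
{
	if (v->guest_state_loaded)
		v->msr_cache = hw_msr;
	return v->msr_cache;
}

/* Write: always update the cache; touch hardware only if the value is live. */
static void write_guest_msr(struct vcpu_state *v, uint64_t data)
{
	if (v->guest_state_loaded)
		hw_msr = data;
	v->msr_cache = data;
}
```

Either way the cache ends up authoritative once guest state is unloaded, which is
exactly why vmx_prepare_switch_to_host() must snapshot the MSR back into the cache.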
> static void grow_ple_window(struct kvm_vcpu *vcpu)
> @@ -2039,6 +2057,24 @@ int vmx_get_feature_msr(u32 msr, u64 *data)
> }
> }
>
> +#ifdef CONFIG_X86_64
> +static u32 fred_msr_vmcs_fields[] = {
This should be const.
> + GUEST_IA32_FRED_RSP1,
> + GUEST_IA32_FRED_RSP2,
> + GUEST_IA32_FRED_RSP3,
> + GUEST_IA32_FRED_STKLVLS,
> + GUEST_IA32_FRED_SSP1,
> + GUEST_IA32_FRED_SSP2,
> + GUEST_IA32_FRED_SSP3,
> + GUEST_IA32_FRED_CONFIG,
> +};
I think it also makes sense to add a static_assert() here, more so to help
readers follow along than anything else.
static_assert(MSR_IA32_FRED_CONFIG - MSR_IA32_FRED_RSP1 ==
ARRAY_SIZE(fred_msr_vmcs_fields) - 1);
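The suggested static_assert() pattern can be tried standalone. The MSR indices and
VMCS field encodings below are made-up placeholder values (the real ones differ);
the point is only to show how the array-size check ties the contiguous MSR range to
the lookup table at build time:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values -- the real MSR indices and VMCS encodings differ. */
#define MSR_FRED_RSP1		0x1cc
#define MSR_FRED_RSP2		0x1cd
#define MSR_FRED_RSP3		0x1ce
#define MSR_FRED_STKLVLS	0x1cf
#define MSR_FRED_SSP1		0x1d0
#define MSR_FRED_SSP2		0x1d1
#define MSR_FRED_SSP3		0x1d2
#define MSR_FRED_CONFIG		0x1d3

static const uint32_t fred_msr_vmcs_fields[] = {
	0x2c00, 0x2c02, 0x2c04, 0x2c06, 0x2c08, 0x2c0a, 0x2c0c, 0x2c0e,
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Build-time proof that the MSR range and the table stay in sync. */
static_assert(MSR_FRED_CONFIG - MSR_FRED_RSP1 ==
	      ARRAY_SIZE(fred_msr_vmcs_fields) - 1,
	      "fred_msr_vmcs_fields out of sync with MSR range");

static uint32_t fred_msr_to_vmcs(uint32_t msr)
{
	return fred_msr_vmcs_fields[msr - MSR_FRED_RSP1];
}
```

If an MSR is added to (or dropped from) the range without updating the table, the
static_assert() fires at compile time instead of the lookup silently going out of
bounds.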
> +
> +static u32 fred_msr_to_vmcs(u32 msr)
> +{
> + return fred_msr_vmcs_fields[msr - MSR_IA32_FRED_RSP1];
> +}
> +#endif
> +
> /*
> * Reads an msr value (of 'msr_info->index') into 'msr_info->data'.
> * Returns 0 on success, non-0 otherwise.
> @@ -2061,6 +2097,12 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_KERNEL_GS_BASE:
> msr_info->data = vmx_read_guest_kernel_gs_base(vmx);
> break;
> + case MSR_IA32_FRED_RSP0:
> + msr_info->data = vmx_read_guest_fred_rsp0(vmx);
> + break;
> + case MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG:
> + msr_info->data = vmcs_read64(fred_msr_to_vmcs(msr_info->index));
> + break;
> #endif
> case MSR_EFER:
> return kvm_get_msr_common(vcpu, msr_info);
> @@ -2268,6 +2310,12 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vmx_update_exception_bitmap(vcpu);
> }
> break;
> + case MSR_IA32_FRED_RSP0:
> + vmx_write_guest_fred_rsp0(vmx, data);
> + break;
> + case MSR_IA32_FRED_RSP1 ... MSR_IA32_FRED_CONFIG:
> + vmcs_write64(fred_msr_to_vmcs(msr_index), data);
> + break;
> #endif
> case MSR_IA32_SYSENTER_CS:
> if (is_guest_mode(vcpu))
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c841817a914a..007577143337 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -318,6 +318,9 @@ static const u32 msrs_to_save_base[] = {
> MSR_STAR,
> #ifdef CONFIG_X86_64
> MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
> + MSR_IA32_FRED_RSP0, MSR_IA32_FRED_RSP1, MSR_IA32_FRED_RSP2,
> + MSR_IA32_FRED_RSP3, MSR_IA32_FRED_STKLVLS, MSR_IA32_FRED_SSP1,
> + MSR_IA32_FRED_SSP2, MSR_IA32_FRED_SSP3, MSR_IA32_FRED_CONFIG,
> #endif
> MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
> MSR_IA32_FEAT_CTL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
> @@ -1849,6 +1852,23 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
>
> data = (u32)data;
> break;
> + case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
> + if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
> + return 1;
Yeesh, this is a bit of a no-win situation. Having to re-check the MSR index is
no fun, but the amount of overlap between MSRs is significant, i.e. I see why you
bundled everything together. Ugh, and MSR_IA32_FRED_STKLVLS is buried smack dab
in the middle of everything.
> +
> + /* Bit 11, bits 5:4, and bit 2 of the IA32_FRED_CONFIG must be zero */
Eh, the comment isn't helping much. If we want to add more documentation, add
#defines. But I think we can document the reserved behavior while also tidying
up the code a bit.
After much fiddling, how about this?
case MSR_IA32_FRED_STKLVLS:
if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
return 1;
break;
case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_RSP3:
case MSR_IA32_FRED_SSP1 ... MSR_IA32_FRED_CONFIG: {
u64 reserved_bits;
if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
return 1;
if (is_noncanonical_msr_address(data, vcpu))
return 1;
switch (index) {
case MSR_IA32_FRED_CONFIG:
reserved_bits = BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2);
break;
case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_RSP3:
reserved_bits = GENMASK_ULL(5, 0);
break;
case MSR_IA32_FRED_SSP1 ... MSR_IA32_FRED_SSP3:
reserved_bits = GENMASK_ULL(2, 0);
break;
default:
WARN_ON_ONCE(1);
return 1;
}
if (data & reserved_bits)
return 1;
break;
}
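The reserved-bit logic of that suggestion can be checked in isolation. This
userspace sketch defines local stand-ins for the kernel's BIT_ULL()/GENMASK_ULL()
helpers and collapses the MSR index ranges into an enum (fred_reserved_bits and
fred_msr_write_invalid are invented names, not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's BIT_ULL()/GENMASK_ULL() helpers. */
#define BIT_ULL(n)		(1ULL << (n))
#define GENMASK_ULL(h, l)	(((~0ULL) >> (63 - (h))) & ~((1ULL << (l)) - 1))

enum fred_msr_class { FRED_RSP, FRED_SSP, FRED_CONFIG };

/* Reserved-bit mask per MSR class, mirroring the inner switch above. */
static uint64_t fred_reserved_bits(enum fred_msr_class type)
{
	switch (type) {
	case FRED_CONFIG:
		return BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2);
	case FRED_RSP:
		return GENMASK_ULL(5, 0);	/* RSPx must be 64-byte aligned */
	case FRED_SSP:
		return GENMASK_ULL(2, 0);	/* SSPx must be 8-byte aligned */
	}
	return ~0ULL;
}

/* Returns nonzero (i.e. reject the write) if any reserved bit is set. */
static int fred_msr_write_invalid(enum fred_msr_class type, uint64_t data)
{
	return (data & fred_reserved_bits(type)) != 0;
}
```

Splitting MSR_IA32_FRED_STKLVLS out into its own case also falls out naturally
here: it has no reserved bits and no canonicality requirement, so it never reaches
the mask check at all.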
> @@ -1893,6 +1913,10 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> !guest_cpu_cap_has(vcpu, X86_FEATURE_RDPID))
> return 1;
> break;
> + case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
> + if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
> + return 1;
> + break;
> }
>
> msr.index = index;
> @@ -7455,6 +7479,10 @@ static void kvm_probe_msr_to_save(u32 msr_index)
> if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
> return;
> break;
> + case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
> + if (!kvm_cpu_cap_has(X86_FEATURE_FRED))
> + return;
> + break;
> default:
> break;
> }
> --
> 2.48.1
>
* Re: [PATCH v4 14/19] KVM: VMX: Dump FRED context in dump_vmcs()
2025-03-28 17:12 ` [PATCH v4 14/19] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
@ 2025-06-24 16:32 ` Sean Christopherson
2025-06-25 17:38 ` Xin Li
0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2025-06-24 16:32 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Fri, Mar 28, 2025, Xin Li (Intel) wrote:
> From: Xin Li <xin3.li@intel.com>
>
> Add FRED related VMCS fields to dump_vmcs() to dump FRED context.
>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> ---
>
> Change in v3:
> * Use (vmentry_ctrl & VM_ENTRY_LOAD_IA32_FRED) instead of is_fred_enabled()
> (Chao Gao).
>
> Changes in v2:
> * Use kvm_cpu_cap_has() instead of cpu_feature_enabled() (Chao Gao).
> * Dump guest FRED states only if guest has FRED enabled (Nikolay Borisov).
> ---
> arch/x86/kvm/vmx/vmx.c | 40 +++++++++++++++++++++++++++++++++-------
> 1 file changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c76015e1e3f8..03855d6690b2 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6462,7 +6462,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> u32 vmentry_ctl, vmexit_ctl;
> u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
> - u64 tertiary_exec_control;
> + u64 tertiary_exec_control, secondary_vmexit_ctl;
> unsigned long cr4;
> int efer_slot;
>
> @@ -6473,6 +6473,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>
> vmentry_ctl = vmcs_read32(VM_ENTRY_CONTROLS);
> vmexit_ctl = vmcs_read32(VM_EXIT_CONTROLS);
> + secondary_vmexit_ctl = cpu_has_secondary_vmexit_ctrls() ?
> + vmcs_read64(SECONDARY_VM_EXIT_CONTROLS) : 0;
> cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
> cr4 = vmcs_readl(GUEST_CR4);
> @@ -6519,6 +6521,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> vmx_dump_sel("LDTR:", GUEST_LDTR_SELECTOR);
> vmx_dump_dtsel("IDTR:", GUEST_IDTR_LIMIT);
> vmx_dump_sel("TR: ", GUEST_TR_SELECTOR);
> + if (vmentry_ctl & VM_ENTRY_LOAD_IA32_FRED)
> + pr_err("FRED guest: config=0x%016llx, stack_levels=0x%016llx\n"
> + "RSP0=0x%016llx, RSP1=0x%016llx\n"
> + "RSP2=0x%016llx, RSP3=0x%016llx\n",
> + vmcs_read64(GUEST_IA32_FRED_CONFIG),
> + vmcs_read64(GUEST_IA32_FRED_STKLVLS),
> + __rdmsr(MSR_IA32_FRED_RSP0),
There is no guarantee the vCPU's FRED_RSP is loaded in hardware at this point.
I think you need to use vmx_read_guest_fred_rsp0().
> + vmcs_read64(GUEST_IA32_FRED_RSP1),
> + vmcs_read64(GUEST_IA32_FRED_RSP2),
> + vmcs_read64(GUEST_IA32_FRED_RSP3));
> efer_slot = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest, MSR_EFER);
> if (vmentry_ctl & VM_ENTRY_LOAD_IA32_EFER)
> pr_err("EFER= 0x%016llx\n", vmcs_read64(GUEST_IA32_EFER));
> @@ -6566,6 +6578,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> vmcs_readl(HOST_TR_BASE));
> pr_err("GDTBase=%016lx IDTBase=%016lx\n",
> vmcs_readl(HOST_GDTR_BASE), vmcs_readl(HOST_IDTR_BASE));
> + if (vmexit_ctl & SECONDARY_VM_EXIT_LOAD_IA32_FRED)
> + pr_err("FRED host: config=0x%016llx, stack_levels=0x%016llx\n"
> + "RSP0=0x%016lx, RSP1=0x%016llx\n"
> + "RSP2=0x%016llx, RSP3=0x%016llx\n",
> + vmcs_read64(HOST_IA32_FRED_CONFIG),
> + vmcs_read64(HOST_IA32_FRED_STKLVLS),
> + (unsigned long)task_stack_page(current) + THREAD_SIZE,
Maybe add a helper in arch/x86/include/asm/fred.h to generate the desired RSP0?
Not sure it's worth doing that just for this code.
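If such a helper were added, the underlying computation is trivial to model: the
host FRED RSP0 for a task is simply the top of its kernel stack. The helper name
and the fixed THREAD_SIZE value below are hypothetical stand-ins, not kernel API:

```c
#include <assert.h>

/* 16 KiB is a typical x86-64 THREAD_SIZE with 4K pages; config-dependent. */
#define THREAD_SIZE	(16 * 1024)

/*
 * Hypothetical helper: given the base address of a task's stack pages
 * (what task_stack_page() returns in the kernel), the FRED RSP0 is the
 * address one byte past the top of the stack.
 */
static unsigned long fred_task_rsp0(unsigned long stack_page)
{
	return stack_page + THREAD_SIZE;
}
```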
> + vmcs_read64(HOST_IA32_FRED_RSP1),
> + vmcs_read64(HOST_IA32_FRED_RSP2),
> + vmcs_read64(HOST_IA32_FRED_RSP3));
> pr_err("CR0=%016lx CR3=%016lx CR4=%016lx\n",
> vmcs_readl(HOST_CR0), vmcs_readl(HOST_CR3),
> vmcs_readl(HOST_CR4));
> @@ -6587,25 +6609,29 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> pr_err("*** Control State ***\n");
> pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
> cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
> - pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
> - pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
> + pr_err("PinBased=0x%08x EntryControls=0x%08x\n",
> + pin_based_exec_ctrl, vmentry_ctl);
> + pr_err("ExitControls=0x%08x SecondaryExitControls=0x%016llx\n",
> + vmexit_ctl, secondary_vmexit_ctl);
> pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
> vmcs_read32(EXCEPTION_BITMAP),
> vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
> vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH));
> - pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x\n",
> + pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x event_data=%016llx\n",
> vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE),
> - vmcs_read32(VM_ENTRY_INSTRUCTION_LEN));
> + vmcs_read32(VM_ENTRY_INSTRUCTION_LEN),
> + kvm_cpu_cap_has(X86_FEATURE_FRED) ? vmcs_read64(INJECTED_EVENT_DATA) : 0);
> pr_err("VMExit: intr_info=%08x errcode=%08x ilen=%08x\n",
> vmcs_read32(VM_EXIT_INTR_INFO),
> vmcs_read32(VM_EXIT_INTR_ERROR_CODE),
> vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
> pr_err(" reason=%08x qualification=%016lx\n",
> vmcs_read32(VM_EXIT_REASON), vmcs_readl(EXIT_QUALIFICATION));
> - pr_err("IDTVectoring: info=%08x errcode=%08x\n",
> + pr_err("IDTVectoring: info=%08x errcode=%08x event_data=%016llx\n",
> vmcs_read32(IDT_VECTORING_INFO_FIELD),
> - vmcs_read32(IDT_VECTORING_ERROR_CODE));
> + vmcs_read32(IDT_VECTORING_ERROR_CODE),
> + kvm_cpu_cap_has(X86_FEATURE_FRED) ? vmcs_read64(ORIGINAL_EVENT_DATA) : 0);
> pr_err("TSC Offset = 0x%016llx\n", vmcs_read64(TSC_OFFSET));
> if (secondary_exec_control & SECONDARY_EXEC_TSC_SCALING)
> pr_err("TSC Multiplier = 0x%016llx\n",
> --
> 2.48.1
>
* Re: [PATCH v4 15/19] KVM: x86: Allow FRED/LKGS to be advertised to guests
2025-03-28 17:12 ` [PATCH v4 15/19] KVM: x86: Allow FRED/LKGS to be advertised to guests Xin Li (Intel)
@ 2025-06-24 16:38 ` Sean Christopherson
2025-06-25 18:05 ` Xin Li
0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2025-06-24 16:38 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
The shortlog (and changelog intro) are wrong. KVM isn't allowing FRED/LKGS to
be advertised to the guest. Userspace can advertise whatever it wants. The guest
will break badly without KVM support, but that doesn't stop userspace from
advertising a bogus vCPU model.
KVM: x86: Advertise support for FRED/LKGS to userspace
On Fri, Mar 28, 2025, Xin Li (Intel) wrote:
> From: Xin Li <xin3.li@intel.com>
>
> Allow FRED/LKGS to be advertised to guests after changes required to
Please explain what LKGS is early in the changelog. I assumed it was a feature
of sorts; turns out it's a new instruction.
Actually, why wait this long to enumerate support for LKGS? I.e. why not have a
patch at the head of the series to enumerate support for LKGS? IIUC, LKGS doesn't
depend on FRED.
> enable FRED in a KVM guest are in place.
>
> LKGS is introduced with FRED to completely eliminate the need to swapgs
> explicitly, because
>
> 1) FRED transitions ensure that an operating system can always operate
> with its own GS base address.
>
> 2) LKGS behaves like the MOV to GS instruction except that it loads
> the base address into the IA32_KERNEL_GS_BASE MSR instead of the
> GS segment’s descriptor cache, which is exactly what Linux kernel
> does to load a user level GS base. Thus there is no need to SWAPGS
> away from the kernel GS base and an execution of SWAPGS causes #UD
> if FRED transitions are enabled.
>
> A FRED CPU must enumerate LKGS. When LKGS is not available, FRED must
> not be enabled.
>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> ---
> arch/x86/kvm/cpuid.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 5e4d4934c0d3..8f290273aee1 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -992,6 +992,8 @@ void kvm_set_cpu_caps(void)
> F(FZRM),
> F(FSRS),
> F(FSRC),
> + F(FRED),
> + F(LKGS),
These need to be X86_64_F, no?
> F(AMX_FP16),
> F(AVX_IFMA),
> F(LAM),
> --
> 2.48.1
>
* Re: [PATCH v4 16/19] KVM: nVMX: Add support for the secondary VM exit controls
2025-03-28 17:12 ` [PATCH v4 16/19] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
@ 2025-06-24 16:54 ` Sean Christopherson
0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2025-06-24 16:54 UTC (permalink / raw)
To: Xin Li (Intel)
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Fri, Mar 28, 2025, Xin Li (Intel) wrote:
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index b4f49a4690ca..d29be4e4124e 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -38,6 +38,7 @@ struct nested_vmx_msrs {
> u32 pinbased_ctls_high;
> u32 exit_ctls_low;
> u32 exit_ctls_high;
> + u64 secondary_exit_ctls;
> u32 entry_ctls_low;
> u32 entry_ctls_high;
> u32 misc_low;
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 5504d9e9fd32..8b0c5e5f1e98 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1457,6 +1457,7 @@ int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
> case MSR_IA32_VMX_PINBASED_CTLS:
> case MSR_IA32_VMX_PROCBASED_CTLS:
> case MSR_IA32_VMX_EXIT_CTLS:
> + case MSR_IA32_VMX_EXIT_CTLS2:
This is wrong. KVM allows userspace to configure control MSRs, it's just the
non-true MSRs that have a true version that KVM rejects. I.e. KVM needs to
actually handle writing MSR_IA32_VMX_EXIT_CTLS2.
> case MSR_IA32_VMX_ENTRY_CTLS:
> /*
> * The "non-true" VMX capability MSRs are generated from the
* Re: [PATCH v4 00/19] Enable FRED with KVM VMX
2025-03-28 17:25 ` [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li
@ 2025-06-24 17:06 ` Sean Christopherson
2025-06-24 17:43 ` Xin Li
0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2025-06-24 17:06 UTC (permalink / raw)
To: Xin Li
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Fri, Mar 28, 2025, Xin Li wrote:
> Any chance we could merge FRED ahead of CET?
Probably not? CET exists in publicly available CPUs. AFAIK, FRED does not.
And CET is (/knock wood) hopefully pretty much ready? FWIW, I'd really like to
get both CET and FRED virtualization landed by 6.18, i.e. in time for the next
LTS.
* Re: [PATCH v4 00/19] Enable FRED with KVM VMX
2025-06-24 17:06 ` Sean Christopherson
@ 2025-06-24 17:43 ` Xin Li
2025-06-24 17:47 ` H. Peter Anvin
0 siblings, 1 reply; 44+ messages in thread
From: Xin Li @ 2025-06-24 17:43 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On 6/24/2025 10:06 AM, Sean Christopherson wrote:
> On Fri, Mar 28, 2025, Xin Li wrote:
>> Any chance we could merge FRED ahead of CET?
>
> Probably not? CET exists in publicly available CPUs. AFAIK, FRED does not.
Better not, as you said it creates extra effort because FRED does lean a
bit on CET.
I was a bit worried that CET would take longer than expected...
> And CET is (/knock wood) hopefully pretty much ready? FWIW, I'd really like to
That is also my reading on CET.
> get both CET and FRED virtualization landed by 6.18, i.e. in time for the next
> LTS.
I love the plan!
FRED is my top priority. I’ll address all your comments, rebase onto
kvm-x86/next (you have not updated yet :) ), and send out v5 at an
appropriate time.
Thanks!
Xin
* Re: [PATCH v4 00/19] Enable FRED with KVM VMX
2025-06-24 17:43 ` Xin Li
@ 2025-06-24 17:47 ` H. Peter Anvin
2025-06-24 18:02 ` Xin Li
0 siblings, 1 reply; 44+ messages in thread
From: H. Peter Anvin @ 2025-06-24 17:47 UTC (permalink / raw)
To: Xin Li, Sean Christopherson
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, andrew.cooper3, luto, peterz, chao.gao, xin3.li
On June 24, 2025 10:43:26 AM PDT, Xin Li <xin@zytor.com> wrote:
>On 6/24/2025 10:06 AM, Sean Christopherson wrote:
>> On Fri, Mar 28, 2025, Xin Li wrote:
>>> Any chance we could merge FRED ahead of CET?
>>
>> Probably not? CET exists in publicly available CPUs. AFAIK, FRED does not.
>
>Better not, as you said it creates extra effort because FRED does lean a
>bit on CET.
>
>I was a bit worried that CET would take longer time than expected...
>
>> And CET is (/knock wood) hopefully pretty much ready? FWIW, I'd really like to
>
>That is also my reading on CET.
>
>> get both CET and FRED virtualization landed by 6.18, i.e. in time for the next
>> LTS.
>
>I love the plan!
>
>FRED is my top priority. I’ll address all your comments, rebase onto
>kvm-x86/next (you have not updated yet :) ), and send out v5 at an
>appropriate time.
>
>Thanks!
> Xin
>
FRED doesn't lean on CET... one could argue it leans on LASS, at least to a small extent, though.
* Re: [PATCH v4 00/19] Enable FRED with KVM VMX
2025-06-24 17:47 ` H. Peter Anvin
@ 2025-06-24 18:02 ` Xin Li
2025-06-24 18:40 ` H. Peter Anvin
0 siblings, 1 reply; 44+ messages in thread
From: Xin Li @ 2025-06-24 18:02 UTC (permalink / raw)
To: H. Peter Anvin, Sean Christopherson
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, andrew.cooper3, luto, peterz, chao.gao, xin3.li
On 6/24/2025 10:47 AM, H. Peter Anvin wrote:
> FRED doesn't lean on CET... one could argue it leans on LASS, at least to a small extent, though.
Probably I used the wrong verb "lean"; "overlap" is better.
* Re: [PATCH v4 00/19] Enable FRED with KVM VMX
2025-06-24 18:02 ` Xin Li
@ 2025-06-24 18:40 ` H. Peter Anvin
0 siblings, 0 replies; 44+ messages in thread
From: H. Peter Anvin @ 2025-06-24 18:40 UTC (permalink / raw)
To: Xin Li, Sean Christopherson
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, andrew.cooper3, luto, peterz, chao.gao, xin3.li
On June 24, 2025 11:02:41 AM PDT, Xin Li <xin@zytor.com> wrote:
>On 6/24/2025 10:47 AM, H. Peter Anvin wrote:
>> FRED doesn't lean on CET... one could argue it leans on LASS, at least to a small extent, though.
>
>Probably I used a wrong verb "lean", "overlap" is better.
I would personally say that, to the extent there is overlap, it goes in the opposite direction (FRED helps enable kCET, but uCET is pretty much orthogonal).
* Re: [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore
2025-06-24 16:27 ` Sean Christopherson
@ 2025-06-25 17:18 ` Xin Li
2025-06-26 17:22 ` Xin Li
0 siblings, 1 reply; 44+ messages in thread
From: Xin Li @ 2025-06-25 17:18 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On 6/24/2025 9:27 AM, Sean Christopherson wrote:
>> +
>> +static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx)
>> +{
>> + preempt_disable();
>> + if (vmx->guest_state_loaded)
>> + vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
>> + preempt_enable();
>> + return vmx->msr_guest_fred_rsp0;
>> +}
>> +
>> +static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
>> +{
>> + preempt_disable();
>> + if (vmx->guest_state_loaded)
>> + wrmsrns(MSR_IA32_FRED_RSP0, data);
>> + preempt_enable();
>> + vmx->msr_guest_fred_rsp0 = data;
>> +}
>> #endif
>
> Maybe add helpers to deal with the preemption stuff? Oh, never mind, FRED
This is a good idea.
Do you want to upstream the following patch?
So I can rebase this patch on top of it in the next iteration.
> uses WRMSRNS. Hmm, actually, can't these all be non-serializing? KVM is
> progating *guest* values to hardware, so a VM-Enter is guaranteed before the
> CPU value can be consumed.
I see your point. It seems that only a new MSR write instruction could
achieve this: one that consistently performs a non-serializing write to an
MSR on the assumption that the target is a guest MSR. Software would need
to explicitly tell the CPU whether the target MSR is a host or guest MSR.
(WRMSRNS writes to an MSR in either a serializing or non-serializing
manner, based only on its index.)
>
> #ifdef CONFIG_X86_64
> static u64 vmx_read_guest_host_msr(struct vcpu_vmx *vmx, u32 msr, u64 *cache)
> {
> preempt_disable();
> if (vmx->guest_state_loaded)
> *cache = read_msr(msr);
> preempt_enable();
> return *cache;
> }
>
> static void vmx_write_guest_host_msr(struct vcpu_vmx *vmx, u32 msr, u64 data,
> u64 *cache)
> {
> preempt_disable();
> if (vmx->guest_state_loaded)
> wrmsrns(msr, data);
> preempt_enable();
> *cache = data;
> }
>
> static u64 vmx_read_guest_kernel_gs_base(struct vcpu_vmx *vmx)
> {
> return vmx_read_guest_host_msr(vmx, MSR_KERNEL_GS_BASE,
> &vmx->msr_guest_kernel_gs_base);
> }
>
> static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
> {
> vmx_write_guest_host_msr(vmx, MSR_KERNEL_GS_BASE, data,
> &vmx->msr_guest_kernel_gs_base);
> }
>
> static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx)
> {
> return vmx_read_guest_host_msr(vmx, MSR_IA32_FRED_RSP0,
> &vmx->msr_guest_fred_rsp0);
> }
>
> static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
> {
> vmx_write_guest_host_msr(vmx, MSR_IA32_FRED_RSP0, data,
> &vmx->msr_guest_fred_rsp0);
> }
> #endif
>
>> +#ifdef CONFIG_X86_64
>> +static u32 fred_msr_vmcs_fields[] = {
>
> This should be const.
Will add.
>
>> + GUEST_IA32_FRED_RSP1,
>> + GUEST_IA32_FRED_RSP2,
>> + GUEST_IA32_FRED_RSP3,
>> + GUEST_IA32_FRED_STKLVLS,
>> + GUEST_IA32_FRED_SSP1,
>> + GUEST_IA32_FRED_SSP2,
>> + GUEST_IA32_FRED_SSP3,
>> + GUEST_IA32_FRED_CONFIG,
>> +};
>
> I think it also makes sense to add a static_assert() here, more so to help
> readers follow along than anything else.
>
> static_assert(MSR_IA32_FRED_CONFIG - MSR_IA32_FRED_RSP1 ==
> ARRAY_SIZE(fred_msr_vmcs_fields) - 1);
Good idea!
I tried to make fred_msr_to_vmcs() fail at build time, but couldn’t get
it to work.
>
>> +
>> +static u32 fred_msr_to_vmcs(u32 msr)
>> +{
>> + return fred_msr_vmcs_fields[msr - MSR_IA32_FRED_RSP1];
>> +}
>> +#endif
>> +
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> @@ -1849,6 +1852,23 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
>>
>> data = (u32)data;
>> break;
>> + case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
>> + if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
>> + return 1;
>
> Yeesh, this is a bit of a no-win situation. Having to re-check the MSR index is
> no fun, but the amount of overlap between MSRs is significant, i.e. I see why you
> bundled everything together. Ugh, and MSR_IA32_FRED_STKLVLS is buried smack dab
> in the middle of everything.
>
>> +
>> + /* Bit 11, bits 5:4, and bit 2 of the IA32_FRED_CONFIG must be zero */
>
> Eh, the comment isn't helping much. If we want to add more documentation, add
> #defines. But I think we can documented the reserved behavior while also tidying
> up the code a bit.
>
> After much fiddling, how about this?
>
> case MSR_IA32_FRED_STKLVLS:
> if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
> return 1;
> break;
>
> case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_RSP3:
> case MSR_IA32_FRED_SSP1 ... MSR_IA32_FRED_CONFIG: {
> u64 reserved_bits;
>
> if (!guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
> return 1;
>
> if (is_noncanonical_msr_address(data, vcpu))
> return 1;
>
> switch (index) {
> case MSR_IA32_FRED_CONFIG:
> reserved_bits = BIT_ULL(11) | GENMASK_ULL(5, 4) | BIT_ULL(2);
> break;
> case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_RSP3:
> reserved_bits = GENMASK_ULL(5, 0);
> break;
> case MSR_IA32_FRED_SSP1 ... MSR_IA32_FRED_SSP3:
> reserved_bits = GENMASK_ULL(2, 0);
> break;
> default:
> WARN_ON_ONCE(1);
> return 1;
> }
> if (data & reserved_bits)
> return 1;
> break;
> }
>
Easier to read, I will use it :)
Thanks!
Xin
* Re: [PATCH v4 14/19] KVM: VMX: Dump FRED context in dump_vmcs()
2025-06-24 16:32 ` Sean Christopherson
@ 2025-06-25 17:38 ` Xin Li
0 siblings, 0 replies; 44+ messages in thread
From: Xin Li @ 2025-06-25 17:38 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On 6/24/2025 9:32 AM, Sean Christopherson wrote:
>> @@ -6519,6 +6521,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>> vmx_dump_sel("LDTR:", GUEST_LDTR_SELECTOR);
>> vmx_dump_dtsel("IDTR:", GUEST_IDTR_LIMIT);
>> vmx_dump_sel("TR: ", GUEST_TR_SELECTOR);
>> + if (vmentry_ctl & VM_ENTRY_LOAD_IA32_FRED)
>> + pr_err("FRED guest: config=0x%016llx, stack_levels=0x%016llx\n"
>> + "RSP0=0x%016llx, RSP1=0x%016llx\n"
>> + "RSP2=0x%016llx, RSP3=0x%016llx\n",
>> + vmcs_read64(GUEST_IA32_FRED_CONFIG),
>> + vmcs_read64(GUEST_IA32_FRED_STKLVLS),
>> + __rdmsr(MSR_IA32_FRED_RSP0),
>
> There is no guarantee the vCPU's FRED_RSP is loaded in hardware at this point.
> I think you need to use vmx_read_guest_fred_rsp0().
Good catch.
>
>> + vmcs_read64(GUEST_IA32_FRED_RSP1),
>> + vmcs_read64(GUEST_IA32_FRED_RSP2),
>> + vmcs_read64(GUEST_IA32_FRED_RSP3));
>> efer_slot = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest, MSR_EFER);
>> if (vmentry_ctl & VM_ENTRY_LOAD_IA32_EFER)
>> pr_err("EFER= 0x%016llx\n", vmcs_read64(GUEST_IA32_EFER));
>> @@ -6566,6 +6578,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>> vmcs_readl(HOST_TR_BASE));
>> pr_err("GDTBase=%016lx IDTBase=%016lx\n",
>> vmcs_readl(HOST_GDTR_BASE), vmcs_readl(HOST_IDTR_BASE));
>> + if (vmexit_ctl & SECONDARY_VM_EXIT_LOAD_IA32_FRED)
>> + pr_err("FRED host: config=0x%016llx, stack_levels=0x%016llx\n"
>> + "RSP0=0x%016lx, RSP1=0x%016llx\n"
>> + "RSP2=0x%016llx, RSP3=0x%016llx\n",
>> + vmcs_read64(HOST_IA32_FRED_CONFIG),
>> + vmcs_read64(HOST_IA32_FRED_STKLVLS),
>> + (unsigned long)task_stack_page(current) + THREAD_SIZE,
>
> Maybe add a helper in arch/x86/include/asm/fred.h to generate the desired RSP0?
> Not sure it's worth doing that just for this code.
It's not just one usage. I checked with:
git grep -w task_stack_page | grep THREAD_SIZE | wc -l
and got 25.
However, the pattern is also used by other architectures, so I'll work
on a common helper in parallel; i.e., I likely won't change it in the
next iteration.
Thanks!
Xin
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v4 15/19] KVM: x86: Allow FRED/LKGS to be advertised to guests
2025-06-24 16:38 ` Sean Christopherson
@ 2025-06-25 18:05 ` Xin Li
2025-06-25 18:29 ` Sean Christopherson
0 siblings, 1 reply; 44+ messages in thread
From: Xin Li @ 2025-06-25 18:05 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On 6/24/2025 9:38 AM, Sean Christopherson wrote:
> The shortlog (and changelog intro) are wrong. KVM isn't allowing FRED/LKGS to
> be advertised to the guest. Userspace can advertise whatever it wants. The guest
> will break badly without KVM support, but that doesn't stop userspace from
> advertising a bogus vCPU model.
>
> KVM: x86: Advertise support for FRED/LKGS to userspace
>
> On Fri, Mar 28, 2025, Xin Li (Intel) wrote:
>> From: Xin Li <xin3.li@intel.com>
>>
>> Allow FRED/LKGS to be advertised to guests after changes required to
>
> Please explain what LKGS is early in the changelog. I assumed it was a feature
> of sorts; turns out it's a new instruction.
>
> Actually, why wait this long to enumerate support for LKGS? I.e. why not have a
> patch at the head of the series to enumerate support for LKGS? IIUC, LKGS doesn't
> depend on FRED.
I will send LKGS as a separate patch, so if you prefer you can take it
before the KVM FRED patch set.
>
>> enable FRED in a KVM guest are in place.
>>
>> LKGS is introduced with FRED to completely eliminate the need to swapgs
>> explicitly, because
>>
>> 1) FRED transitions ensure that an operating system can always operate
>> with its own GS base address.
>>
>> 2) LKGS behaves like the MOV to GS instruction except that it loads
>> the base address into the IA32_KERNEL_GS_BASE MSR instead of the
>> GS segment’s descriptor cache, which is exactly what Linux kernel
>> does to load a user level GS base. Thus there is no need to SWAPGS
>> away from the kernel GS base and an execution of SWAPGS causes #UD
>> if FRED transitions are enabled.
>>
>> A FRED CPU must enumerate LKGS. When LKGS is not available, FRED must
>> not be enabled.
>>
>> Signed-off-by: Xin Li <xin3.li@intel.com>
>> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
>> Tested-by: Shan Kang <shan.kang@intel.com>
>> ---
>> arch/x86/kvm/cpuid.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 5e4d4934c0d3..8f290273aee1 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -992,6 +992,8 @@ void kvm_set_cpu_caps(void)
>> F(FZRM),
>> F(FSRS),
>> F(FSRC),
>> + F(FRED),
>> + F(LKGS),
>
> These need to be X86_64_F, no?
Yes. Both LKGS and FRED are 64-bit only features.
However I assume KVM is 64-bit only now, so X86_64_F is essentially F,
right?
Thanks!
Xin
* Re: [PATCH v4 15/19] KVM: x86: Allow FRED/LKGS to be advertised to guests
2025-06-25 18:05 ` Xin Li
@ 2025-06-25 18:29 ` Sean Christopherson
0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2025-06-25 18:29 UTC (permalink / raw)
To: Xin Li
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Wed, Jun 25, 2025, Xin Li wrote:
> On 6/24/2025 9:38 AM, Sean Christopherson wrote:
> > > ---
> > > arch/x86/kvm/cpuid.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > index 5e4d4934c0d3..8f290273aee1 100644
> > > --- a/arch/x86/kvm/cpuid.c
> > > +++ b/arch/x86/kvm/cpuid.c
> > > @@ -992,6 +992,8 @@ void kvm_set_cpu_caps(void)
> > > F(FZRM),
> > > F(FSRS),
> > > F(FSRC),
> > > + F(FRED),
> > > + F(LKGS),
> >
> > These need to be X86_64_F, no?
>
> Yes. Both LKGS and FRED are 64-bit only features.
>
> However I assume KVM is 64-bit only now, so X86_64_F is essentially F,
> right?
Nope, KVM still supports 32-bit builds. There are plans/efforts to kill off 32-bit
KVM x86, but we're not quite there yet.
* Re: [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore
2025-06-25 17:18 ` Xin Li
@ 2025-06-26 17:22 ` Xin Li
2025-06-26 20:50 ` Sean Christopherson
0 siblings, 1 reply; 44+ messages in thread
From: Xin Li @ 2025-06-26 17:22 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On 6/25/2025 10:18 AM, Xin Li wrote:
>>
>> Maybe add helpers to deal with the preemption stuff? Oh, never mind,
>> FRED
>
> This is a good idea.
>
> Do you want to upstream the following patch?
As I have almost finished addressing your comments in my local repo, I
just sent out the patch.
It's based on the latest kvm-x86/vmx branch.
* Re: [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore
2025-06-26 17:22 ` Xin Li
@ 2025-06-26 20:50 ` Sean Christopherson
0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2025-06-26 20:50 UTC (permalink / raw)
To: Xin Li
Cc: pbonzini, kvm, linux-doc, linux-kernel, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, andrew.cooper3, luto, peterz, chao.gao,
xin3.li
On Thu, Jun 26, 2025, Xin Li wrote:
> On 6/25/2025 10:18 AM, Xin Li wrote:
> > >
> > > Maybe add helpers to deal with the preemption stuff? Oh, never
> > > mind, FRED
> >
> > This is a good idea.
> >
> > Do you want to upstream the following patch?
>
> As I have almost finished addressing your comments in my local repo, I
> just sent out the patch.
Saw it, and the LKGS patch. I'm OOO for a week, so I probably won't get them
applied for a couple weeks.
Thanks!
end of thread, other threads:[~2025-06-26 20:50 UTC | newest]
Thread overview: 44+ messages
2025-03-28 17:11 [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 01/19] KVM: VMX: Add support for the secondary VM exit controls Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 02/19] KVM: VMX: Initialize VM entry/exit FRED controls in vmcs_config Xin Li (Intel)
2025-04-14 7:41 ` Chao Gao
2025-04-14 16:53 ` Xin Li
2025-03-28 17:11 ` [PATCH v4 03/19] KVM: VMX: Disable FRED if FRED consistency checks fail Xin Li (Intel)
2025-06-24 15:20 ` Sean Christopherson
2025-03-28 17:11 ` [PATCH v4 04/19] x86/cea: Export per CPU array 'cea_exception_stacks' for KVM to use Xin Li (Intel)
2025-04-10 8:53 ` Christoph Hellwig
2025-04-10 14:18 ` Dave Hansen
2025-04-11 16:16 ` Xin Li
2025-03-28 17:11 ` [PATCH v4 05/19] KVM: VMX: Initialize VMCS FRED fields Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 06/19] KVM: VMX: Set FRED MSR interception Xin Li (Intel)
2025-06-24 15:27 ` Sean Christopherson
2025-03-28 17:11 ` [PATCH v4 07/19] KVM: VMX: Save/restore guest FRED RSP0 Xin Li (Intel)
2025-06-24 15:44 ` Sean Christopherson
2025-03-28 17:11 ` [PATCH v4 08/19] KVM: VMX: Add support for FRED context save/restore Xin Li (Intel)
2025-06-24 16:27 ` Sean Christopherson
2025-06-25 17:18 ` Xin Li
2025-06-26 17:22 ` Xin Li
2025-06-26 20:50 ` Sean Christopherson
2025-03-28 17:11 ` [PATCH v4 09/19] KVM: x86: Add a helper to detect if FRED is enabled for a vCPU Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 10/19] KVM: VMX: Virtualize FRED event_data Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 11/19] KVM: VMX: Virtualize FRED nested exception tracking Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 12/19] KVM: x86: Save/restore the nested flag of an exception Xin Li (Intel)
2025-03-28 17:11 ` [PATCH v4 13/19] KVM: x86: Mark CR4.FRED as not reserved Xin Li (Intel)
2025-03-28 17:12 ` [PATCH v4 14/19] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li (Intel)
2025-06-24 16:32 ` Sean Christopherson
2025-06-25 17:38 ` Xin Li
2025-03-28 17:12 ` [PATCH v4 15/19] KVM: x86: Allow FRED/LKGS to be advertised to guests Xin Li (Intel)
2025-06-24 16:38 ` Sean Christopherson
2025-06-25 18:05 ` Xin Li
2025-06-25 18:29 ` Sean Christopherson
2025-03-28 17:12 ` [PATCH v4 16/19] KVM: nVMX: Add support for the secondary VM exit controls Xin Li (Intel)
2025-06-24 16:54 ` Sean Christopherson
2025-03-28 17:12 ` [PATCH v4 17/19] KVM: nVMX: Add FRED VMCS fields to nested VMX context management Xin Li (Intel)
2025-03-28 17:12 ` [PATCH v4 18/19] KVM: nVMX: Add VMCS FRED states checking Xin Li (Intel)
2025-03-28 17:12 ` [PATCH v4 19/19] KVM: nVMX: Allow VMX FRED controls Xin Li (Intel)
2025-03-28 17:25 ` [PATCH v4 00/19] Enable FRED with KVM VMX Xin Li
2025-06-24 17:06 ` Sean Christopherson
2025-06-24 17:43 ` Xin Li
2025-06-24 17:47 ` H. Peter Anvin
2025-06-24 18:02 ` Xin Li
2025-06-24 18:40 ` H. Peter Anvin