* [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening
@ 2025-11-10 22:29 Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 01/13] KVM: SVM: Switch svm_copy_lbrs() to a macro Yosry Ahmed
` (12 more replies)
0 siblings, 13 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
A group of semi-related fixes, cleanups, and hardening patches for nSVM.
This series is based on kvm/master.
Patches 1-3 here are v2 of the last 3 patches in in the LBRV fixes
series [1]. The first 3 patches of [1] are already in kvm/master. The
rest of this series is v2 of [2].
Patches 4-6 fix or add missing consistency checks.
Patches 7-8 are renames to clarify some VMCB fields.
Patches 9-12 add hardening to reading the VMCB12, caching all used
fields in the save area to prevent theoritical TOC-TOU bugs, sanitizing
used fields in the control area, and restricting accesses to the VMCB12
through guest memory.
Patch 13 further restricts fields copied from VMCB01 to VMCB12.
v1 -> v2:
- Prepended some patches from the LBRV series.
- Used nested_npt_enabled() to guard consistency checks in patch 4.
- Best effort attempt to dynamically determine supported exception
vectors in patch 6.
- Commit logs massaging and minor nits.
[1]https://lore.kernel.org/kvm/20251108004524.1600006-1-yosry.ahmed@linux.dev/
[2]https://lore.kernel.org/kvm/20251104195949.3528411-1-yosry.ahmed@linux.dev/
Yosry Ahmed (13):
KVM: SVM: Switch svm_copy_lbrs() to a macro
KVM: SVM: Add missing save/restore handling of LBR MSRs
KVM: selftests: Add a test for LBR save/restore (ft. nested)
KVM: nSVM: Fix consistency checks for NP_ENABLE
KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS
KVM: nSVM: Add missing consistency check for event_inj
KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl
KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2
KVM: nSVM: Cache all used fields from VMCB12
KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
KVM: nSVM: Simplify nested_svm_vmrun()
KVM: nSVM: Sanitize control fields copied from VMCB12
KVM: nSVM: Only copy NP_ENABLE from VMCB01's misc_ctl
arch/x86/include/asm/svm.h | 26 +-
arch/x86/kvm/svm/nested.c | 349 ++++++++++++------
arch/x86/kvm/svm/sev.c | 4 +-
arch/x86/kvm/svm/svm.c | 57 +--
arch/x86/kvm/svm/svm.h | 46 ++-
arch/x86/kvm/x86.c | 3 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/include/x86/processor.h | 5 +
tools/testing/selftests/kvm/include/x86/svm.h | 14 +-
.../selftests/kvm/x86/svm_lbr_nested_state.c | 155 ++++++++
10 files changed, 490 insertions(+), 170 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v2 01/13] KVM: SVM: Switch svm_copy_lbrs() to a macro
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 02/13] KVM: SVM: Add missing save/restore handling of LBR MSRs Yosry Ahmed
` (11 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed,
stable
In preparation for using svm_copy_lbrs() with 'struct vmcb_save_area'
without a containing 'struct vmcb', and later even 'struct
vmcb_save_area_cached', make it a macro. Pull the call to
vmcb_mark_dirty() out to the callers.
Macros are generally not preferred compared to functions, mainly due to
type-safety. However, in this case it seems like having a simple macro
copying a few fields is better than copy-pasting the same 5 lines of
code in different places.
On the bright side, pulling vmcb_mark_dirty() calls to the callers makes
it clear that in one case, vmcb_mark_dirty() was being called on VMCB12.
It is not architecturally defined for the CPU to clear arbitrary clean
bits, and it is not needed, so drop that one call.
Technically fixes the non-architectural behavior of setting the dirty
bit on VMCB12.
Fixes: d20c796ca370 ("KVM: x86: nSVM: implement nested LBR virtualization")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 16 ++++++++++------
arch/x86/kvm/svm/svm.c | 11 -----------
arch/x86/kvm/svm/svm.h | 10 +++++++++-
3 files changed, 19 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index da6e80b3ac353..a37bd5c1f36fa 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -675,10 +675,12 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
* Reserved bits of DEBUGCTL are ignored. Be consistent with
* svm_set_msr's definition of reserved bits.
*/
- svm_copy_lbrs(vmcb02, vmcb12);
+ svm_copy_lbrs(&vmcb02->save, &vmcb12->save);
+ vmcb_mark_dirty(vmcb02, VMCB_LBR);
vmcb02->save.dbgctl &= ~DEBUGCTL_RESERVED_BITS;
} else {
- svm_copy_lbrs(vmcb02, vmcb01);
+ svm_copy_lbrs(&vmcb02->save, &vmcb01->save);
+ vmcb_mark_dirty(vmcb02, VMCB_LBR);
}
svm_update_lbrv(&svm->vcpu);
}
@@ -1184,10 +1186,12 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) &&
- (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK)))
- svm_copy_lbrs(vmcb12, vmcb02);
- else
- svm_copy_lbrs(vmcb01, vmcb02);
+ (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) {
+ svm_copy_lbrs(&vmcb12->save, &vmcb02->save);
+ } else {
+ svm_copy_lbrs(&vmcb01->save, &vmcb02->save);
+ vmcb_mark_dirty(vmcb01, VMCB_LBR);
+ }
svm_update_lbrv(vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 10c21e4c5406f..711276e8ee84f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -795,17 +795,6 @@ static void svm_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
*/
}
-void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
-{
- to_vmcb->save.dbgctl = from_vmcb->save.dbgctl;
- to_vmcb->save.br_from = from_vmcb->save.br_from;
- to_vmcb->save.br_to = from_vmcb->save.br_to;
- to_vmcb->save.last_excp_from = from_vmcb->save.last_excp_from;
- to_vmcb->save.last_excp_to = from_vmcb->save.last_excp_to;
-
- vmcb_mark_dirty(to_vmcb, VMCB_LBR);
-}
-
static void __svm_enable_lbrv(struct kvm_vcpu *vcpu)
{
to_svm(vcpu)->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index c856d8e0f95e7..f6fb70ddf7272 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -687,8 +687,16 @@ static inline void *svm_vcpu_alloc_msrpm(void)
return svm_alloc_permissions_map(MSRPM_SIZE, GFP_KERNEL_ACCOUNT);
}
+#define svm_copy_lbrs(to, from) \
+({ \
+ (to)->dbgctl = (from)->dbgctl; \
+ (to)->br_from = (from)->br_from; \
+ (to)->br_to = (from)->br_to; \
+ (to)->last_excp_from = (from)->last_excp_from; \
+ (to)->last_excp_to = (from)->last_excp_to; \
+})
+
void svm_vcpu_free_msrpm(void *msrpm);
-void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb);
void svm_enable_lbrv(struct kvm_vcpu *vcpu);
void svm_update_lbrv(struct kvm_vcpu *vcpu);
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 02/13] KVM: SVM: Add missing save/restore handling of LBR MSRs
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 01/13] KVM: SVM: Switch svm_copy_lbrs() to a macro Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 03/13] KVM: selftests: Add a test for LBR save/restore (ft. nested) Yosry Ahmed
` (10 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed,
stable
MSR_IA32_DEBUGCTLMSR and LBR MSRs are currently not enumerated by
KVM_GET_MSR_INDEX_LIST, and LBR MSRs cannot be set with KVM_SET_MSRS. So
save/restore is completely broken.
Fix it by adding the MSRs to msrs_to_save_base, and allowing writes to
LBR MSRs from userspace only (as they are read-only MSRs). Additionally,
to correctly restore L1's LBRs while L2 is running, make sure the LBRs
are copied from the captured VMCB01 save area in svm_copy_vmrun_state().
Fixes: 24e09cbf480a ("KVM: SVM: enable LBR virtualization")
Cc: stable@vger.kernel.org
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 3 +++
arch/x86/kvm/svm/svm.c | 20 ++++++++++++++++++++
arch/x86/kvm/x86.c | 3 +++
3 files changed, 26 insertions(+)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index a37bd5c1f36fa..74211c5c68026 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1055,6 +1055,9 @@ void svm_copy_vmrun_state(struct vmcb_save_area *to_save,
to_save->isst_addr = from_save->isst_addr;
to_save->ssp = from_save->ssp;
}
+
+ if (lbrv)
+ svm_copy_lbrs(to_save, from_save);
}
void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 711276e8ee84f..af0e9c26527e3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2983,6 +2983,26 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
vmcb_mark_dirty(svm->vmcb, VMCB_LBR);
svm_update_lbrv(vcpu);
break;
+ case MSR_IA32_LASTBRANCHFROMIP:
+ if (!msr->host_initiated)
+ return 1;
+ svm->vmcb->save.br_from = data;
+ break;
+ case MSR_IA32_LASTBRANCHTOIP:
+ if (!msr->host_initiated)
+ return 1;
+ svm->vmcb->save.br_to = data;
+ break;
+ case MSR_IA32_LASTINTFROMIP:
+ if (!msr->host_initiated)
+ return 1;
+ svm->vmcb->save.last_excp_from = data;
+ break;
+ case MSR_IA32_LASTINTTOIP:
+ if (!msr->host_initiated)
+ return 1;
+ svm->vmcb->save.last_excp_to = data;
+ break;
case MSR_VM_HSAVE_PA:
/*
* Old kernels did not validate the value written to
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9c2aa6f4705e..9cb824f9cf644 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -348,6 +348,9 @@ static const u32 msrs_to_save_base[] = {
MSR_IA32_U_CET, MSR_IA32_S_CET,
MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP,
MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB,
+ MSR_IA32_DEBUGCTLMSR,
+ MSR_IA32_LASTBRANCHFROMIP, MSR_IA32_LASTBRANCHTOIP,
+ MSR_IA32_LASTINTFROMIP, MSR_IA32_LASTINTTOIP,
};
static const u32 msrs_to_save_pmu[] = {
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 03/13] KVM: selftests: Add a test for LBR save/restore (ft. nested)
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 01/13] KVM: SVM: Switch svm_copy_lbrs() to a macro Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 02/13] KVM: SVM: Add missing save/restore handling of LBR MSRs Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE Yosry Ahmed
` (9 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
Add a selftest exercising save/restore with usage of LBRs in both L1 and
L2, and making sure all LBRs remain intact.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/include/x86/processor.h | 5 +
.../selftests/kvm/x86/svm_lbr_nested_state.c | 155 ++++++++++++++++++
3 files changed, 161 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 148d427ff24be..9a19554ffd3c1 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -105,6 +105,7 @@ TEST_GEN_PROGS_x86 += x86/svm_vmcall_test
TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test
TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test
TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test
+TEST_GEN_PROGS_x86 += x86/svm_lbr_nested_state
TEST_GEN_PROGS_x86 += x86/tsc_scaling_sync
TEST_GEN_PROGS_x86 += x86/sync_regs_test
TEST_GEN_PROGS_x86 += x86/ucna_injection_test
diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index 51cd84b9ca664..aee4b83c47b19 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -1367,6 +1367,11 @@ static inline bool kvm_is_ignore_msrs(void)
return get_kvm_param_bool("ignore_msrs");
}
+static inline bool kvm_is_lbrv_enabled(void)
+{
+ return !!get_kvm_amd_param_integer("lbrv");
+}
+
uint64_t *__vm_get_page_table_entry(struct kvm_vm *vm, uint64_t vaddr,
int *level);
uint64_t *vm_get_page_table_entry(struct kvm_vm *vm, uint64_t vaddr);
diff --git a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c
new file mode 100644
index 0000000000000..a343279546fd8
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * svm_lbr_nested_state
+ *
+ * Test that LBRs are maintained correctly in both L1 and L2 during
+ * save/restore.
+ *
+ * Copyright (C) 2025, Google, Inc.
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "svm_util.h"
+
+
+#define L2_GUEST_STACK_SIZE 64
+
+#define DO_BRANCH() asm volatile("jmp 1f\n 1: nop")
+
+struct lbr_branch {
+ u64 from, to;
+};
+
+volatile struct lbr_branch l2_branch;
+
+#define RECORD_BRANCH(b, s) \
+({ \
+ wrmsr(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR); \
+ DO_BRANCH(); \
+ (b)->from = rdmsr(MSR_IA32_LASTBRANCHFROMIP); \
+ (b)->to = rdmsr(MSR_IA32_LASTBRANCHTOIP); \
+ /* Disabe LBR right after to avoid overriding the IPs */ \
+ wrmsr(MSR_IA32_DEBUGCTLMSR, 0); \
+ \
+ GUEST_ASSERT_NE((b)->from, 0); \
+ GUEST_ASSERT_NE((b)->to, 0); \
+ GUEST_PRINTF("%s: (0x%lx, 0x%lx)\n", (s), (b)->from, (b)->to); \
+}) \
+
+#define CHECK_BRANCH_MSRS(b) \
+({ \
+ GUEST_ASSERT_EQ((b)->from, rdmsr(MSR_IA32_LASTBRANCHFROMIP)); \
+ GUEST_ASSERT_EQ((b)->to, rdmsr(MSR_IA32_LASTBRANCHTOIP)); \
+})
+
+#define CHECK_BRANCH_VMCB(b, vmcb) \
+({ \
+ GUEST_ASSERT_EQ((b)->from, vmcb->save.br_from); \
+ GUEST_ASSERT_EQ((b)->to, vmcb->save.br_to); \
+}) \
+
+static void l2_guest_code(struct svm_test_data *svm)
+{
+ /* Record a branch, trigger save/restore, and make sure LBRs are intact */
+ RECORD_BRANCH(&l2_branch, "L2 branch");
+ GUEST_SYNC(true);
+ CHECK_BRANCH_MSRS(&l2_branch);
+ vmmcall();
+}
+
+static void l1_guest_code(struct svm_test_data *svm, bool nested_lbrv)
+{
+ unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+ struct vmcb *vmcb = svm->vmcb;
+ struct lbr_branch l1_branch;
+
+ /* Record a branch, trigger save/restore, and make sure LBRs are intact */
+ RECORD_BRANCH(&l1_branch, "L1 branch");
+ GUEST_SYNC(true);
+ CHECK_BRANCH_MSRS(&l1_branch);
+
+ /* Run L2, which will also do the same */
+ generic_svm_setup(svm, l2_guest_code,
+ &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+ if (nested_lbrv)
+ vmcb->control.virt_ext = LBR_CTL_ENABLE_MASK;
+ else
+ vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK;
+
+ run_guest(vmcb, svm->vmcb_gpa);
+ GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL);
+
+ /* Trigger save/restore one more time before checking, just for kicks */
+ GUEST_SYNC(true);
+
+ /*
+ * If LBR_CTL_ENABLE is set, L1 and L2 should have separate LBR MSRs, so
+ * expect L1's LBRs to remain intact and L2 LBRs to be in the VMCB.
+ * Otherwise, the MSRs are shared between L1 & L2 so expect L2's LBRs.
+ */
+ if (nested_lbrv) {
+ CHECK_BRANCH_MSRS(&l1_branch);
+ CHECK_BRANCH_VMCB(&l2_branch, vmcb);
+ } else {
+ CHECK_BRANCH_MSRS(&l2_branch);
+ }
+ GUEST_DONE();
+}
+
+void test_lbrv_nested_state(bool nested_lbrv)
+{
+ struct kvm_x86_state *state = NULL;
+ struct kvm_vcpu *vcpu;
+ vm_vaddr_t svm_gva;
+ struct kvm_vm *vm;
+ struct ucall uc;
+
+ pr_info("Testing with nested LBRV %s\n", nested_lbrv ? "enabled" : "disabled");
+
+ vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
+ vcpu_alloc_svm(vm, &svm_gva);
+ vcpu_args_set(vcpu, 2, svm_gva, nested_lbrv);
+
+ for (;;) {
+ vcpu_run(vcpu);
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_SYNC:
+ /* Save the vCPU state and restore it in a new VM on sync */
+ pr_info("Guest triggered save/restore.\n");
+ state = vcpu_save_state(vcpu);
+ kvm_vm_release(vm);
+ vcpu = vm_recreate_with_one_vcpu(vm);
+ vcpu_load_state(vcpu, state);
+ break;
+ case UCALL_ABORT:
+ REPORT_GUEST_ASSERT(uc);
+ /* NOT REACHED */
+ case UCALL_DONE:
+ goto done;
+ case UCALL_PRINTF:
+ pr_info("%s", uc.buffer);
+ break;
+ default:
+ TEST_FAIL("Unknown ucall %lu", uc.cmd);
+ }
+ }
+done:
+ if (state)
+ kvm_x86_state_cleanup(state);
+ kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+ TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM));
+ TEST_REQUIRE(kvm_is_lbrv_enabled());
+
+ test_lbrv_nested_state(/*nested_lbrv=*/false);
+ test_lbrv_nested_state(/*nested_lbrv=*/true);
+
+ return 0;
+}
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (2 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 03/13] KVM: selftests: Add a test for LBR save/restore (ft. nested) Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-12-09 16:27 ` Sean Christopherson
2025-11-10 22:29 ` [PATCH v2 05/13] KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS Yosry Ahmed
` (8 subsequent siblings)
12 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed,
stable
KVM currenty fails a nested VMRUN and injects VMEXIT_INVALID (aka
SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs.
On first glance, it seems like the check should actually be for
guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the
host to support NPTs but the guest CPUID to not advertise it.
However, the consistency check is not architectural to begin with. The
APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor
that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored
if X86_FEATURE_NPT is not available for L1. Apart from the consistency
check, this is currently the case because NP_ENABLE is actually copied
from VMCB01 to VMCB02, not from VMCB12.
On the other hand, the APM does mention two other consistency checks for
NP_ENABLE, both of which are missing (paraphrased):
In Volume #2, 15.25.3 (24593—Rev. 3.42—March 2024):
If VMRUN is executed with hCR0.PG cleared to zero and NP_ENABLE set to
1, VMRUN terminates with #VMEXIT(VMEXIT_INVALID)
In Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024):
When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the
following conditions are considered illegal state combinations, in
addition to those mentioned in “Canonicalization and Consistency
Checks”:
• Any MBZ bit of nCR3 is set.
• Any G_PAT.PA field has an unsupported type encoding or any
reserved field in G_PAT has a nonzero value.
Replace the existing consistency check with consistency checks on
hCR0.PG and nCR3. Only perform the consistency checks if L1 has
X86_FEATURE_NPT and NP_ENABLE is set in VMCB12. The G_PAT consistency
check will be addressed separately.
As it is now possible for an L1 to run L2 with NP_ENABLE set but
ignored, also check that L1 has X86_FEATURE_NPT in nested_npt_enabled().
Pass L1's CR0 to __nested_vmcb_check_controls(). In
nested_vmcb_check_controls(), L1's CR0 is available through
kvm_read_cr0(), as vcpu->arch.cr0 is not updated to L2's CR0 until later
through nested_vmcb02_prepare_save() -> svm_set_cr0().
In svm_set_nested_state(), L1's CR0 is available in the captured save
area, as svm_get_nested_state() captures L1's save area when running L2,
and L1's CR0 is stashed in VMCB01 on nested VMRUN (in
nested_svm_vmrun()).
Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 21 ++++++++++++++++-----
arch/x86/kvm/svm/svm.h | 3 ++-
2 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 74211c5c68026..87bcc5eff96e8 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -325,7 +325,8 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size)
}
static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
- struct vmcb_ctrl_area_cached *control)
+ struct vmcb_ctrl_area_cached *control,
+ unsigned long l1_cr0)
{
if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
return false;
@@ -333,8 +334,12 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
if (CC(control->asid == 0))
return false;
- if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && !npt_enabled))
- return false;
+ if (nested_npt_enabled(to_svm(vcpu))) {
+ if (CC(!kvm_vcpu_is_legal_gpa(vcpu, control->nested_cr3)))
+ return false;
+ if (CC(!(l1_cr0 & X86_CR0_PG)))
+ return false;
+ }
if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa,
MSRPM_SIZE)))
@@ -400,7 +405,12 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
- return __nested_vmcb_check_controls(vcpu, ctl);
+ /*
+ * Make sure we did not enter guest mode yet, in which case
+ * kvm_read_cr0() could return L2's CR0.
+ */
+ WARN_ON_ONCE(is_guest_mode(vcpu));
+ return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
}
static
@@ -1831,7 +1841,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
ret = -EINVAL;
__nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
- if (!__nested_vmcb_check_controls(vcpu, &ctl_cached))
+ /* 'save' contains L1 state saved from before VMRUN */
+ if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
goto out_free;
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f6fb70ddf7272..3e805a43ffcdb 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
static inline bool nested_npt_enabled(struct vcpu_svm *svm)
{
- return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
+ return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
+ svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
}
static inline bool nested_vnmi_enabled(struct vcpu_svm *svm)
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 05/13] KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (3 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 06/13] KVM: nSVM: Add missing consistency check for event_inj Yosry Ahmed
` (7 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
According to the APM Volume #2, 15.5, Canonicalization and Consistency
Checks (24593—Rev. 3.42—March 2024), the following condition (among
others) results in a #VMEXIT with VMEXIT_INVALID (aka SVM_EXIT_ERR):
EFER.LME, CR0.PG, CR4.PAE, CS.L, and CS.D are all non-zero.
Add the missing consistency check. This is functionally a nop because
the nested VMRUN results in SVM_EXIT_ERR in HW, which is forwarded to
L1, but KVM makes all consistency checks before a VMRUN is actually
attempted.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 7 +++++++
arch/x86/kvm/svm/svm.h | 1 +
2 files changed, 8 insertions(+)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 87bcc5eff96e8..abdaacb04dd9e 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -380,6 +380,11 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
CC(!(save->cr0 & X86_CR0_PE)) ||
CC(!kvm_vcpu_is_legal_cr3(vcpu, save->cr3)))
return false;
+
+ if (CC((save->cr4 & X86_CR4_PAE) &&
+ (save->cs.attrib & SVM_SELECTOR_L_MASK) &&
+ (save->cs.attrib & SVM_SELECTOR_DB_MASK)))
+ return false;
}
/* Note, SVM doesn't have any additional restrictions on CR4. */
@@ -473,6 +478,8 @@ static void __nested_copy_vmcb_save_to_cache(struct vmcb_save_area_cached *to,
* Copy only fields that are validated, as we need them
* to avoid TOC/TOU races.
*/
+ to->cs = from->cs;
+
to->efer = from->efer;
to->cr0 = from->cr0;
to->cr3 = from->cr3;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 3e805a43ffcdb..a6913a0820125 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -142,6 +142,7 @@ struct kvm_vmcb_info {
};
struct vmcb_save_area_cached {
+ struct vmcb_seg cs;
u64 efer;
u64 cr4;
u64 cr3;
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 06/13] KVM: nSVM: Add missing consistency check for event_inj
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (4 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 05/13] KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 07/13] KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl Yosry Ahmed
` (6 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
According to the APM Volume #2, 15.20 (24593—Rev. 3.42—March 2024):
VMRUN exits with VMEXIT_INVALID error code if either:
• Reserved values of TYPE have been specified, or
• TYPE = 3 (exception) has been specified with a vector that does not
correspond to an exception (this includes vector 2, which is an NMI,
not an exception).
Add the missing consistency checks to KVM. For the second point, inject
VMEXIT_INVALID if the vector is anything but the vectors defined by the
APM for exceptions. Reserved vectors are also considered invalid, which
matches the HW behavior. Vector 9 (i.e. #CSO) is considered invalid
because it is reserved on modern CPUs, and according to LLMs no CPUs
exist supporting SVM and producing #CSOs.
Defined exceptions could be different between virtual CPUs as new CPUs
define new vectors. In a best effort to dynamically define the valid
vectors, make all currently defined vectors as valid except those
obviously tied to a CPU feature: SHSTK -> #CP and SEV-ES -> #VC. As new
vectors are defined, they can similarly be tied to corresponding CPU
features.
Invalid vectors on specific (e.g. old) CPUs that are missed by KVM
should be rejected by HW anyway.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 51 +++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index abdaacb04dd9e..418d6aa4e32e8 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -324,6 +324,54 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size)
kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1);
}
+static bool nested_svm_event_inj_valid_exept(struct kvm_vcpu *vcpu, u8 vector)
+{
+ /*
+ * Vectors that do not correspond to a defined exception are invalid
+ * (including #NMI and reserved vectors). In a best to define valid
+ * exceptions based on the virtual CPU, make all exceptions always valid
+ * except those obviously tied to a CPU feature.
+ */
+ switch (vector) {
+ case DE_VECTOR: case DB_VECTOR: case BP_VECTOR: case OF_VECTOR:
+ case BR_VECTOR: case UD_VECTOR: case NM_VECTOR: case DF_VECTOR:
+ case TS_VECTOR: case NP_VECTOR: case SS_VECTOR: case GP_VECTOR:
+ case PF_VECTOR: case MF_VECTOR: case AC_VECTOR: case MC_VECTOR:
+ case XM_VECTOR: case HV_VECTOR: case SX_VECTOR:
+ return true;
+ case CP_VECTOR:
+ return guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK);
+ case VC_VECTOR:
+ return guest_cpu_cap_has(vcpu, X86_FEATURE_SEV_ES);
+ }
+ return false;
+}
+
+/*
+ * According to the APM, VMRUN exits with SVM_EXIT_ERR if SVM_EVTINJ_VALID is
+ * set and:
+ * - The type of event_inj is not one of the defined values.
+ * - The type is SVM_EVTINJ_TYPE_EXEPT, but the vector is not a valid exception.
+ */
+static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
+{
+ u32 type = event_inj & SVM_EVTINJ_TYPE_MASK;
+ u8 vector = event_inj & SVM_EVTINJ_VEC_MASK;
+
+ if (!(event_inj & SVM_EVTINJ_VALID))
+ return true;
+
+ if (type != SVM_EVTINJ_TYPE_INTR && type != SVM_EVTINJ_TYPE_NMI &&
+ type != SVM_EVTINJ_TYPE_EXEPT && type != SVM_EVTINJ_TYPE_SOFT)
+ return false;
+
+ if (type == SVM_EVTINJ_TYPE_EXEPT &&
+ !nested_svm_event_inj_valid_exept(vcpu, vector))
+ return false;
+
+ return true;
+}
+
static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
struct vmcb_ctrl_area_cached *control,
unsigned long l1_cr0)
@@ -353,6 +401,9 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
return false;
}
+ if (CC(!nested_svm_check_event_inj(vcpu, control->event_inj)))
+ return false;
+
return true;
}
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 07/13] KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (5 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 06/13] KVM: nSVM: Add missing consistency check for event_inj Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 08/13] KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2 Yosry Ahmed
` (5 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
The 'nested_ctl' field is misnamed. Although the first bit is for nested
paging, the other defined bits are for SEV/SEV-ES. Other bits in the
same field according to the APM (but not defined by KVM) include "Guest
Mode Execution Trap", "Enable INVLPGB/TLBSYNC", and other control bits
unrelated to 'nested'.
There is nothing common among these bits, so just name the field
misc_ctl. Also rename the flags accordingly.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/include/asm/svm.h | 8 ++++----
arch/x86/kvm/svm/nested.c | 8 ++++----
arch/x86/kvm/svm/sev.c | 4 ++--
arch/x86/kvm/svm/svm.c | 4 ++--
arch/x86/kvm/svm/svm.h | 4 ++--
tools/testing/selftests/kvm/include/x86/svm.h | 6 +++---
6 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 17f6c3fedeee7..76ec1d40e6461 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -142,7 +142,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u64 exit_info_2;
u32 exit_int_info;
u32 exit_int_info_err;
- u64 nested_ctl;
+ u64 misc_ctl;
u64 avic_vapic_bar;
u64 ghcb_gpa;
u32 event_inj;
@@ -236,9 +236,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
#define SVM_IOIO_SIZE_MASK (7 << SVM_IOIO_SIZE_SHIFT)
#define SVM_IOIO_ASIZE_MASK (7 << SVM_IOIO_ASIZE_SHIFT)
-#define SVM_NESTED_CTL_NP_ENABLE BIT(0)
-#define SVM_NESTED_CTL_SEV_ENABLE BIT(1)
-#define SVM_NESTED_CTL_SEV_ES_ENABLE BIT(2)
+#define SVM_MISC_CTL_NP_ENABLE BIT(0)
+#define SVM_MISC_CTL_SEV_ENABLE BIT(1)
+#define SVM_MISC_CTL_SEV_ES_ENABLE BIT(2)
#define SVM_TSC_RATIO_RSVD 0xffffff0000000000ULL
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 418d6aa4e32e8..2a5c3788f954b 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -492,7 +492,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
to->exit_info_2 = from->exit_info_2;
to->exit_int_info = from->exit_int_info;
to->exit_int_info_err = from->exit_int_info_err;
- to->nested_ctl = from->nested_ctl;
+ to->misc_ctl = from->misc_ctl;
to->event_inj = from->event_inj;
to->event_inj_err = from->event_inj_err;
to->next_rip = from->next_rip;
@@ -818,7 +818,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
}
/* Copied from vmcb01. msrpm_base can be overwritten later. */
- vmcb02->control.nested_ctl = vmcb01->control.nested_ctl;
+ vmcb02->control.misc_ctl = vmcb01->control.misc_ctl;
vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
@@ -964,7 +964,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
vmcb12->save.rip,
vmcb12->control.int_ctl,
vmcb12->control.event_inj,
- vmcb12->control.nested_ctl,
+ vmcb12->control.misc_ctl,
vmcb12->control.nested_cr3,
vmcb12->save.cr3,
KVM_ISA_SVM);
@@ -1759,7 +1759,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
dst->exit_info_2 = from->exit_info_2;
dst->exit_int_info = from->exit_int_info;
dst->exit_int_info_err = from->exit_int_info_err;
- dst->nested_ctl = from->nested_ctl;
+ dst->misc_ctl = from->misc_ctl;
dst->event_inj = from->event_inj;
dst->event_inj_err = from->event_inj_err;
dst->next_rip = from->next_rip;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0835c664fbfdb..4eff5cc43821a 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4553,7 +4553,7 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm, bool init_event)
struct kvm_sev_info *sev = to_kvm_sev_info(svm->vcpu.kvm);
struct vmcb *vmcb = svm->vmcb01.ptr;
- svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ES_ENABLE;
+ svm->vmcb->control.misc_ctl |= SVM_MISC_CTL_SEV_ES_ENABLE;
/*
* An SEV-ES guest requires a VMSA area that is a separate from the
@@ -4624,7 +4624,7 @@ void sev_init_vmcb(struct vcpu_svm *svm, bool init_event)
{
struct kvm_vcpu *vcpu = &svm->vcpu;
- svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ENABLE;
+ svm->vmcb->control.misc_ctl |= SVM_MISC_CTL_SEV_ENABLE;
clr_exception_intercept(svm, UD_VECTOR);
/*
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index af0e9c26527e3..b5b4965e6bfdd 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1098,7 +1098,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event)
if (npt_enabled) {
/* Setup VMCB for Nested Paging */
- control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
+ control->misc_ctl |= SVM_MISC_CTL_NP_ENABLE;
svm_clr_intercept(svm, INTERCEPT_INVLPG);
clr_exception_intercept(svm, PF_VECTOR);
svm_clr_intercept(svm, INTERCEPT_CR3_READ);
@@ -3273,7 +3273,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
pr_err("%-20s%016llx\n", "exit_info2:", control->exit_info_2);
pr_err("%-20s%08x\n", "exit_int_info:", control->exit_int_info);
pr_err("%-20s%08x\n", "exit_int_info_err:", control->exit_int_info_err);
- pr_err("%-20s%lld\n", "nested_ctl:", control->nested_ctl);
+ pr_err("%-20s%lld\n", "misc_ctl:", control->misc_ctl);
pr_err("%-20s%016llx\n", "nested_cr3:", control->nested_cr3);
pr_err("%-20s%016llx\n", "avic_vapic_bar:", control->avic_vapic_bar);
pr_err("%-20s%016llx\n", "ghcb:", control->ghcb_gpa);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index a6913a0820125..861ed9c33977b 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -169,7 +169,7 @@ struct vmcb_ctrl_area_cached {
u64 exit_info_2;
u32 exit_int_info;
u32 exit_int_info_err;
- u64 nested_ctl;
+ u64 misc_ctl;
u32 event_inj;
u32 event_inj_err;
u64 next_rip;
@@ -554,7 +554,7 @@ static inline bool gif_set(struct vcpu_svm *svm)
static inline bool nested_npt_enabled(struct vcpu_svm *svm)
{
return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
- svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
+ svm->nested.ctl.misc_ctl & SVM_MISC_CTL_NP_ENABLE;
}
static inline bool nested_vnmi_enabled(struct vcpu_svm *svm)
diff --git a/tools/testing/selftests/kvm/include/x86/svm.h b/tools/testing/selftests/kvm/include/x86/svm.h
index 29cffd0a91816..5d2bcce34c019 100644
--- a/tools/testing/selftests/kvm/include/x86/svm.h
+++ b/tools/testing/selftests/kvm/include/x86/svm.h
@@ -98,7 +98,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u64 exit_info_2;
u32 exit_int_info;
u32 exit_int_info_err;
- u64 nested_ctl;
+ u64 misc_ctl;
u64 avic_vapic_bar;
u8 reserved_4[8];
u32 event_inj;
@@ -176,8 +176,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
#define SVM_VM_CR_SVM_LOCK_MASK 0x0008ULL
#define SVM_VM_CR_SVM_DIS_MASK 0x0010ULL
-#define SVM_NESTED_CTL_NP_ENABLE BIT(0)
-#define SVM_NESTED_CTL_SEV_ENABLE BIT(1)
+#define SVM_MISC_CTL_CTL_NP_ENABLE BIT(0)
+#define SVM_MISC_CTL_SEV_ENABLE BIT(1)
struct __attribute__ ((__packed__)) vmcb_seg {
u16 selector;
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 08/13] KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (6 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 07/13] KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 09/13] KVM: nSVM: Cache all used fields from VMCB12 Yosry Ahmed
` (4 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
'virt' is confusing in the VMCB because it is relative and ambiguous.
The 'virt_ext' field includes bits for LBR virtualization and
VMSAVE/VMLOAD virtualization, so it's just another miscellaneous control
field. Name it as such.
While at it, move the definitions of the bits below those for
'misc_ctl' and rename them for consistency.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/include/asm/svm.h | 7 +++----
arch/x86/kvm/svm/nested.c | 18 ++++++++---------
arch/x86/kvm/svm/svm.c | 20 +++++++++----------
arch/x86/kvm/svm/svm.h | 2 +-
tools/testing/selftests/kvm/include/x86/svm.h | 8 ++++----
.../selftests/kvm/x86/svm_lbr_nested_state.c | 4 ++--
6 files changed, 29 insertions(+), 30 deletions(-)
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 76ec1d40e6461..a842018952d2c 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -148,7 +148,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u32 event_inj;
u32 event_inj_err;
u64 nested_cr3;
- u64 virt_ext;
+ u64 misc_ctl2;
u32 clean;
u32 reserved_5;
u64 next_rip;
@@ -219,9 +219,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
#define X2APIC_MODE_SHIFT 30
#define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT)
-#define LBR_CTL_ENABLE_MASK BIT_ULL(0)
-#define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK BIT_ULL(1)
-
#define SVM_INTERRUPT_SHADOW_MASK BIT_ULL(0)
#define SVM_GUEST_INTERRUPT_MASK BIT_ULL(1)
@@ -240,6 +237,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
#define SVM_MISC_CTL_SEV_ENABLE BIT(1)
#define SVM_MISC_CTL_SEV_ES_ENABLE BIT(2)
+#define SVM_MISC_CTL2_LBR_CTL_ENABLE BIT_ULL(0)
+#define SVM_MISC_CTL2_V_VMLOAD_VMSAVE_ENABLE BIT_ULL(1)
#define SVM_TSC_RATIO_RSVD 0xffffff0000000000ULL
#define SVM_TSC_RATIO_MIN 0x0000000000000001ULL
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 2a5c3788f954b..b8d65832c64de 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -117,7 +117,7 @@ static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm)
if (!nested_npt_enabled(svm))
return true;
- if (!(svm->nested.ctl.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK))
+ if (!(svm->nested.ctl.misc_ctl2 & SVM_MISC_CTL2_V_VMLOAD_VMSAVE_ENABLE))
return true;
return false;
@@ -180,7 +180,7 @@ void recalc_intercepts(struct vcpu_svm *svm)
vmcb_set_intercept(c, INTERCEPT_VMLOAD);
vmcb_set_intercept(c, INTERCEPT_VMSAVE);
} else {
- WARN_ON(!(c->virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));
+ WARN_ON(!(c->misc_ctl2 & SVM_MISC_CTL2_V_VMLOAD_VMSAVE_ENABLE));
}
}
@@ -497,7 +497,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
to->event_inj_err = from->event_inj_err;
to->next_rip = from->next_rip;
to->nested_cr3 = from->nested_cr3;
- to->virt_ext = from->virt_ext;
+ to->misc_ctl2 = from->misc_ctl2;
to->pause_filter_count = from->pause_filter_count;
to->pause_filter_thresh = from->pause_filter_thresh;
@@ -738,7 +738,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
}
if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) &&
- (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) {
+ (svm->nested.ctl.misc_ctl2 & SVM_MISC_CTL2_LBR_CTL_ENABLE))) {
/*
* Reserved bits of DEBUGCTL are ignored. Be consistent with
* svm_set_msr's definition of reserved bits.
@@ -902,10 +902,10 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
svm->soft_int_next_rip = vmcb12_rip;
}
- /* LBR_CTL_ENABLE_MASK is controlled by svm_update_lbrv() */
+ /* SVM_MISC_CTL2_LBR_CTL_ENABLE is controlled by svm_update_lbrv() */
if (!nested_vmcb_needs_vls_intercept(svm))
- vmcb02->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+ vmcb02->control.misc_ctl2 |= SVM_MISC_CTL2_V_VMLOAD_VMSAVE_ENABLE;
if (guest_cpu_cap_has(vcpu, X86_FEATURE_PAUSEFILTER))
pause_count12 = svm->nested.ctl.pause_filter_count;
@@ -1257,7 +1257,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) &&
- (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) {
+ (svm->nested.ctl.misc_ctl2 & SVM_MISC_CTL2_LBR_CTL_ENABLE))) {
svm_copy_lbrs(&vmcb12->save, &vmcb02->save);
} else {
svm_copy_lbrs(&vmcb01->save, &vmcb02->save);
@@ -1763,8 +1763,8 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
dst->event_inj = from->event_inj;
dst->event_inj_err = from->event_inj_err;
dst->next_rip = from->next_rip;
- dst->nested_cr3 = from->nested_cr3;
- dst->virt_ext = from->virt_ext;
+ dst->nested_cr3 = from->nested_cr3;
+ dst->misc_ctl2 = from->misc_ctl2;
dst->pause_filter_count = from->pause_filter_count;
dst->pause_filter_thresh = from->pause_filter_thresh;
/* 'clean' and 'hv_enlightenments' are not changed by KVM */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b5b4965e6bfdd..9789f7e72ae97 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -705,7 +705,7 @@ void *svm_alloc_permissions_map(unsigned long size, gfp_t gfp_mask)
static void svm_recalc_lbr_msr_intercepts(struct kvm_vcpu *vcpu)
{
- bool intercept = !(to_svm(vcpu)->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK);
+ bool intercept = !(to_svm(vcpu)->vmcb->control.misc_ctl2 & SVM_MISC_CTL2_LBR_CTL_ENABLE);
svm_set_intercept_for_msr(vcpu, MSR_IA32_LASTBRANCHFROMIP, MSR_TYPE_RW, intercept);
svm_set_intercept_for_msr(vcpu, MSR_IA32_LASTBRANCHTOIP, MSR_TYPE_RW, intercept);
@@ -797,7 +797,7 @@ static void svm_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
static void __svm_enable_lbrv(struct kvm_vcpu *vcpu)
{
- to_svm(vcpu)->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
+ to_svm(vcpu)->vmcb->control.misc_ctl2 |= SVM_MISC_CTL2_LBR_CTL_ENABLE;
}
void svm_enable_lbrv(struct kvm_vcpu *vcpu)
@@ -809,16 +809,16 @@ void svm_enable_lbrv(struct kvm_vcpu *vcpu)
static void __svm_disable_lbrv(struct kvm_vcpu *vcpu)
{
KVM_BUG_ON(sev_es_guest(vcpu->kvm), vcpu->kvm);
- to_svm(vcpu)->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK;
+ to_svm(vcpu)->vmcb->control.misc_ctl2 &= ~SVM_MISC_CTL2_LBR_CTL_ENABLE;
}
void svm_update_lbrv(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
- bool current_enable_lbrv = svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK;
+ bool current_enable_lbrv = svm->vmcb->control.misc_ctl2 & SVM_MISC_CTL2_LBR_CTL_ENABLE;
bool enable_lbrv = (svm->vmcb->save.dbgctl & DEBUGCTLMSR_LBR) ||
(is_guest_mode(vcpu) && guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) &&
- (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK));
+ (svm->nested.ctl.misc_ctl2 & SVM_MISC_CTL2_LBR_CTL_ENABLE));
if (enable_lbrv && !current_enable_lbrv)
__svm_enable_lbrv(vcpu);
@@ -979,7 +979,7 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu)
if (guest_cpuid_is_intel_compatible(vcpu)) {
svm_set_intercept(svm, INTERCEPT_VMLOAD);
svm_set_intercept(svm, INTERCEPT_VMSAVE);
- svm->vmcb->control.virt_ext &= ~VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+ svm->vmcb->control.misc_ctl2 &= ~SVM_MISC_CTL2_V_VMLOAD_VMSAVE_ENABLE;
} else {
/*
* If hardware supports Virtual VMLOAD VMSAVE then enable it
@@ -988,7 +988,7 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu)
if (vls) {
svm_clr_intercept(svm, INTERCEPT_VMLOAD);
svm_clr_intercept(svm, INTERCEPT_VMSAVE);
- svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+ svm->vmcb->control.misc_ctl2 |= SVM_MISC_CTL2_V_VMLOAD_VMSAVE_ENABLE;
}
}
}
@@ -3279,7 +3279,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
pr_err("%-20s%016llx\n", "ghcb:", control->ghcb_gpa);
pr_err("%-20s%08x\n", "event_inj:", control->event_inj);
pr_err("%-20s%08x\n", "event_inj_err:", control->event_inj_err);
- pr_err("%-20s%lld\n", "virt_ext:", control->virt_ext);
+ pr_err("%-20s%lld\n", "misc_ctl2:", control->misc_ctl2);
pr_err("%-20s%016llx\n", "next_rip:", control->next_rip);
pr_err("%-20s%016llx\n", "avic_backing_page:", control->avic_backing_page);
pr_err("%-20s%016llx\n", "avic_logical_id:", control->avic_logical_id);
@@ -4261,7 +4261,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
* VM-Exit), as running with the host's DEBUGCTL can negatively affect
* guest state and can even be fatal, e.g. due to Bus Lock Detect.
*/
- if (!(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) &&
+ if (!(svm->vmcb->control.misc_ctl2 & SVM_MISC_CTL2_LBR_CTL_ENABLE) &&
vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl)
update_debugctlmsr(svm->vmcb->save.dbgctl);
@@ -4292,7 +4292,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
- if (!(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) &&
+ if (!(svm->vmcb->control.misc_ctl2 & SVM_MISC_CTL2_LBR_CTL_ENABLE) &&
vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl)
update_debugctlmsr(vcpu->arch.host_debugctl);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 861ed9c33977b..68be3a08e3e62 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -174,7 +174,7 @@ struct vmcb_ctrl_area_cached {
u32 event_inj_err;
u64 next_rip;
u64 nested_cr3;
- u64 virt_ext;
+ u64 misc_ctl2;
u32 clean;
u64 bus_lock_rip;
union {
diff --git a/tools/testing/selftests/kvm/include/x86/svm.h b/tools/testing/selftests/kvm/include/x86/svm.h
index 5d2bcce34c019..a3f4eadffeb46 100644
--- a/tools/testing/selftests/kvm/include/x86/svm.h
+++ b/tools/testing/selftests/kvm/include/x86/svm.h
@@ -104,7 +104,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u32 event_inj;
u32 event_inj_err;
u64 nested_cr3;
- u64 virt_ext;
+ u64 misc_ctl2;
u32 clean;
u32 reserved_5;
u64 next_rip;
@@ -156,9 +156,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
#define AVIC_ENABLE_SHIFT 31
#define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT)
-#define LBR_CTL_ENABLE_MASK BIT_ULL(0)
-#define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK BIT_ULL(1)
-
#define SVM_INTERRUPT_SHADOW_MASK 1
#define SVM_IOIO_STR_SHIFT 2
@@ -179,6 +176,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
#define SVM_MISC_CTL_CTL_NP_ENABLE BIT(0)
#define SVM_MISC_CTL_SEV_ENABLE BIT(1)
+#define SVM_MISC_CTL2_LBR_CTL_ENABLE BIT_ULL(0)
+#define SVM_MISC_CTL2_V_VMLOAD_VMSAVE_ENABLE BIT_ULL(1)
+
struct __attribute__ ((__packed__)) vmcb_seg {
u16 selector;
u16 attrib;
diff --git a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c
index a343279546fd8..4a9e644b8931e 100644
--- a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c
+++ b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c
@@ -75,9 +75,9 @@ static void l1_guest_code(struct svm_test_data *svm, bool nested_lbrv)
&l2_guest_stack[L2_GUEST_STACK_SIZE]);
if (nested_lbrv)
- vmcb->control.virt_ext = LBR_CTL_ENABLE_MASK;
+ vmcb->control.misc_ctl2 = SVM_MISC_CTL2_LBR_CTL_ENABLE;
else
- vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK;
+ vmcb->control.misc_ctl2 &= ~SVM_MISC_CTL2_LBR_CTL_ENABLE;
run_guest(vmcb, svm->vmcb_gpa);
GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL);
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 09/13] KVM: nSVM: Cache all used fields from VMCB12
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (7 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 08/13] KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2 Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN Yosry Ahmed
` (3 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
Currently, most fields used from VMCB12 are cached in
svm->nested.{ctl/save}. This is mainly to avoid TOC-TOU bugs. However,
for the save area, only the fields used in the consistency checks (i.e.
nested_vmcb_check_save()) were being cached. Other fields are read
directly from guest memory in nested_vmcb02_prepare_save().
While probably benign, this still makes it possible for TOC-TOU bugs to
happen. For example, RAX, RSP, and RIP are read twice, once to store in
VMCB02, and once to store in vcpu->arch.regs. It is possible for the
guest to modify the value between both reads, potentially causing nasty
bugs.
Harden against such bugs by caching everything in svm->nested.save.
Cache all the needed fields, and keep all accesses to the VMCB12
strictly in nested_svm_vmrun() for caching and early error injection.
Following changes will further limit the access to the VMCB12 in the
nested VMRUN path.
Introduce vmcb12_is_dirty() to use with the cached control fields
instead of vmcb_is_dirty(), similar to vmcb12_is_intercept().
Opportunistically order the copies in __nested_copy_vmcb_save_to_cache()
by the order in which the fields are defined in struct vmcb_save_area.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 116 ++++++++++++++++++++++----------------
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/svm/svm.h | 27 ++++++++-
3 files changed, 93 insertions(+), 52 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index b8d65832c64de..ddcd545ec1c3c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -525,19 +525,34 @@ void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm,
static void __nested_copy_vmcb_save_to_cache(struct vmcb_save_area_cached *to,
struct vmcb_save_area *from)
{
- /*
- * Copy only fields that are validated, as we need them
- * to avoid TOC/TOU races.
- */
+ to->es = from->es;
to->cs = from->cs;
+ to->ss = from->ss;
+ to->ds = from->ds;
+ to->gdtr = from->gdtr;
+ to->idtr = from->idtr;
+
+ to->cpl = from->cpl;
to->efer = from->efer;
- to->cr0 = from->cr0;
- to->cr3 = from->cr3;
to->cr4 = from->cr4;
-
- to->dr6 = from->dr6;
+ to->cr3 = from->cr3;
+ to->cr0 = from->cr0;
to->dr7 = from->dr7;
+ to->dr6 = from->dr6;
+
+ to->rflags = from->rflags;
+ to->rip = from->rip;
+ to->rsp = from->rsp;
+
+ to->s_cet = from->s_cet;
+ to->ssp = from->ssp;
+ to->isst_addr = from->isst_addr;
+
+ to->rax = from->rax;
+ to->cr2 = from->cr2;
+
+ svm_copy_lbrs(to, from);
}
void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
@@ -673,8 +688,10 @@ void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm)
svm->nested.vmcb02.ptr->save.g_pat = svm->vmcb01.ptr->save.g_pat;
}
-static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12)
+static void nested_vmcb02_prepare_save(struct vcpu_svm *svm)
{
+ struct vmcb_ctrl_area_cached *control = &svm->nested.ctl;
+ struct vmcb_save_area_cached *save = &svm->nested.save;
bool new_vmcb12 = false;
struct vmcb *vmcb01 = svm->vmcb01.ptr;
struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
@@ -689,49 +706,49 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
svm->nested.force_msr_bitmap_recalc = true;
}
- if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_SEG))) {
- vmcb02->save.es = vmcb12->save.es;
- vmcb02->save.cs = vmcb12->save.cs;
- vmcb02->save.ss = vmcb12->save.ss;
- vmcb02->save.ds = vmcb12->save.ds;
- vmcb02->save.cpl = vmcb12->save.cpl;
+ if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_SEG))) {
+ vmcb02->save.es = save->es;
+ vmcb02->save.cs = save->cs;
+ vmcb02->save.ss = save->ss;
+ vmcb02->save.ds = save->ds;
+ vmcb02->save.cpl = save->cpl;
vmcb_mark_dirty(vmcb02, VMCB_SEG);
}
- if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DT))) {
- vmcb02->save.gdtr = vmcb12->save.gdtr;
- vmcb02->save.idtr = vmcb12->save.idtr;
+ if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_DT))) {
+ vmcb02->save.gdtr = save->gdtr;
+ vmcb02->save.idtr = save->idtr;
vmcb_mark_dirty(vmcb02, VMCB_DT);
}
if (guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) &&
- (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_CET)))) {
- vmcb02->save.s_cet = vmcb12->save.s_cet;
- vmcb02->save.isst_addr = vmcb12->save.isst_addr;
- vmcb02->save.ssp = vmcb12->save.ssp;
+ (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_CET)))) {
+ vmcb02->save.s_cet = save->s_cet;
+ vmcb02->save.isst_addr = save->isst_addr;
+ vmcb02->save.ssp = save->ssp;
vmcb_mark_dirty(vmcb02, VMCB_CET);
}
- kvm_set_rflags(vcpu, vmcb12->save.rflags | X86_EFLAGS_FIXED);
+ kvm_set_rflags(vcpu, save->rflags | X86_EFLAGS_FIXED);
svm_set_efer(vcpu, svm->nested.save.efer);
svm_set_cr0(vcpu, svm->nested.save.cr0);
svm_set_cr4(vcpu, svm->nested.save.cr4);
- svm->vcpu.arch.cr2 = vmcb12->save.cr2;
+ svm->vcpu.arch.cr2 = save->cr2;
- kvm_rax_write(vcpu, vmcb12->save.rax);
- kvm_rsp_write(vcpu, vmcb12->save.rsp);
- kvm_rip_write(vcpu, vmcb12->save.rip);
+ kvm_rax_write(vcpu, save->rax);
+ kvm_rsp_write(vcpu, save->rsp);
+ kvm_rip_write(vcpu, save->rip);
/* In case we don't even reach vcpu_run, the fields are not updated */
- vmcb02->save.rax = vmcb12->save.rax;
- vmcb02->save.rsp = vmcb12->save.rsp;
- vmcb02->save.rip = vmcb12->save.rip;
+ vmcb02->save.rax = save->rax;
+ vmcb02->save.rsp = save->rsp;
+ vmcb02->save.rip = save->rip;
/* These bits will be set properly on the first execution when new_vmc12 is true */
- if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DR))) {
+ if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_DR))) {
vmcb02->save.dr7 = svm->nested.save.dr7 | DR7_FIXED_1;
svm->vcpu.arch.dr6 = svm->nested.save.dr6 | DR6_ACTIVE_LOW;
vmcb_mark_dirty(vmcb02, VMCB_DR);
@@ -743,7 +760,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
* Reserved bits of DEBUGCTL are ignored. Be consistent with
* svm_set_msr's definition of reserved bits.
*/
- svm_copy_lbrs(&vmcb02->save, &vmcb12->save);
+ svm_copy_lbrs(&vmcb02->save, save);
vmcb_mark_dirty(vmcb02, VMCB_LBR);
vmcb02->save.dbgctl &= ~DEBUGCTL_RESERVED_BITS;
} else {
@@ -953,28 +970,29 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
}
-int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
- struct vmcb *vmcb12, bool from_vmrun)
+int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_ctrl_area_cached *control = &svm->nested.ctl;
+ struct vmcb_save_area_cached *save = &svm->nested.save;
int ret;
trace_kvm_nested_vmenter(svm->vmcb->save.rip,
vmcb12_gpa,
- vmcb12->save.rip,
- vmcb12->control.int_ctl,
- vmcb12->control.event_inj,
- vmcb12->control.misc_ctl,
- vmcb12->control.nested_cr3,
- vmcb12->save.cr3,
+ save->rip,
+ control->int_ctl,
+ control->event_inj,
+ control->misc_ctl,
+ control->nested_cr3,
+ save->cr3,
KVM_ISA_SVM);
- trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 0xffff,
- vmcb12->control.intercepts[INTERCEPT_CR] >> 16,
- vmcb12->control.intercepts[INTERCEPT_EXCEPTION],
- vmcb12->control.intercepts[INTERCEPT_WORD3],
- vmcb12->control.intercepts[INTERCEPT_WORD4],
- vmcb12->control.intercepts[INTERCEPT_WORD5]);
+ trace_kvm_nested_intercepts(control->intercepts[INTERCEPT_CR] & 0xffff,
+ control->intercepts[INTERCEPT_CR] >> 16,
+ control->intercepts[INTERCEPT_EXCEPTION],
+ control->intercepts[INTERCEPT_WORD3],
+ control->intercepts[INTERCEPT_WORD4],
+ control->intercepts[INTERCEPT_WORD5]);
svm->nested.vmcb12_gpa = vmcb12_gpa;
@@ -984,8 +1002,8 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
svm_switch_vmcb(svm, &svm->nested.vmcb02);
- nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
- nested_vmcb02_prepare_save(svm, vmcb12);
+ nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
+ nested_vmcb02_prepare_save(svm);
ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
nested_npt_enabled(svm), from_vmrun);
@@ -1074,7 +1092,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
svm->nested.nested_run_pending = 1;
- if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true))
+ if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true))
goto out_exit_err;
if (nested_svm_merge_msrpm(vcpu))
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9789f7e72ae97..2fbb0b88c6a3e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4762,7 +4762,7 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
vmcb12 = map.hva;
nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
- ret = enter_svm_guest_mode(vcpu, smram64->svm_guest_vmcb_gpa, vmcb12, false);
+ ret = enter_svm_guest_mode(vcpu, smram64->svm_guest_vmcb_gpa, false);
if (ret)
goto unmap_save;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 68be3a08e3e62..ef6bdce630dc0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -142,13 +142,32 @@ struct kvm_vmcb_info {
};
struct vmcb_save_area_cached {
+ struct vmcb_seg es;
struct vmcb_seg cs;
+ struct vmcb_seg ss;
+ struct vmcb_seg ds;
+ struct vmcb_seg gdtr;
+ struct vmcb_seg idtr;
+ u8 cpl;
u64 efer;
u64 cr4;
u64 cr3;
u64 cr0;
u64 dr7;
u64 dr6;
+ u64 rflags;
+ u64 rip;
+ u64 rsp;
+ u64 s_cet;
+ u64 ssp;
+ u64 isst_addr;
+ u64 rax;
+ u64 cr2;
+ u64 dbgctl;
+ u64 br_from;
+ u64 br_to;
+ u64 last_excp_from;
+ u64 last_excp_to;
};
struct vmcb_ctrl_area_cached {
@@ -422,6 +441,11 @@ static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit)
return !test_bit(bit, (unsigned long *)&vmcb->control.clean);
}
+static inline bool vmcb12_is_dirty(struct vmcb_ctrl_area_cached *control, int bit)
+{
+ return !test_bit(bit, (unsigned long *)&control->clean);
+}
+
static __always_inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
{
return container_of(vcpu, struct vcpu_svm, vcpu);
@@ -760,8 +784,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
int __init nested_svm_init_msrpm_merge_offsets(void);
-int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
- u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
+int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb_gpa, bool from_vmrun);
void svm_leave_nested(struct kvm_vcpu *vcpu);
void svm_free_nested(struct vcpu_svm *svm);
int svm_allocate_nested(struct vcpu_svm *svm);
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (8 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 09/13] KVM: nSVM: Cache all used fields from VMCB12 Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-12-09 16:03 ` Sean Christopherson
2025-11-10 22:29 ` [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun() Yosry Ahmed
` (2 subsequent siblings)
12 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
All accesses to the VMCB12 in the guest memory are limited to
nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
the function execution. Unmapping right after the consistency checks is
possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
used after being unmapped.
Move all accesses to the VMCB12 into a new helper,
nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
caches the needed fields, performs consistency checks, and unmaps it.
This limits the scope of the VMCB12 mapping appropriately. It also
slightly simplifies the cleanup path of nested_svm_vmrun().
nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
fail, maintaining the current behavior of skipping the instructions and
unmapping the VMCB12 (although in the opposite order).
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
1 file changed, 34 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ddcd545ec1c3c..a48668c36a191 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
return 0;
}
+static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct kvm_host_map map;
+ struct vmcb *vmcb12;
+ int ret;
+
+ ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
+ if (ret)
+ return ret;
+
+ vmcb12 = map.hva;
+
+ nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
+ nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
+
+ if (!nested_vmcb_check_save(vcpu) ||
+ !nested_vmcb_check_controls(vcpu)) {
+ vmcb12->control.exit_code = SVM_EXIT_ERR;
+ vmcb12->control.exit_code_hi = 0;
+ vmcb12->control.exit_info_1 = 0;
+ vmcb12->control.exit_info_2 = 0;
+ ret = -1;
+ }
+
+ kvm_vcpu_unmap(vcpu, &map);
+ return ret;
+}
+
int nested_svm_vmrun(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
int ret;
- struct vmcb *vmcb12;
- struct kvm_host_map map;
u64 vmcb12_gpa;
struct vmcb *vmcb01 = svm->vmcb01.ptr;
@@ -1049,8 +1076,11 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
return ret;
}
+ if (WARN_ON_ONCE(!svm->nested.initialized))
+ return -EINVAL;
+
vmcb12_gpa = svm->vmcb->save.rax;
- ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
+ ret = nested_svm_vmrun_read_vmcb12(vcpu, vmcb12_gpa);
if (ret == -EINVAL) {
kvm_inject_gp(vcpu, 0);
return 1;
@@ -1060,23 +1090,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
ret = kvm_skip_emulated_instruction(vcpu);
- vmcb12 = map.hva;
-
- if (WARN_ON_ONCE(!svm->nested.initialized))
- return -EINVAL;
-
- nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
- nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
-
- if (!nested_vmcb_check_save(vcpu) ||
- !nested_vmcb_check_controls(vcpu)) {
- vmcb12->control.exit_code = SVM_EXIT_ERR;
- vmcb12->control.exit_code_hi = 0;
- vmcb12->control.exit_info_1 = 0;
- vmcb12->control.exit_info_2 = 0;
- goto out;
- }
-
/*
* Since vmcb01 is not in use, we can use it to store some of the L1
* state.
@@ -1096,7 +1109,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
goto out_exit_err;
if (nested_svm_merge_msrpm(vcpu))
- goto out;
+ return ret;
out_exit_err:
svm->nested.nested_run_pending = 0;
@@ -1109,10 +1122,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
svm->vmcb->control.exit_info_2 = 0;
nested_svm_vmexit(svm);
-
-out:
- kvm_vcpu_unmap(vcpu, &map);
-
return ret;
}
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (9 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-12-09 16:11 ` Sean Christopherson
2025-11-10 22:29 ` [PATCH v2 12/13] KVM: nSVM: Sanitize control fields copied from VMCB12 Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 13/13] KVM: nSVM: Only copy NP_ENABLE from VMCB01's misc_ctl Yosry Ahmed
12 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
the VMRUN path, instead of making the call in nested_svm_vmrun(). This
simplifies the flow of nested_svm_vmrun() and removes all jumps to
cleanup labels.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
1 file changed, 13 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index a48668c36a191..89830380cebc5 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
nested_svm_hv_update_vm_vp_ids(vcpu);
+ if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
+ return -1;
+
return 0;
}
@@ -1105,23 +1108,18 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
svm->nested.nested_run_pending = 1;
- if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true))
- goto out_exit_err;
-
- if (nested_svm_merge_msrpm(vcpu))
- return ret;
-
-out_exit_err:
- svm->nested.nested_run_pending = 0;
- svm->nmi_l1_to_l2 = false;
- svm->soft_int_injected = false;
+ if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
+ svm->nested.nested_run_pending = 0;
+ svm->nmi_l1_to_l2 = false;
+ svm->soft_int_injected = false;
- svm->vmcb->control.exit_code = SVM_EXIT_ERR;
- svm->vmcb->control.exit_code_hi = 0;
- svm->vmcb->control.exit_info_1 = 0;
- svm->vmcb->control.exit_info_2 = 0;
+ svm->vmcb->control.exit_code = SVM_EXIT_ERR;
+ svm->vmcb->control.exit_code_hi = 0;
+ svm->vmcb->control.exit_info_1 = 0;
+ svm->vmcb->control.exit_info_2 = 0;
- nested_svm_vmexit(svm);
+ nested_svm_vmexit(svm);
+ }
return ret;
}
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 12/13] KVM: nSVM: Sanitize control fields copied from VMCB12
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (10 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun() Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-12-09 16:19 ` Sean Christopherson
2025-11-10 22:29 ` [PATCH v2 13/13] KVM: nSVM: Only copy NP_ENABLE from VMCB01's misc_ctl Yosry Ahmed
12 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
Make sure all fields used from VMCB12 in creating the VMCB02 are
sanitized, such no unhandled or reserved bits end up in the VMCB02.
The following control fields are read from VMCB12 and have bits that are
either reserved or not handled/advertised by KVM: tlb_ctl, int_ctl,
int_state, int_vector, event_inj, misc_ctl, and misc_ctl2.
The following fields do not require any extra sanitizing:
- int_ctl: bits from VMCB12 are copied bit-by-bit as needed.
- misc_ctl: only used in consistency checks (particularly NP_ENABLE).
- misc_ctl2: bits from VMCB12 are copied bit-by-bit as needed.
For the remaining fields, make sure only defined bits are copied from
VMCB12 by defining appropriate masks where needed. The only exception is
tlb_ctl, which is unused, so remove it.
Opportunisitcally move some existing definitions in svm.h around such
that they are ordered by bit position, and cleanup ignoring the lower
bits of {io/msr}pm_base_pa in __nested_copy_vmcb_control_to_cache() by
using PAGE_MASK. Also, expand the comment about the ASID being copied
only for consistency checks.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/include/asm/svm.h | 11 ++++++++---
arch/x86/kvm/svm/nested.c | 26 ++++++++++++++------------
arch/x86/kvm/svm/svm.h | 1 -
3 files changed, 22 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index a842018952d2c..44f2cfcd8d4ff 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -213,11 +213,13 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
#define V_NMI_ENABLE_SHIFT 26
#define V_NMI_ENABLE_MASK (1 << V_NMI_ENABLE_SHIFT)
+#define X2APIC_MODE_SHIFT 30
+#define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT)
+
#define AVIC_ENABLE_SHIFT 31
#define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT)
-#define X2APIC_MODE_SHIFT 30
-#define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT)
+#define SVM_INT_VECTOR_MASK (0xff)
#define SVM_INTERRUPT_SHADOW_MASK BIT_ULL(0)
#define SVM_GUEST_INTERRUPT_MASK BIT_ULL(1)
@@ -626,8 +628,11 @@ static inline void __unused_size_checks(void)
#define SVM_EVTINJ_TYPE_EXEPT (3 << SVM_EVTINJ_TYPE_SHIFT)
#define SVM_EVTINJ_TYPE_SOFT (4 << SVM_EVTINJ_TYPE_SHIFT)
-#define SVM_EVTINJ_VALID (1 << 31)
#define SVM_EVTINJ_VALID_ERR (1 << 11)
+#define SVM_EVTINJ_VALID (1 << 31)
+
+#define SVM_EVTINJ_RESERVED_BITS ~(SVM_EVTINJ_VEC_MASK | SVM_EVTINJ_TYPE_MASK | \
+ SVM_EVTINJ_VALID_ERR | SVM_EVTINJ_VALID)
#define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK
#define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 89830380cebc5..503cb7f5a4c5f 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -479,10 +479,11 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
for (i = 0; i < MAX_INTERCEPT; i++)
to->intercepts[i] = from->intercepts[i];
- to->iopm_base_pa = from->iopm_base_pa;
- to->msrpm_base_pa = from->msrpm_base_pa;
+ /* Lower bits of IOPM_BASE_PA and MSRPM_BASE_PA are ignored */
+ to->iopm_base_pa = from->iopm_base_pa & PAGE_MASK;
+ to->msrpm_base_pa = from->msrpm_base_pa & PAGE_MASK;
+
to->tsc_offset = from->tsc_offset;
- to->tlb_ctl = from->tlb_ctl;
to->int_ctl = from->int_ctl;
to->int_vector = from->int_vector;
to->int_state = from->int_state;
@@ -492,19 +493,21 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
to->exit_info_2 = from->exit_info_2;
to->exit_int_info = from->exit_int_info;
to->exit_int_info_err = from->exit_int_info_err;
- to->misc_ctl = from->misc_ctl;
+ to->misc_ctl = from->misc_ctl;
to->event_inj = from->event_inj;
to->event_inj_err = from->event_inj_err;
to->next_rip = from->next_rip;
to->nested_cr3 = from->nested_cr3;
- to->misc_ctl2 = from->misc_ctl2;
+ to->misc_ctl2 = from->misc_ctl2;
to->pause_filter_count = from->pause_filter_count;
to->pause_filter_thresh = from->pause_filter_thresh;
- /* Copy asid here because nested_vmcb_check_controls will check it. */
+ /*
+ * Copy asid here because nested_vmcb_check_controls() will check it.
+ * The ASID could be invalid, or conflict with another VM's ASID , so it
+ * should never be used directly to run L2.
+ */
to->asid = from->asid;
- to->msrpm_base_pa &= ~0x0fffULL;
- to->iopm_base_pa &= ~0x0fffULL;
#ifdef CONFIG_KVM_HYPERV
/* Hyper-V extensions (Enlightened VMCB) */
@@ -890,9 +893,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
(svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) |
(vmcb01->control.int_ctl & int_ctl_vmcb01_bits);
- vmcb02->control.int_vector = svm->nested.ctl.int_vector;
- vmcb02->control.int_state = svm->nested.ctl.int_state;
- vmcb02->control.event_inj = svm->nested.ctl.event_inj;
+ vmcb02->control.int_vector = svm->nested.ctl.int_vector & SVM_INT_VECTOR_MASK;
+ vmcb02->control.int_state = svm->nested.ctl.int_state & SVM_INTERRUPT_SHADOW_MASK;
+ vmcb02->control.event_inj = svm->nested.ctl.event_inj & ~SVM_EVTINJ_RESERVED_BITS;
vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
/*
@@ -1774,7 +1777,6 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
dst->msrpm_base_pa = from->msrpm_base_pa;
dst->tsc_offset = from->tsc_offset;
dst->asid = from->asid;
- dst->tlb_ctl = from->tlb_ctl;
dst->int_ctl = from->int_ctl;
dst->int_vector = from->int_vector;
dst->int_state = from->int_state;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ef6bdce630dc0..c8d43793aa9d6 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -178,7 +178,6 @@ struct vmcb_ctrl_area_cached {
u64 msrpm_base_pa;
u64 tsc_offset;
u32 asid;
- u8 tlb_ctl;
u32 int_ctl;
u32 int_vector;
u32 int_state;
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v2 13/13] KVM: nSVM: Only copy NP_ENABLE from VMCB01's misc_ctl
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
` (11 preceding siblings ...)
2025-11-10 22:29 ` [PATCH v2 12/13] KVM: nSVM: Sanitize control fields copied from VMCB12 Yosry Ahmed
@ 2025-11-10 22:29 ` Yosry Ahmed
2025-12-09 16:23 ` Sean Christopherson
12 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-11-10 22:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, Yosry Ahmed
The 'misc_ctl' field in VMCB02 is taken as-is from VMCB01. However, the
only bit that needs to copied is NP_ENABLE. This is a nop now because
other bits are for SEV guests, which do not support nested.
Nonetheless, this hardens against future bugs if/when other bits are
set for L1 but should not be set for L2.
Opportunistically add a comment explaining why NP_ENABLE is taken from
VMCB01 and not VMCB02.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 503cb7f5a4c5f..4e278c1f9e6b3 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -837,8 +837,16 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
V_NMI_BLOCKING_MASK);
}
- /* Copied from vmcb01. msrpm_base can be overwritten later. */
- vmcb02->control.misc_ctl = vmcb01->control.misc_ctl;
+ /*
+ * Copied from vmcb01. msrpm_base can be overwritten later.
+ *
+ * NP_ENABLE in vmcb12 is only used for consistency checks. If L1
+ * enables NPTs, KVM shadows L1's NPTs and uses those to run L2. If L1
+ * disables NPT, KVM runs L2 with the same NPTs used to run L1. For the
+ * latter, L1 runs L2 with shadow page tables that translate L2 GVAs to
+ * L1 GPAs, so the same NPTs can be used for L1 and L2.
+ */
+ vmcb02->control.misc_ctl = vmcb01->control.misc_ctl & SVM_MISC_CTL_NP_ENABLE;
vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
2025-11-10 22:29 ` [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN Yosry Ahmed
@ 2025-12-09 16:03 ` Sean Christopherson
2025-12-09 18:24 ` Yosry Ahmed
2025-12-10 23:05 ` Yosry Ahmed
0 siblings, 2 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 16:03 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> All accesses to the VMCB12 in the guest memory are limited to
> nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
> the function execution. Unmapping right after the consistency checks is
> possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
> used after being unmapped.
>
> Move all accesses to the VMCB12 into a new helper,
> nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
> caches the needed fields, performs consistency checks, and unmaps it.
> This limits the scope of the VMCB12 mapping appropriately. It also
> slightly simplifies the cleanup path of nested_svm_vmrun().
>
> nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
> fail, maintaining the current behavior of skipping the instructions and
> unmapping the VMCB12 (although in the opposite order).
>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
> 1 file changed, 34 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index ddcd545ec1c3c..a48668c36a191 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> return 0;
> }
>
> +static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
"read_vmcb12"() sounds like a generic helper to read a specific field. And if
the name is more specific, then I think we can drop the "vmrun" scoping. To
aligh with similar functions in VMX and __nested_copy_vmcb_save_to_cache(), how
about nested_svm_copy_vmcb12_to_cache()?
> +{
> + struct vcpu_svm *svm = to_svm(vcpu);
> + struct kvm_host_map map;
> + struct vmcb *vmcb12;
> + int ret;
> +
> + ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> + if (ret)
> + return ret;
> +
> + vmcb12 = map.hva;
> +
> + nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> + nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> +
> + if (!nested_vmcb_check_save(vcpu) ||
> + !nested_vmcb_check_controls(vcpu)) {
> + vmcb12->control.exit_code = SVM_EXIT_ERR;
> + vmcb12->control.exit_code_hi = 0;
> + vmcb12->control.exit_info_1 = 0;
> + vmcb12->control.exit_info_2 = 0;
> + ret = -1;
I don't love shoving the consistency checks in here. I get why you did it, but
it's very surprising to see (and/or easy to miss) these consistency checks. The
caller also ends up quite wonky:
if (ret == -EINVAL) {
kvm_inject_gp(vcpu, 0);
return 1;
} else if (ret) {
return kvm_skip_emulated_instruction(vcpu);
}
ret = kvm_skip_emulated_instruction(vcpu);
Ha! And it's buggy. __kvm_vcpu_map() can return -EFAULT if creating a host
mapping fails. Eww, and blindly using '-1' as the "failed a consistency check"
is equally cross, as it relies on kvm_vcpu_map() not returning -EPERM in a very
weird way.
Ugh, and there's also this nastiness in nested_vmcb_check_controls():
* Make sure we did not enter guest mode yet, in which case
* kvm_read_cr0() could return L2's CR0.
*/
WARN_ON_ONCE(is_guest_mode(vcpu));
return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
nested_vmcb_check_save() and nested_vmcb_check_controls() really shouldn't exist.
They just make it harder to see what KVM is checking in the "normal" flow.
Aha! And I'm fairly certain there are at least two pre-existing bugs due to KVM
doing "early" consistency checks in nested_svm_vmrun().
1. KVM doesn't clear GIF on the early #VMEXIT. In classic APM fashion, nothing
_requires_ GIF=0 before VMRUN:
It is assumed that VMM software cleared GIF some time before executing
the VMRUN instruction, to ensure an atomic state switch.
And so VMRUN with GIF=1 that hits an "early" consistency check #VMEXIT would
incorrectly leave GIF=1.
2. svm_leave_smm() is missing consistency checks on the newly loaded guest state,
because the checks aren't performed by enter_svm_guest_mode(). I don't see
anything that would prevent vmcb12 from being modified by the guest bewteen
SMI and RSM.
Moving the consistency checks into enter_svm_guest_mode() would solve all three
(four?) problems. And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
kvm_vcpu_map_readonly().
Compile tested only, but I think we can end up with delta like so:
---
arch/x86/kvm/svm/nested.c | 67 ++++++++++++---------------------------
1 file changed, 20 insertions(+), 47 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 7c86987fdaca..8a0df6c535b5 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -372,9 +372,9 @@ static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
return true;
}
-static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
- struct vmcb_ctrl_area_cached *control,
- unsigned long l1_cr0)
+static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
+ struct vmcb_ctrl_area_cached *control,
+ unsigned long l1_cr0)
{
if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
return false;
@@ -408,8 +408,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
}
/* Common checks that apply to both L1 and L2 state. */
-static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
- struct vmcb_save_area_cached *save)
+static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu,
+ struct vmcb_save_area_cached *save)
{
if (CC(!(save->efer & EFER_SVME)))
return false;
@@ -448,27 +448,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
return true;
}
-static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- struct vmcb_save_area_cached *save = &svm->nested.save;
-
- return __nested_vmcb_check_save(vcpu, save);
-}
-
-static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
-
- /*
- * Make sure we did not enter guest mode yet, in which case
- * kvm_read_cr0() could return L2's CR0.
- */
- WARN_ON_ONCE(is_guest_mode(vcpu));
- return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
-}
-
static
void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
struct vmcb_ctrl_area_cached *to,
@@ -1004,6 +983,12 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
svm_switch_vmcb(svm, &svm->nested.vmcb02);
+
+ if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
+ !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
+ svm->vmcb01.ptr->save.cr0))
+ return -EINVAL;
+
nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
nested_vmcb02_prepare_save(svm);
@@ -1025,33 +1010,24 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
return 0;
}
-static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
+static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
{
struct vcpu_svm *svm = to_svm(vcpu);
struct kvm_host_map map;
struct vmcb *vmcb12;
- int ret;
+ int r;
- ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
- if (ret)
- return ret;
+ r = kvm_vcpu_map_readonly(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
+ if (r)
+ return r;
vmcb12 = map.hva;
nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
- if (!nested_vmcb_check_save(vcpu) ||
- !nested_vmcb_check_controls(vcpu)) {
- vmcb12->control.exit_code = SVM_EXIT_ERR;
- vmcb12->control.exit_code_hi = -1u;
- vmcb12->control.exit_info_1 = 0;
- vmcb12->control.exit_info_2 = 0;
- ret = -1;
- }
-
kvm_vcpu_unmap(vcpu, &map);
- return ret;
+ return 0;
}
int nested_svm_vmrun(struct kvm_vcpu *vcpu)
@@ -1082,12 +1058,9 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
return -EINVAL;
vmcb12_gpa = svm->vmcb->save.rax;
- ret = nested_svm_vmrun_read_vmcb12(vcpu, vmcb12_gpa);
- if (ret == -EINVAL) {
+ if (nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa)) {
kvm_inject_gp(vcpu, 0);
return 1;
- } else if (ret) {
- return kvm_skip_emulated_instruction(vcpu);
}
ret = kvm_skip_emulated_instruction(vcpu);
@@ -1919,7 +1892,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
ret = -EINVAL;
__nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
/* 'save' contains L1 state saved from before VMRUN */
- if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
+ if (!nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
goto out_free;
/*
@@ -1938,7 +1911,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
if (!(save->cr0 & X86_CR0_PG) ||
!(save->cr0 & X86_CR0_PE) ||
(save->rflags & X86_EFLAGS_VM) ||
- !__nested_vmcb_check_save(vcpu, &save_cached))
+ !nested_vmcb_check_save(vcpu, &save_cached))
goto out_free;
base-commit: 01597a665f5dcf8d7cfbedf36f4e6d46d045eb4f
--
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-11-10 22:29 ` [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun() Yosry Ahmed
@ 2025-12-09 16:11 ` Sean Christopherson
2025-12-09 18:30 ` Yosry Ahmed
` (2 more replies)
0 siblings, 3 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 16:11 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> simplifies the flow of nested_svm_vmrun() and removes all jumps to
> cleanup labels.
>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> 1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index a48668c36a191..89830380cebc5 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
>
> nested_svm_hv_update_vm_vp_ids(vcpu);
>
> + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
This is silly, just do:
if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
nested_svm_merge_msrpm(vcpu)) {
svm->nested.nested_run_pending = 0;
svm->nmi_l1_to_l2 = false;
svm->soft_int_injected = false;
svm->vmcb->control.exit_code = SVM_EXIT_ERR;
svm->vmcb->control.exit_code_hi = -1u;
svm->vmcb->control.exit_info_1 = 0;
svm->vmcb->control.exit_info_2 = 0;
nested_svm_vmexit(svm);
}
> + return -1;
Please stop returning -1, use a proper -errno.
> +
> return 0;
> }
>
> @@ -1105,23 +1108,18 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
>
> svm->nested.nested_run_pending = 1;
>
> - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true))
> - goto out_exit_err;
> -
> - if (nested_svm_merge_msrpm(vcpu))
> - return ret;
> -
> -out_exit_err:
> - svm->nested.nested_run_pending = 0;
> - svm->nmi_l1_to_l2 = false;
> - svm->soft_int_injected = false;
> + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
> + svm->nested.nested_run_pending = 0;
> + svm->nmi_l1_to_l2 = false;
> + svm->soft_int_injected = false;
>
> - svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> - svm->vmcb->control.exit_code_hi = 0;
> - svm->vmcb->control.exit_info_1 = 0;
> - svm->vmcb->control.exit_info_2 = 0;
> + svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> + svm->vmcb->control.exit_code_hi = 0;
> + svm->vmcb->control.exit_info_1 = 0;
> + svm->vmcb->control.exit_info_2 = 0;
>
> - nested_svm_vmexit(svm);
> + nested_svm_vmexit(svm);
Note, there's a pre-existing bug in nested_svm_vmexit(). Lovely, and it's a
user-triggerable WARN_ON() (and not even a WARN_ON_ONCE() at that).
If nested_svm_vmexit() fails to map vmcb12, it (unbelievably stupidly) injects a
#GP and hopes for the best. Oh FFS, it also has the asinine -EINVAL "logic".
Anyways, it injects #GP (maybe), and bails early, which leaves
KVM_REQ_GET_NESTED_STATE_PAGES set. KVM will then process that on the next
vcpu_enter_guest() and trip the WARN_ON() in svm_get_nested_state_pages().
Something like this to clean up the mess:
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d4c872843a9d..96f8009a0d45 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1018,9 +1018,6 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
nested_svm_hv_update_vm_vp_ids(vcpu);
- if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
- return -1;
-
return 0;
}
@@ -1094,7 +1091,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
svm->nested.nested_run_pending = 1;
- if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
+ if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
+ nested_svm_merge_msrpm(vcpu)) {
svm->nested.nested_run_pending = 0;
svm->nmi_l1_to_l2 = false;
svm->soft_int_injected = false;
@@ -1158,24 +1156,16 @@ void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
int nested_svm_vmexit(struct vcpu_svm *svm)
{
struct kvm_vcpu *vcpu = &svm->vcpu;
+ gpa_t vmcb12_gpa = svm->nested.vmcb12_gpa;
struct vmcb *vmcb01 = svm->vmcb01.ptr;
struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
struct vmcb *vmcb12;
struct kvm_host_map map;
- int rc;
-
- rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map);
- if (rc) {
- if (rc == -EINVAL)
- kvm_inject_gp(vcpu, 0);
- return 1;
- }
vmcb12 = map.hva;
/* Exit Guest-Mode */
leave_guest_mode(vcpu);
- svm->nested.vmcb12_gpa = 0;
WARN_ON_ONCE(svm->nested.nested_run_pending);
kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
@@ -1183,6 +1173,13 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
/* in case we halted in L2 */
kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
+ svm->nested.vmcb12_gpa = 0;
+
+ if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) {
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ return 1;
+ }
+
/* Give the current vmcb to the guest */
vmcb12->save.es = vmcb02->save.es;
@@ -1973,7 +1970,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
{
- if (WARN_ON(!is_guest_mode(vcpu)))
+ if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
return true;
if (!vcpu->arch.pdptrs_from_userspace &&
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH v2 12/13] KVM: nSVM: Sanitize control fields copied from VMCB12
2025-11-10 22:29 ` [PATCH v2 12/13] KVM: nSVM: Sanitize control fields copied from VMCB12 Yosry Ahmed
@ 2025-12-09 16:19 ` Sean Christopherson
2025-12-09 18:37 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 16:19 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> Make sure all fields used from VMCB12 in creating the VMCB02 are
> sanitized, such no unhandled or reserved bits end up in the VMCB02.
>
> The following control fields are read from VMCB12 and have bits that are
> either reserved or not handled/advertised by KVM: tlb_ctl, int_ctl,
> int_state, int_vector, event_inj, misc_ctl, and misc_ctl2.
>
> The following fields do not require any extra sanitizing:
> - int_ctl: bits from VMCB12 are copied bit-by-bit as needed.
> - misc_ctl: only used in consistency checks (particularly NP_ENABLE).
> - misc_ctl2: bits from VMCB12 are copied bit-by-bit as needed.
>
> For the remaining fields, make sure only defined bits are copied from
> VMCB12 by defining appropriate masks where needed. The only exception is
> tlb_ctl, which is unused, so remove it.
>
> Opportunisitcally move some existing definitions in svm.h around such
Opportunistically. But moot point, because please put such cleanups in a separate
patch. There are so many opportunistic cleanups in this patch that I genuinely
can't see what's changing, and I don't have the patience right now to stare hard.
Cleanups will making *related* changes are totally fine, e.g. bundling the use
of PAGE_MASK in conjuction with changing the code to do "from->iopm_base_pa & ..."
instead of "to->msrpm_base_pa &= ..." is fine, but those changes have nothing to
do with the rest of the patch.
> that they are ordered by bit position, and cleanup ignoring the lower
> bits of {io/msr}pm_base_pa in __nested_copy_vmcb_control_to_cache() by
> using PAGE_MASK. Also, expand the comment about the ASID being copied
> only for consistency checks.
>
> Suggested-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/include/asm/svm.h | 11 ++++++++---
> arch/x86/kvm/svm/nested.c | 26 ++++++++++++++------------
> arch/x86/kvm/svm/svm.h | 1 -
> 3 files changed, 22 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index a842018952d2c..44f2cfcd8d4ff 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -213,11 +213,13 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
> #define V_NMI_ENABLE_SHIFT 26
> #define V_NMI_ENABLE_MASK (1 << V_NMI_ENABLE_SHIFT)
>
> +#define X2APIC_MODE_SHIFT 30
> +#define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT)
> +
> #define AVIC_ENABLE_SHIFT 31
> #define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT)
>
> -#define X2APIC_MODE_SHIFT 30
> -#define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT)
> +#define SVM_INT_VECTOR_MASK (0xff)
>
> #define SVM_INTERRUPT_SHADOW_MASK BIT_ULL(0)
> #define SVM_GUEST_INTERRUPT_MASK BIT_ULL(1)
> @@ -626,8 +628,11 @@ static inline void __unused_size_checks(void)
> #define SVM_EVTINJ_TYPE_EXEPT (3 << SVM_EVTINJ_TYPE_SHIFT)
> #define SVM_EVTINJ_TYPE_SOFT (4 << SVM_EVTINJ_TYPE_SHIFT)
>
> -#define SVM_EVTINJ_VALID (1 << 31)
> #define SVM_EVTINJ_VALID_ERR (1 << 11)
> +#define SVM_EVTINJ_VALID (1 << 31)
If you want to do cleanup, these should all use BIT()...
> +
> +#define SVM_EVTINJ_RESERVED_BITS ~(SVM_EVTINJ_VEC_MASK | SVM_EVTINJ_TYPE_MASK | \
> + SVM_EVTINJ_VALID_ERR | SVM_EVTINJ_VALID)
Because then I don't have to think hard about what exactly this will generate.
> #define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK
> #define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 89830380cebc5..503cb7f5a4c5f 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -479,10 +479,11 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> for (i = 0; i < MAX_INTERCEPT; i++)
> to->intercepts[i] = from->intercepts[i];
>
> - to->iopm_base_pa = from->iopm_base_pa;
> - to->msrpm_base_pa = from->msrpm_base_pa;
> + /* Lower bits of IOPM_BASE_PA and MSRPM_BASE_PA are ignored */
> + to->iopm_base_pa = from->iopm_base_pa & PAGE_MASK;
> + to->msrpm_base_pa = from->msrpm_base_pa & PAGE_MASK;
> +
> to->tsc_offset = from->tsc_offset;
> - to->tlb_ctl = from->tlb_ctl;
> to->int_ctl = from->int_ctl;
> to->int_vector = from->int_vector;
> to->int_state = from->int_state;
> @@ -492,19 +493,21 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> to->exit_info_2 = from->exit_info_2;
> to->exit_int_info = from->exit_int_info;
> to->exit_int_info_err = from->exit_int_info_err;
> - to->misc_ctl = from->misc_ctl;
> + to->misc_ctl = from->misc_ctl;
> to->event_inj = from->event_inj;
> to->event_inj_err = from->event_inj_err;
> to->next_rip = from->next_rip;
> to->nested_cr3 = from->nested_cr3;
> - to->misc_ctl2 = from->misc_ctl2;
> + to->misc_ctl2 = from->misc_ctl2;
> to->pause_filter_count = from->pause_filter_count;
> to->pause_filter_thresh = from->pause_filter_thresh;
>
> - /* Copy asid here because nested_vmcb_check_controls will check it. */
> + /*
> + * Copy asid here because nested_vmcb_check_controls() will check it.
> + * The ASID could be invalid, or conflict with another VM's ASID , so it
> + * should never be used directly to run L2.
> + */
> to->asid = from->asid;
> - to->msrpm_base_pa &= ~0x0fffULL;
> - to->iopm_base_pa &= ~0x0fffULL;
>
> #ifdef CONFIG_KVM_HYPERV
> /* Hyper-V extensions (Enlightened VMCB) */
> @@ -890,9 +893,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> (svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) |
> (vmcb01->control.int_ctl & int_ctl_vmcb01_bits);
>
> - vmcb02->control.int_vector = svm->nested.ctl.int_vector;
> - vmcb02->control.int_state = svm->nested.ctl.int_state;
> - vmcb02->control.event_inj = svm->nested.ctl.event_inj;
> + vmcb02->control.int_vector = svm->nested.ctl.int_vector & SVM_INT_VECTOR_MASK;
> + vmcb02->control.int_state = svm->nested.ctl.int_state & SVM_INTERRUPT_SHADOW_MASK;
> + vmcb02->control.event_inj = svm->nested.ctl.event_inj & ~SVM_EVTINJ_RESERVED_BITS;
> vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
>
> /*
> @@ -1774,7 +1777,6 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
> dst->msrpm_base_pa = from->msrpm_base_pa;
> dst->tsc_offset = from->tsc_offset;
> dst->asid = from->asid;
> - dst->tlb_ctl = from->tlb_ctl;
> dst->int_ctl = from->int_ctl;
> dst->int_vector = from->int_vector;
> dst->int_state = from->int_state;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index ef6bdce630dc0..c8d43793aa9d6 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -178,7 +178,6 @@ struct vmcb_ctrl_area_cached {
> u64 msrpm_base_pa;
> u64 tsc_offset;
> u32 asid;
> - u8 tlb_ctl;
> u32 int_ctl;
> u32 int_vector;
> u32 int_state;
> --
> 2.51.2.1041.gc1ab5b90ca-goog
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 13/13] KVM: nSVM: Only copy NP_ENABLE from VMCB01's misc_ctl
2025-11-10 22:29 ` [PATCH v2 13/13] KVM: nSVM: Only copy NP_ENABLE from VMCB01's misc_ctl Yosry Ahmed
@ 2025-12-09 16:23 ` Sean Christopherson
2025-12-09 18:38 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 16:23 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> The 'misc_ctl' field in VMCB02 is taken as-is from VMCB01. However, the
> only bit that needs to copied is NP_ENABLE.
Nit, explicitly state that all other existing bits are for SEV right away, e.g.
However, the only bit that needs to copied is NP_ENABLE, as all other known
bits in misc_ctl are related to SEV guests, and KVM doesn't support nested
virtualization for SEV guests.
> This is a nop now because other bits are for SEV guests, which do not support
> nested. Nonetheless, this hardens against future bugs if/when other bits are
> set for L1 but should not be set for L2.
>
> Opportunistically add a comment explaining why NP_ENABLE is taken from
> VMCB01 and not VMCB02.
>
> Suggested-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/kvm/svm/nested.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 503cb7f5a4c5f..4e278c1f9e6b3 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -837,8 +837,16 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> V_NMI_BLOCKING_MASK);
> }
>
> - /* Copied from vmcb01. msrpm_base can be overwritten later. */
> - vmcb02->control.misc_ctl = vmcb01->control.misc_ctl;
> + /*
> + * Copied from vmcb01. msrpm_base can be overwritten later.
> + *
> + * NP_ENABLE in vmcb12 is only used for consistency checks. If L1
> + * enables NPTs, KVM shadows L1's NPTs and uses those to run L2. If L1
> + * disables NPT, KVM runs L2 with the same NPTs used to run L1. For the
> + * latter, L1 runs L2 with shadow page tables that translate L2 GVAs to
> + * L1 GPAs, so the same NPTs can be used for L1 and L2.
> + */
> + vmcb02->control.misc_ctl = vmcb01->control.misc_ctl & SVM_MISC_CTL_NP_ENABLE;
> vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
> vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
>
> --
> 2.51.2.1041.gc1ab5b90ca-goog
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-11-10 22:29 ` [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE Yosry Ahmed
@ 2025-12-09 16:27 ` Sean Christopherson
2025-12-09 18:07 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 16:27 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> KVM currenty fails a nested VMRUN and injects VMEXIT_INVALID (aka
> SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs.
> On first glance, it seems like the check should actually be for
> guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the
> host to support NPTs but the guest CPUID to not advertise it.
>
> However, the consistency check is not architectural to begin with. The
> APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor
> that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored
> if X86_FEATURE_NPT is not available for L1. Apart from the consistency
> check, this is currently the case because NP_ENABLE is actually copied
> from VMCB01 to VMCB02, not from VMCB12.
>
> On the other hand, the APM does mention two other consistency checks for
> NP_ENABLE, both of which are missing (paraphrased):
>
> In Volume #2, 15.25.3 (24593—Rev. 3.42—March 2024):
>
> If VMRUN is executed with hCR0.PG cleared to zero and NP_ENABLE set to
> 1, VMRUN terminates with #VMEXIT(VMEXIT_INVALID)
>
> In Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024):
>
> When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the
> following conditions are considered illegal state combinations, in
> addition to those mentioned in “Canonicalization and Consistency
> Checks”:
> • Any MBZ bit of nCR3 is set.
> • Any G_PAT.PA field has an unsupported type encoding or any
> reserved field in G_PAT has a nonzero value.
This should be three patches, one each for the new consistency checks, and one
to the made-up check. Shortlogs like "Fix all the bugs" are strong hints that
a patch is doing too much.
> Replace the existing consistency check with consistency checks on
> hCR0.PG and nCR3. Only perform the consistency checks if L1 has
> X86_FEATURE_NPT and NP_ENABLE is set in VMCB12. The G_PAT consistency
> check will be addressed separately.
>
> As it is now possible for an L1 to run L2 with NP_ENABLE set but
> ignored, also check that L1 has X86_FEATURE_NPT in nested_npt_enabled().
>
> Pass L1's CR0 to __nested_vmcb_check_controls(). In
> nested_vmcb_check_controls(), L1's CR0 is available through
> kvm_read_cr0(), as vcpu->arch.cr0 is not updated to L2's CR0 until later
> through nested_vmcb02_prepare_save() -> svm_set_cr0().
>
> In svm_set_nested_state(), L1's CR0 is available in the captured save
> area, as svm_get_nested_state() captures L1's save area when running L2,
> and L1's CR0 is stashed in VMCB01 on nested VMRUN (in
> nested_svm_vmrun()).
>
> Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN")
> Cc: stable@vger.kernel.org
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/kvm/svm/nested.c | 21 ++++++++++++++++-----
> arch/x86/kvm/svm/svm.h | 3 ++-
> 2 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 74211c5c68026..87bcc5eff96e8 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -325,7 +325,8 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size)
> }
>
> static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> - struct vmcb_ctrl_area_cached *control)
> + struct vmcb_ctrl_area_cached *control,
> + unsigned long l1_cr0)
> {
> if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
> return false;
> @@ -333,8 +334,12 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> if (CC(control->asid == 0))
> return false;
>
> - if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && !npt_enabled))
> - return false;
> + if (nested_npt_enabled(to_svm(vcpu))) {
> + if (CC(!kvm_vcpu_is_legal_gpa(vcpu, control->nested_cr3)))
> + return false;
> + if (CC(!(l1_cr0 & X86_CR0_PG)))
> + return false;
> + }
>
> if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa,
> MSRPM_SIZE)))
> @@ -400,7 +405,12 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
> struct vcpu_svm *svm = to_svm(vcpu);
> struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
>
> - return __nested_vmcb_check_controls(vcpu, ctl);
> + /*
> + * Make sure we did not enter guest mode yet, in which case
No pronouns.
> + * kvm_read_cr0() could return L2's CR0.
> + */
> + WARN_ON_ONCE(is_guest_mode(vcpu));
> + return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> }
>
> static
> @@ -1831,7 +1841,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>
> ret = -EINVAL;
> __nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
> - if (!__nested_vmcb_check_controls(vcpu, &ctl_cached))
> + /* 'save' contains L1 state saved from before VMRUN */
> + if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
> goto out_free;
>
> /*
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index f6fb70ddf7272..3e805a43ffcdb 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
>
> static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> {
> - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> + return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
> + svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
I would rather rely on Kevin's patch to clear unsupported features.
> }
>
> static inline bool nested_vnmi_enabled(struct vcpu_svm *svm)
> --
> 2.51.2.1041.gc1ab5b90ca-goog
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-12-09 16:27 ` Sean Christopherson
@ 2025-12-09 18:07 ` Yosry Ahmed
2025-12-09 18:26 ` Sean Christopherson
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-09 18:07 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Tue, Dec 09, 2025 at 08:27:39AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > KVM currenty fails a nested VMRUN and injects VMEXIT_INVALID (aka
> > SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs.
> > On first glance, it seems like the check should actually be for
> > guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the
> > host to support NPTs but the guest CPUID to not advertise it.
> >
> > However, the consistency check is not architectural to begin with. The
> > APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor
> > that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored
> > if X86_FEATURE_NPT is not available for L1. Apart from the consistency
> > check, this is currently the case because NP_ENABLE is actually copied
> > from VMCB01 to VMCB02, not from VMCB12.
> >
> > On the other hand, the APM does mention two other consistency checks for
> > NP_ENABLE, both of which are missing (paraphrased):
> >
> > In Volume #2, 15.25.3 (24593—Rev. 3.42—March 2024):
> >
> > If VMRUN is executed with hCR0.PG cleared to zero and NP_ENABLE set to
> > 1, VMRUN terminates with #VMEXIT(VMEXIT_INVALID)
> >
> > In Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024):
> >
> > When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the
> > following conditions are considered illegal state combinations, in
> > addition to those mentioned in “Canonicalization and Consistency
> > Checks”:
> > • Any MBZ bit of nCR3 is set.
> > • Any G_PAT.PA field has an unsupported type encoding or any
> > reserved field in G_PAT has a nonzero value.
>
> This should be three patches, one each for the new consistency checks, and one
> to the made-up check. Shortlogs like "Fix all the bugs" are strong hints that
> a patch is doing too much.
Ack, will split it.
>
> > Replace the existing consistency check with consistency checks on
> > hCR0.PG and nCR3. Only perform the consistency checks if L1 has
> > X86_FEATURE_NPT and NP_ENABLE is set in VMCB12. The G_PAT consistency
> > check will be addressed separately.
> >
> > As it is now possible for an L1 to run L2 with NP_ENABLE set but
> > ignored, also check that L1 has X86_FEATURE_NPT in nested_npt_enabled().
> >
> > Pass L1's CR0 to __nested_vmcb_check_controls(). In
> > nested_vmcb_check_controls(), L1's CR0 is available through
> > kvm_read_cr0(), as vcpu->arch.cr0 is not updated to L2's CR0 until later
> > through nested_vmcb02_prepare_save() -> svm_set_cr0().
> >
> > In svm_set_nested_state(), L1's CR0 is available in the captured save
> > area, as svm_get_nested_state() captures L1's save area when running L2,
> > and L1's CR0 is stashed in VMCB01 on nested VMRUN (in
> > nested_svm_vmrun()).
> >
> > Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/kvm/svm/nested.c | 21 ++++++++++++++++-----
> > arch/x86/kvm/svm/svm.h | 3 ++-
> > 2 files changed, 18 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 74211c5c68026..87bcc5eff96e8 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -325,7 +325,8 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size)
> > }
> >
> > static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> > - struct vmcb_ctrl_area_cached *control)
> > + struct vmcb_ctrl_area_cached *control,
> > + unsigned long l1_cr0)
> > {
> > if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
> > return false;
> > @@ -333,8 +334,12 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> > if (CC(control->asid == 0))
> > return false;
> >
> > - if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && !npt_enabled))
> > - return false;
> > + if (nested_npt_enabled(to_svm(vcpu))) {
> > + if (CC(!kvm_vcpu_is_legal_gpa(vcpu, control->nested_cr3)))
> > + return false;
> > + if (CC(!(l1_cr0 & X86_CR0_PG)))
> > + return false;
> > + }
> >
> > if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa,
> > MSRPM_SIZE)))
> > @@ -400,7 +405,12 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
> > struct vcpu_svm *svm = to_svm(vcpu);
> > struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
> >
> > - return __nested_vmcb_check_controls(vcpu, ctl);
> > + /*
> > + * Make sure we did not enter guest mode yet, in which case
>
> No pronouns.
I thought that rule was for commit logs. There are plenty of 'we's in
the KVM x86 code (and all x86 code for that matter) :P
>
> > + * kvm_read_cr0() could return L2's CR0.
> > + */
> > + WARN_ON_ONCE(is_guest_mode(vcpu));
> > + return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> > }
> >
> > static
> > @@ -1831,7 +1841,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> >
> > ret = -EINVAL;
> > __nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
> > - if (!__nested_vmcb_check_controls(vcpu, &ctl_cached))
> > + /* 'save' contains L1 state saved from before VMRUN */
> > + if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
> > goto out_free;
> >
> > /*
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index f6fb70ddf7272..3e805a43ffcdb 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
> >
> > static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > {
> > - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > + return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
> > + svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
>
> I would rather rely on Kevin's patch to clear unsupported features.
Not sure how Kevin's patch is relevant here, could you please clarify?
This is to account for removing the artifical consistency check. It's
now possible to have SVM_NESTED_CTL_NP_ENABLE set and ignored, so we
need to also check that the guest can actually use NPTs here.
>
> > }
> >
> > static inline bool nested_vnmi_enabled(struct vcpu_svm *svm)
> > --
> > 2.51.2.1041.gc1ab5b90ca-goog
> >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
2025-12-09 16:03 ` Sean Christopherson
@ 2025-12-09 18:24 ` Yosry Ahmed
2025-12-09 18:49 ` Sean Christopherson
2025-12-10 23:05 ` Yosry Ahmed
1 sibling, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-09 18:24 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > All accesses to the VMCB12 in the guest memory are limited to
> > nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
> > the function execution. Unmapping right after the consistency checks is
> > possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
> > used after being unmapped.
> >
> > Move all accesses to the VMCB12 into a new helper,
> > nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
> > caches the needed fields, performs consistency checks, and unmaps it.
> > This limits the scope of the VMCB12 mapping appropriately. It also
> > slightly simplifies the cleanup path of nested_svm_vmrun().
> >
> > nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
> > fail, maintaining the current behavior of skipping the instructions and
> > unmapping the VMCB12 (although in the opposite order).
> >
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
> > 1 file changed, 34 insertions(+), 25 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index ddcd545ec1c3c..a48668c36a191 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > return 0;
> > }
> >
> > +static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
>
> "read_vmcb12"() sounds like a generic helper to read a specific field. And if
> the name is more specific, then I think we can drop the "vmrun" scoping. To
> aligh with similar functions in VMX and __nested_copy_vmcb_save_to_cache(), how
> about nested_svm_copy_vmcb12_to_cache()?
nested_svm_copy_vmcb12_to_cache() sounds good.
>
> > +{
> > + struct vcpu_svm *svm = to_svm(vcpu);
> > + struct kvm_host_map map;
> > + struct vmcb *vmcb12;
> > + int ret;
> > +
> > + ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> > + if (ret)
> > + return ret;
> > +
> > + vmcb12 = map.hva;
> > +
> > + nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> > + nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> > +
> > + if (!nested_vmcb_check_save(vcpu) ||
> > + !nested_vmcb_check_controls(vcpu)) {
> > + vmcb12->control.exit_code = SVM_EXIT_ERR;
> > + vmcb12->control.exit_code_hi = 0;
> > + vmcb12->control.exit_info_1 = 0;
> > + vmcb12->control.exit_info_2 = 0;
> > + ret = -1;
>
> I don't love shoving the consistency checks in here. I get why you did it, but
> it's very surprising to see (and/or easy to miss) these consistency checks. The
> caller also ends up quite wonky:
>
> if (ret == -EINVAL) {
> kvm_inject_gp(vcpu, 0);
> return 1;
> } else if (ret) {
> return kvm_skip_emulated_instruction(vcpu);
> }
>
> ret = kvm_skip_emulated_instruction(vcpu);
>
> Ha! And it's buggy. __kvm_vcpu_map() can return -EFAULT if creating a host
> mapping fails. Eww, and blindly using '-1' as the "failed a consistency check"
> is equally cross, as it relies on kvm_vcpu_map() not returning -EPERM in a very
> weird way.
I was trying to maintain the pre-existing behavior as much as possible,
and I think the existing code will handle -EFAULT from kvm_vcpu_map() in
the same way (skip the instruction and return).
I guess I shouldn't have assumed maintaining the existing behavior is
the right thing to do.
It's honestly really hard to detangle the return values of different KVM
functions and what they mean. "return 1" here is not very meaningful,
and the return code from kvm_skip_emulated_instruction() is not
documented, so I don't really know what we're supposed to return here in
what cases. The error code are usually not interpreted until a few
layers higher up the callstack.
I agree that returning -1 is not great, but in this case the caller (and
the existing code) only cared about differentiating -EINVAL from others,
and I found other KVM functions returning -1 so I thought I shouldn't
overthink the return value. But yeah, you're right, no more -1's :)
Hence, I preferred to leave things as-is as much as possible.
>
> Ugh, and there's also this nastiness in nested_vmcb_check_controls():
>
> * Make sure we did not enter guest mode yet, in which case
> * kvm_read_cr0() could return L2's CR0.
> */
> WARN_ON_ONCE(is_guest_mode(vcpu));
> return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
>
> nested_vmcb_check_save() and nested_vmcb_check_controls() really shouldn't exist.
> They just make it harder to see what KVM is checking in the "normal" flow.
>
> Aha! And I'm fairly certain there are at least two pre-existing bugs due to KVM
> doing "early" consistency checks in nested_svm_vmrun().
>
> 1. KVM doesn't clear GIF on the early #VMEXIT. In classic APM fashion, nothing
> _requires_ GIF=0 before VMRUN:
>
> It is assumed that VMM software cleared GIF some time before executing
> the VMRUN instruction, to ensure an atomic state switch.
>
> And so VMRUN with GIF=1 that hits an "early" consistency check #VMEXIT would
> incorrectly leave GIF=1.
>
>
> 2. svm_leave_smm() is missing consistency checks on the newly loaded guest state,
> because the checks aren't performed by enter_svm_guest_mode(). I don't see
> anything that would prevent vmcb12 from being modified by the guest bewteen
> SMI and RSM.
>
> Moving the consistency checks into enter_svm_guest_mode() would solve all three
> (four?) problems. And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
> kvm_vcpu_map_readonly().
Anyway, I will move the consistency checks as you suggested, I agree
that is much better. Thanks!
>
> Compile tested only, but I think we can end up with delta like so:
>
> ---
> arch/x86/kvm/svm/nested.c | 67 ++++++++++++---------------------------
> 1 file changed, 20 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 7c86987fdaca..8a0df6c535b5 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -372,9 +372,9 @@ static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
> return true;
> }
>
> -static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> - struct vmcb_ctrl_area_cached *control,
> - unsigned long l1_cr0)
> +static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> + struct vmcb_ctrl_area_cached *control,
> + unsigned long l1_cr0)
> {
> if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
> return false;
> @@ -408,8 +408,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> }
>
> /* Common checks that apply to both L1 and L2 state. */
> -static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> - struct vmcb_save_area_cached *save)
> +static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> + struct vmcb_save_area_cached *save)
> {
> if (CC(!(save->efer & EFER_SVME)))
> return false;
> @@ -448,27 +448,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> return true;
> }
>
> -static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu)
> -{
> - struct vcpu_svm *svm = to_svm(vcpu);
> - struct vmcb_save_area_cached *save = &svm->nested.save;
> -
> - return __nested_vmcb_check_save(vcpu, save);
> -}
> -
> -static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
> -{
> - struct vcpu_svm *svm = to_svm(vcpu);
> - struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
> -
> - /*
> - * Make sure we did not enter guest mode yet, in which case
> - * kvm_read_cr0() could return L2's CR0.
> - */
> - WARN_ON_ONCE(is_guest_mode(vcpu));
> - return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> -}
> -
> static
> void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> struct vmcb_ctrl_area_cached *to,
> @@ -1004,6 +983,12 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>
> svm_switch_vmcb(svm, &svm->nested.vmcb02);
> +
> + if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
> + !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
> + svm->vmcb01.ptr->save.cr0))
> + return -EINVAL;
> +
> nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
> nested_vmcb02_prepare_save(svm);
>
> @@ -1025,33 +1010,24 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> return 0;
> }
>
> -static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
> +static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> struct kvm_host_map map;
> struct vmcb *vmcb12;
> - int ret;
> + int r;
>
> - ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> - if (ret)
> - return ret;
> + r = kvm_vcpu_map_readonly(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> + if (r)
> + return r;
>
> vmcb12 = map.hva;
>
> nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
>
> - if (!nested_vmcb_check_save(vcpu) ||
> - !nested_vmcb_check_controls(vcpu)) {
> - vmcb12->control.exit_code = SVM_EXIT_ERR;
> - vmcb12->control.exit_code_hi = -1u;
> - vmcb12->control.exit_info_1 = 0;
> - vmcb12->control.exit_info_2 = 0;
> - ret = -1;
> - }
> -
> kvm_vcpu_unmap(vcpu, &map);
> - return ret;
> + return 0;
> }
>
> int nested_svm_vmrun(struct kvm_vcpu *vcpu)
> @@ -1082,12 +1058,9 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
> return -EINVAL;
>
> vmcb12_gpa = svm->vmcb->save.rax;
> - ret = nested_svm_vmrun_read_vmcb12(vcpu, vmcb12_gpa);
> - if (ret == -EINVAL) {
> + if (nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa)) {
> kvm_inject_gp(vcpu, 0);
> return 1;
> - } else if (ret) {
> - return kvm_skip_emulated_instruction(vcpu);
> }
>
> ret = kvm_skip_emulated_instruction(vcpu);
> @@ -1919,7 +1892,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> ret = -EINVAL;
> __nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
> /* 'save' contains L1 state saved from before VMRUN */
> - if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
> + if (!nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
> goto out_free;
>
> /*
> @@ -1938,7 +1911,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> if (!(save->cr0 & X86_CR0_PG) ||
> !(save->cr0 & X86_CR0_PE) ||
> (save->rflags & X86_EFLAGS_VM) ||
> - !__nested_vmcb_check_save(vcpu, &save_cached))
> + !nested_vmcb_check_save(vcpu, &save_cached))
> goto out_free;
>
>
>
> base-commit: 01597a665f5dcf8d7cfbedf36f4e6d46d045eb4f
> --
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-12-09 18:07 ` Yosry Ahmed
@ 2025-12-09 18:26 ` Sean Christopherson
2025-12-09 18:35 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 18:26 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 08:27:39AM -0800, Sean Christopherson wrote:
> > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > @@ -400,7 +405,12 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
> > > struct vcpu_svm *svm = to_svm(vcpu);
> > > struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
> > >
> > > - return __nested_vmcb_check_controls(vcpu, ctl);
> > > + /*
> > > + * Make sure we did not enter guest mode yet, in which case
> >
> > No pronouns.
>
> I thought that rule was for commit logs.
In KVM x86, it's a rule everywhere. Pronouns often add ambiguity, and it's much
easier to have a hard "no pronouns" rule than to try and enforce an inherently
subjective "is this ambiguous or not" rule.
> There are plenty of 'we's in the KVM x86 code (and all x86 code for that
> matter) :P
Ya, KVM is an 18+ year old code base. There's also a ton of bare "unsigned" usage,
and other things that are frowned upon and/or flagged by checkpatch. I'm all for
cleaning things up when touching the code, but I'm staunchly against "tree"-wide
cleanups just to make checkpatch happy, and so there's quite a few historical
violations of the current "rules".
> > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > index f6fb70ddf7272..3e805a43ffcdb 100644
> > > --- a/arch/x86/kvm/svm/svm.h
> > > +++ b/arch/x86/kvm/svm/svm.h
> > > @@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
> > >
> > > static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > > {
> > > - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > + return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
> > > + svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> >
> > I would rather rely on Kevin's patch to clear unsupported features.
>
> Not sure how Kevin's patch is relevant here, could you please clarify?
Doh, Kevin's patch only touches intercepts. What I was trying to say is that I
would rather sanitize the snapshot (the approach Kevin's patch takes with the
intercepts), as opposed to guarding the accessor. That way we can't have bugs
where KVM checks svm->nested.ctl.nested_ctl directly and bypasses the caps check.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-12-09 16:11 ` Sean Christopherson
@ 2025-12-09 18:30 ` Yosry Ahmed
2025-12-09 19:09 ` Sean Christopherson
2025-12-11 19:25 ` Yosry Ahmed
2025-12-15 18:34 ` Yosry Ahmed
2 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-09 18:30 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025 at 08:11:41AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> > the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> > simplifies the flow of nested_svm_vmrun() and removes all jumps to
> > cleanup labels.
> >
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> > 1 file changed, 13 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index a48668c36a191..89830380cebc5 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> >
> > nested_svm_hv_update_vm_vp_ids(vcpu);
> >
> > + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
>
> This is silly, just do:
Ack. Any objections to just dropping from_vmrun and moving
kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES) to svm_leave_smm()? I
like the consistency of completely relying on from_vmrun or not at all
:P
>
> if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
> nested_svm_merge_msrpm(vcpu)) {
> svm->nested.nested_run_pending = 0;
> svm->nmi_l1_to_l2 = false;
> svm->soft_int_injected = false;
>
> svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> svm->vmcb->control.exit_code_hi = -1u;
> svm->vmcb->control.exit_info_1 = 0;
> svm->vmcb->control.exit_info_2 = 0;
>
> nested_svm_vmexit(svm);
> }
>
> > + return -1;
>
> Please stop returning -1, use a proper -errno.
Ack.
>
> > +
> > return 0;
> > }
> >
> > @@ -1105,23 +1108,18 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
> >
> > svm->nested.nested_run_pending = 1;
> >
> > - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true))
> > - goto out_exit_err;
> > -
> > - if (nested_svm_merge_msrpm(vcpu))
> > - return ret;
> > -
> > -out_exit_err:
> > - svm->nested.nested_run_pending = 0;
> > - svm->nmi_l1_to_l2 = false;
> > - svm->soft_int_injected = false;
> > + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
> > + svm->nested.nested_run_pending = 0;
> > + svm->nmi_l1_to_l2 = false;
> > + svm->soft_int_injected = false;
> >
> > - svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > - svm->vmcb->control.exit_code_hi = 0;
> > - svm->vmcb->control.exit_info_1 = 0;
> > - svm->vmcb->control.exit_info_2 = 0;
> > + svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > + svm->vmcb->control.exit_code_hi = 0;
> > + svm->vmcb->control.exit_info_1 = 0;
> > + svm->vmcb->control.exit_info_2 = 0;
> >
> > - nested_svm_vmexit(svm);
> > + nested_svm_vmexit(svm);
>
> Note, there's a pre-existing bug in nested_svm_vmexit(). Lovely, and it's a
> user-triggerable WARN_ON() (and not even a WARN_ON_ONCE() at that).
>
> If nested_svm_vmexit() fails to map vmcb12, it (unbelievably stupidly) injects a
> #GP and hopes for the best. Oh FFS, it also has the asinine -EINVAL "logic".
> Anyways, it injects #GP (maybe), and bails early, which leaves
> KVM_REQ_GET_NESTED_STATE_PAGES set. KVM will then process that on the next
> vcpu_enter_guest() and trip the WARN_ON() in svm_get_nested_state_pages().
>
> Something like this to clean up the mess:
Will add a patch, thanks for catching this!
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index d4c872843a9d..96f8009a0d45 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1018,9 +1018,6 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
>
> nested_svm_hv_update_vm_vp_ids(vcpu);
>
> - if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
> - return -1;
> -
> return 0;
> }
>
> @@ -1094,7 +1091,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
>
> svm->nested.nested_run_pending = 1;
>
> - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
> + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
> + nested_svm_merge_msrpm(vcpu)) {
> svm->nested.nested_run_pending = 0;
> svm->nmi_l1_to_l2 = false;
> svm->soft_int_injected = false;
> @@ -1158,24 +1156,16 @@ void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
> int nested_svm_vmexit(struct vcpu_svm *svm)
> {
> struct kvm_vcpu *vcpu = &svm->vcpu;
> + gpa_t vmcb12_gpa = svm->nested.vmcb12_gpa;
> struct vmcb *vmcb01 = svm->vmcb01.ptr;
> struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
> struct vmcb *vmcb12;
> struct kvm_host_map map;
> - int rc;
> -
> - rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map);
> - if (rc) {
> - if (rc == -EINVAL)
> - kvm_inject_gp(vcpu, 0);
> - return 1;
> - }
>
> vmcb12 = map.hva;
>
> /* Exit Guest-Mode */
> leave_guest_mode(vcpu);
> - svm->nested.vmcb12_gpa = 0;
> WARN_ON_ONCE(svm->nested.nested_run_pending);
>
> kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> @@ -1183,6 +1173,13 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
> /* in case we halted in L2 */
> kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
>
> + svm->nested.vmcb12_gpa = 0;
> +
> + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) {
> + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> + return 1;
> + }
> +
> /* Give the current vmcb to the guest */
>
> vmcb12->save.es = vmcb02->save.es;
> @@ -1973,7 +1970,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>
> static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
> {
> - if (WARN_ON(!is_guest_mode(vcpu)))
> + if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
> return true;
>
> if (!vcpu->arch.pdptrs_from_userspace &&
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-12-09 18:26 ` Sean Christopherson
@ 2025-12-09 18:35 ` Yosry Ahmed
2025-12-09 18:42 ` Sean Christopherson
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-09 18:35 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Tue, Dec 09, 2025 at 10:26:31AM -0800, Sean Christopherson wrote:
> On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > On Tue, Dec 09, 2025 at 08:27:39AM -0800, Sean Christopherson wrote:
> > > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > > @@ -400,7 +405,12 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
> > > > struct vcpu_svm *svm = to_svm(vcpu);
> > > > struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
> > > >
> > > > - return __nested_vmcb_check_controls(vcpu, ctl);
> > > > + /*
> > > > + * Make sure we did not enter guest mode yet, in which case
> > >
> > > No pronouns.
> >
> > I thought that rule was for commit logs.
>
> In KVM x86, it's a rule everywhere. Pronouns often add ambiguity, and it's much
> easier to have a hard "no pronouns" rule than to try and enforce an inherently
> subjective "is this ambiguous or not" rule.
>
> > There are plenty of 'we's in the KVM x86 code (and all x86 code for that
> > matter) :P
>
> Ya, KVM is an 18+ year old code base. There's also a ton of bare "unsigned" usage,
> and other things that are frowned upon and/or flagged by checkpatch. I'm all for
> cleaning things up when touching the code, but I'm staunchly against "tree"-wide
> cleanups just to make checkpatch happy, and so there's quite a few historical
> violations of the current "rules".
Ack.
>
> > > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > > index f6fb70ddf7272..3e805a43ffcdb 100644
> > > > --- a/arch/x86/kvm/svm/svm.h
> > > > +++ b/arch/x86/kvm/svm/svm.h
> > > > @@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
> > > >
> > > > static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > > > {
> > > > - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > > + return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
> > > > + svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > >
> > > I would rather rely on Kevin's patch to clear unsupported features.
> >
> > Not sure how Kevin's patch is relevant here, could you please clarify?
>
> Doh, Kevin's patch only touches intercepts. What I was trying to say is that I
> would rather sanitize the snapshot (the approach Kevin's patch takes with the
> intercepts), as opposed to guarding the accessor. That way we can't have bugs
> where KVM checks svm->nested.ctl.nested_ctl directly and bypasses the caps check.
I see, so clear SVM_NESTED_CTL_NP_ENABLE in
__nested_copy_vmcb_control_to_cache() instead.
If I drop the guest_cpu_cap_has() check here I will want to leave a
comment so that it's obvious to readers that SVM_NESTED_CTL_NP_ENABLE is
sanitized elsewhere if the guest cannot use NPTs. Alternatively, I can
just keep the guest_cpu_cap_has() check as documentation and a second
line of defense.
Any preferences?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 12/13] KVM: nSVM: Sanitize control fields copied from VMCB12
2025-12-09 16:19 ` Sean Christopherson
@ 2025-12-09 18:37 ` Yosry Ahmed
0 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-09 18:37 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025 at 08:19:02AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > Make sure all fields used from VMCB12 in creating the VMCB02 are
> > sanitized, such no unhandled or reserved bits end up in the VMCB02.
> >
> > The following control fields are read from VMCB12 and have bits that are
> > either reserved or not handled/advertised by KVM: tlb_ctl, int_ctl,
> > int_state, int_vector, event_inj, misc_ctl, and misc_ctl2.
> >
> > The following fields do not require any extra sanitizing:
> > - int_ctl: bits from VMCB12 are copied bit-by-bit as needed.
> > - misc_ctl: only used in consistency checks (particularly NP_ENABLE).
> > - misc_ctl2: bits from VMCB12 are copied bit-by-bit as needed.
> >
> > For the remaining fields, make sure only defined bits are copied from
> > VMCB12 by defining appropriate masks where needed. The only exception is
> > tlb_ctl, which is unused, so remove it.
> >
> > Opportunisitcally move some existing definitions in svm.h around such
>
> Opportunistically. But moot point, because please put such cleanups in a separate
> patch. There are so many opportunistic cleanups in this patch that I genuinely
> can't see what's changing, and I don't have the patience right now to stare hard.
Fair enough.
>
> Cleanups will making *related* changes are totally fine, e.g. bundling the use
> of PAGE_MASK in conjuction with changing the code to do "from->iopm_base_pa & ..."
> instead of "to->msrpm_base_pa &= ..." is fine, but those changes have nothing to
> do with the rest of the patch.
>
> > that they are ordered by bit position, and cleanup ignoring the lower
> > bits of {io/msr}pm_base_pa in __nested_copy_vmcb_control_to_cache() by
> > using PAGE_MASK. Also, expand the comment about the ASID being copied
> > only for consistency checks.
> >
> > Suggested-by: Jim Mattson <jmattson@google.com>
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/include/asm/svm.h | 11 ++++++++---
> > arch/x86/kvm/svm/nested.c | 26 ++++++++++++++------------
> > arch/x86/kvm/svm/svm.h | 1 -
> > 3 files changed, 22 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> > index a842018952d2c..44f2cfcd8d4ff 100644
> > --- a/arch/x86/include/asm/svm.h
> > +++ b/arch/x86/include/asm/svm.h
> > @@ -213,11 +213,13 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
> > #define V_NMI_ENABLE_SHIFT 26
> > #define V_NMI_ENABLE_MASK (1 << V_NMI_ENABLE_SHIFT)
> >
> > +#define X2APIC_MODE_SHIFT 30
> > +#define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT)
> > +
> > #define AVIC_ENABLE_SHIFT 31
> > #define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT)
> >
> > -#define X2APIC_MODE_SHIFT 30
> > -#define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT)
> > +#define SVM_INT_VECTOR_MASK (0xff)
> >
> > #define SVM_INTERRUPT_SHADOW_MASK BIT_ULL(0)
> > #define SVM_GUEST_INTERRUPT_MASK BIT_ULL(1)
> > @@ -626,8 +628,11 @@ static inline void __unused_size_checks(void)
> > #define SVM_EVTINJ_TYPE_EXEPT (3 << SVM_EVTINJ_TYPE_SHIFT)
> > #define SVM_EVTINJ_TYPE_SOFT (4 << SVM_EVTINJ_TYPE_SHIFT)
> >
> > -#define SVM_EVTINJ_VALID (1 << 31)
> > #define SVM_EVTINJ_VALID_ERR (1 << 11)
> > +#define SVM_EVTINJ_VALID (1 << 31)
>
> If you want to do cleanup, these should all use BIT()...
I can do that.
>
> > +
> > +#define SVM_EVTINJ_RESERVED_BITS ~(SVM_EVTINJ_VEC_MASK | SVM_EVTINJ_TYPE_MASK | \
> > + SVM_EVTINJ_VALID_ERR | SVM_EVTINJ_VALID)
>
> Because then I don't have to think hard about what exactly this will generate.
Ack.
>
> > #define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK
> > #define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 89830380cebc5..503cb7f5a4c5f 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -479,10 +479,11 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> > for (i = 0; i < MAX_INTERCEPT; i++)
> > to->intercepts[i] = from->intercepts[i];
> >
> > - to->iopm_base_pa = from->iopm_base_pa;
> > - to->msrpm_base_pa = from->msrpm_base_pa;
> > + /* Lower bits of IOPM_BASE_PA and MSRPM_BASE_PA are ignored */
> > + to->iopm_base_pa = from->iopm_base_pa & PAGE_MASK;
> > + to->msrpm_base_pa = from->msrpm_base_pa & PAGE_MASK;
> > +
> > to->tsc_offset = from->tsc_offset;
> > - to->tlb_ctl = from->tlb_ctl;
> > to->int_ctl = from->int_ctl;
> > to->int_vector = from->int_vector;
> > to->int_state = from->int_state;
> > @@ -492,19 +493,21 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> > to->exit_info_2 = from->exit_info_2;
> > to->exit_int_info = from->exit_int_info;
> > to->exit_int_info_err = from->exit_int_info_err;
> > - to->misc_ctl = from->misc_ctl;
> > + to->misc_ctl = from->misc_ctl;
> > to->event_inj = from->event_inj;
> > to->event_inj_err = from->event_inj_err;
> > to->next_rip = from->next_rip;
> > to->nested_cr3 = from->nested_cr3;
> > - to->misc_ctl2 = from->misc_ctl2;
> > + to->misc_ctl2 = from->misc_ctl2;
> > to->pause_filter_count = from->pause_filter_count;
> > to->pause_filter_thresh = from->pause_filter_thresh;
> >
> > - /* Copy asid here because nested_vmcb_check_controls will check it. */
> > + /*
> > + * Copy asid here because nested_vmcb_check_controls() will check it.
> > + * The ASID could be invalid, or conflict with another VM's ASID , so it
> > + * should never be used directly to run L2.
> > + */
> > to->asid = from->asid;
> > - to->msrpm_base_pa &= ~0x0fffULL;
> > - to->iopm_base_pa &= ~0x0fffULL;
> >
> > #ifdef CONFIG_KVM_HYPERV
> > /* Hyper-V extensions (Enlightened VMCB) */
> > @@ -890,9 +893,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> > (svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) |
> > (vmcb01->control.int_ctl & int_ctl_vmcb01_bits);
> >
> > - vmcb02->control.int_vector = svm->nested.ctl.int_vector;
> > - vmcb02->control.int_state = svm->nested.ctl.int_state;
> > - vmcb02->control.event_inj = svm->nested.ctl.event_inj;
> > + vmcb02->control.int_vector = svm->nested.ctl.int_vector & SVM_INT_VECTOR_MASK;
> > + vmcb02->control.int_state = svm->nested.ctl.int_state & SVM_INTERRUPT_SHADOW_MASK;
> > + vmcb02->control.event_inj = svm->nested.ctl.event_inj & ~SVM_EVTINJ_RESERVED_BITS;
> > vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
> >
> > /*
> > @@ -1774,7 +1777,6 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
> > dst->msrpm_base_pa = from->msrpm_base_pa;
> > dst->tsc_offset = from->tsc_offset;
> > dst->asid = from->asid;
> > - dst->tlb_ctl = from->tlb_ctl;
> > dst->int_ctl = from->int_ctl;
> > dst->int_vector = from->int_vector;
> > dst->int_state = from->int_state;
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index ef6bdce630dc0..c8d43793aa9d6 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -178,7 +178,6 @@ struct vmcb_ctrl_area_cached {
> > u64 msrpm_base_pa;
> > u64 tsc_offset;
> > u32 asid;
> > - u8 tlb_ctl;
> > u32 int_ctl;
> > u32 int_vector;
> > u32 int_state;
> > --
> > 2.51.2.1041.gc1ab5b90ca-goog
> >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 13/13] KVM: nSVM: Only copy NP_ENABLE from VMCB01's misc_ctl
2025-12-09 16:23 ` Sean Christopherson
@ 2025-12-09 18:38 ` Yosry Ahmed
0 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-09 18:38 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025 at 08:23:28AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > The 'misc_ctl' field in VMCB02 is taken as-is from VMCB01. However, the
> > only bit that needs to copied is NP_ENABLE.
>
> Nit, explicitly state that all other existing bits are for SEV right away, e.g.
>
> However, the only bit that needs to copied is NP_ENABLE, as all other known
> bits in misc_ctl are related to SEV guests, and KVM doesn't support nested
> virtualization for SEV guests.
Ack.
>
> > This is a nop now because other bits are for SEV guests, which do not support
> > nested. Nonetheless, this hardens against future bugs if/when other bits are
> > set for L1 but should not be set for L2.
> >
> > Opportunistically add a comment explaining why NP_ENABLE is taken from
> > VMCB01 and not VMCB02.
> >
> > Suggested-by: Jim Mattson <jmattson@google.com>
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/kvm/svm/nested.c | 12 ++++++++++--
> > 1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 503cb7f5a4c5f..4e278c1f9e6b3 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -837,8 +837,16 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> > V_NMI_BLOCKING_MASK);
> > }
> >
> > - /* Copied from vmcb01. msrpm_base can be overwritten later. */
> > - vmcb02->control.misc_ctl = vmcb01->control.misc_ctl;
> > + /*
> > + * Copied from vmcb01. msrpm_base can be overwritten later.
> > + *
> > + * NP_ENABLE in vmcb12 is only used for consistency checks. If L1
> > + * enables NPTs, KVM shadows L1's NPTs and uses those to run L2. If L1
> > + * disables NPT, KVM runs L2 with the same NPTs used to run L1. For the
> > + * latter, L1 runs L2 with shadow page tables that translate L2 GVAs to
> > + * L1 GPAs, so the same NPTs can be used for L1 and L2.
> > + */
> > + vmcb02->control.misc_ctl = vmcb01->control.misc_ctl & SVM_MISC_CTL_NP_ENABLE;
> > vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
> > vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
> >
> > --
> > 2.51.2.1041.gc1ab5b90ca-goog
> >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-12-09 18:35 ` Yosry Ahmed
@ 2025-12-09 18:42 ` Sean Christopherson
2025-12-09 20:02 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 18:42 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 10:26:31AM -0800, Sean Christopherson wrote:
> > On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > > > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > > > index f6fb70ddf7272..3e805a43ffcdb 100644
> > > > > --- a/arch/x86/kvm/svm/svm.h
> > > > > +++ b/arch/x86/kvm/svm/svm.h
> > > > > @@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
> > > > >
> > > > > static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > > > > {
> > > > > - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > > > + return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
> > > > > + svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > >
> > > > I would rather rely on Kevin's patch to clear unsupported features.
> > >
> > > Not sure how Kevin's patch is relevant here, could you please clarify?
> >
> > Doh, Kevin's patch only touches intercepts. What I was trying to say is that I
> > would rather sanitize the snapshot (the approach Kevin's patch takes with the
> > intercepts), as opposed to guarding the accessor. That way we can't have bugs
> > where KVM checks svm->nested.ctl.nested_ctl directly and bypasses the caps check.
>
> I see, so clear SVM_NESTED_CTL_NP_ENABLE in
> __nested_copy_vmcb_control_to_cache() instead.
>
> If I drop the guest_cpu_cap_has() check here I will want to leave a
> comment so that it's obvious to readers that SVM_NESTED_CTL_NP_ENABLE is
> sanitized elsewhere if the guest cannot use NPTs. Alternatively, I can
> just keep the guest_cpu_cap_has() check as documentation and a second
> line of defense.
>
> Any preferences?
Honestly, do nothing. I want to solidify sanitizing the cache as standard behavior,
at which point adding a comment implies that nested_npt_enabled() is somehow special,
i.e. that it _doesn't_ follow the standard.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
2025-12-09 18:24 ` Yosry Ahmed
@ 2025-12-09 18:49 ` Sean Christopherson
0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 18:49 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > + nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> > > + nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> > > +
> > > + if (!nested_vmcb_check_save(vcpu) ||
> > > + !nested_vmcb_check_controls(vcpu)) {
> > > + vmcb12->control.exit_code = SVM_EXIT_ERR;
> > > + vmcb12->control.exit_code_hi = 0;
> > > + vmcb12->control.exit_info_1 = 0;
> > > + vmcb12->control.exit_info_2 = 0;
> > > + ret = -1;
> >
> > I don't love shoving the consistency checks in here. I get why you did it, but
> > it's very surprising to see (and/or easy to miss) these consistency checks. The
> > caller also ends up quite wonky:
> >
> > if (ret == -EINVAL) {
> > kvm_inject_gp(vcpu, 0);
> > return 1;
> > } else if (ret) {
> > return kvm_skip_emulated_instruction(vcpu);
> > }
> >
> > ret = kvm_skip_emulated_instruction(vcpu);
> >
> > Ha! And it's buggy. __kvm_vcpu_map() can return -EFAULT if creating a host
> > mapping fails. Eww, and blindly using '-1' as the "failed a consistency check"
> > is equally cross, as it relies on kvm_vcpu_map() not returning -EPERM in a very
> > weird way.
>
> I was trying to maintain the pre-existing behavior as much as possible,
> and I think the existing code will handle -EFAULT from kvm_vcpu_map() in
> the same way (skip the instruction and return).
>
> I guess I shouldn't have assumed maintaining the existing behavior is
> the right thing to do.
Maintaining existing behavior is absolutely the right thing to do when moving
code around. It's just that sometimes touching code uncovers pre-existing issues,
as is the case here.
> It's honestly really hard to detangle the return values of different KVM
> functions and what they mean. "return 1" here is not very meaningful,
> and the return code from kvm_skip_emulated_instruction() is not
> documented, so I don't really know what we're supposed to return here in
> what cases. The error code are usually not interpreted until a few
> layers higher up the callstack.
LOL, welcome to KVM x86. This has been a complaint since before I started working
on KVM. We're finally getting traction on that mess, but it's a _huge_ mess to
sort out.
https://lore.kernel.org/all/20251205074537.17072-1-jgross@suse.com
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-12-09 18:30 ` Yosry Ahmed
@ 2025-12-09 19:09 ` Sean Christopherson
2025-12-10 16:16 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-12-09 19:09 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 08:11:41AM -0800, Sean Christopherson wrote:
> > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> > > the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> > > simplifies the flow of nested_svm_vmrun() and removes all jumps to
> > > cleanup labels.
> > >
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > > arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> > > 1 file changed, 13 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index a48668c36a191..89830380cebc5 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > >
> > > nested_svm_hv_update_vm_vp_ids(vcpu);
> > >
> > > + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
> >
> > This is silly, just do:
>
> Ack. Any objections to just dropping from_vmrun and moving
> kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES) to svm_leave_smm()? I
> like the consistency of completely relying on from_vmrun or not at all
Zero objections. When I was initially going through this, I actually thought you
were _adding_ the flag and was going to yell at you :-)
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-12-09 18:42 ` Sean Christopherson
@ 2025-12-09 20:02 ` Yosry Ahmed
2025-12-12 18:32 ` Sean Christopherson
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-09 20:02 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Tue, Dec 09, 2025 at 10:42:21AM -0800, Sean Christopherson wrote:
> On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > On Tue, Dec 09, 2025 at 10:26:31AM -0800, Sean Christopherson wrote:
> > > On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > > > > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > > > > index f6fb70ddf7272..3e805a43ffcdb 100644
> > > > > > --- a/arch/x86/kvm/svm/svm.h
> > > > > > +++ b/arch/x86/kvm/svm/svm.h
> > > > > > @@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
> > > > > >
> > > > > > static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > > > > > {
> > > > > > - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > > > > + return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
> > > > > > + svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > > >
> > > > > I would rather rely on Kevin's patch to clear unsupported features.
> > > >
> > > > Not sure how Kevin's patch is relevant here, could you please clarify?
> > >
> > > Doh, Kevin's patch only touches intercepts. What I was trying to say is that I
> > > would rather sanitize the snapshot (the approach Kevin's patch takes with the
> > > intercepts), as opposed to guarding the accessor. That way we can't have bugs
> > > where KVM checks svm->nested.ctl.nested_ctl directly and bypasses the caps check.
> >
> > I see, so clear SVM_NESTED_CTL_NP_ENABLE in
> > __nested_copy_vmcb_control_to_cache() instead.
> >
> > If I drop the guest_cpu_cap_has() check here I will want to leave a
> > comment so that it's obvious to readers that SVM_NESTED_CTL_NP_ENABLE is
> > sanitized elsewhere if the guest cannot use NPTs. Alternatively, I can
> > just keep the guest_cpu_cap_has() check as documentation and a second
> > line of defense.
> >
> > Any preferences?
>
> Honestly, do nothing. I want to solidify sanitizing the cache as standard behavior,
> at which point adding a comment implies that nested_npt_enabled() is somehow special,
> i.e. that it _doesn't_ follow the standard.
Does this apply to patch 12 as well? In that patch I int_vector,
int_state, and event_inj when copying them to VMCB02 in
nested_vmcb02_prepare_control(). Mainly because
nested_vmcb02_prepare_control() already kinda filters what to copy from
VMCB12 (e.g. int_ctl), so it seemed like a better fit.
Do I keep that as-is, or do you prefer that I also sanitize these fields
when copying to the cache in nested_copy_vmcb_control_to_cache()?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-12-09 19:09 ` Sean Christopherson
@ 2025-12-10 16:16 ` Yosry Ahmed
2025-12-12 23:23 ` Sean Christopherson
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-10 16:16 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025 at 11:09:26AM -0800, Sean Christopherson wrote:
> On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > On Tue, Dec 09, 2025 at 08:11:41AM -0800, Sean Christopherson wrote:
> > > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > > Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> > > > the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> > > > simplifies the flow of nested_svm_vmrun() and removes all jumps to
> > > > cleanup labels.
> > > >
> > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > ---
> > > > arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> > > > 1 file changed, 13 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > > index a48668c36a191..89830380cebc5 100644
> > > > --- a/arch/x86/kvm/svm/nested.c
> > > > +++ b/arch/x86/kvm/svm/nested.c
> > > > @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > > >
> > > > nested_svm_hv_update_vm_vp_ids(vcpu);
> > > >
> > > > + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
> > >
> > > This is silly, just do:
> >
> > Ack. Any objections to just dropping from_vmrun and moving
> > kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES) to svm_leave_smm()? I
> > like the consistency of completely relying on from_vmrun or not at all
>
> Zero objections. When I was initially going through this, I actually thought you
> were _adding_ the flag and was going to yell at you :-)
Ugh from_vmrun is also plumbed into nested_svm_load_cr3() as
reload_pdptrs. Apparently we shouldn't do that in the call path from
svm_leave_smm()? Anyway, seems like it'll be non-trivial to detangle (at
least for me, I have 0 understanding of SMM), so I will leave it as-is.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
2025-12-09 16:03 ` Sean Christopherson
2025-12-09 18:24 ` Yosry Ahmed
@ 2025-12-10 23:05 ` Yosry Ahmed
2025-12-11 0:55 ` Yosry Ahmed
1 sibling, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-10 23:05 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > All accesses to the VMCB12 in the guest memory are limited to
> > nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
> > the function execution. Unmapping right after the consistency checks is
> > possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
> > used after being unmapped.
> >
> > Move all accesses to the VMCB12 into a new helper,
> > nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
> > caches the needed fields, performs consistency checks, and unmaps it.
> > This limits the scope of the VMCB12 mapping appropriately. It also
> > slightly simplifies the cleanup path of nested_svm_vmrun().
> >
> > nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
> > fail, maintaining the current behavior of skipping the instructions and
> > unmapping the VMCB12 (although in the opposite order).
> >
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
> > 1 file changed, 34 insertions(+), 25 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index ddcd545ec1c3c..a48668c36a191 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > return 0;
> > }
> >
> > +static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
>
> "read_vmcb12"() sounds like a generic helper to read a specific field. And if
> the name is more specific, then I think we can drop the "vmrun" scoping. To
> aligh with similar functions in VMX and __nested_copy_vmcb_save_to_cache(), how
> about nested_svm_copy_vmcb12_to_cache()?
>
> > +{
> > + struct vcpu_svm *svm = to_svm(vcpu);
> > + struct kvm_host_map map;
> > + struct vmcb *vmcb12;
> > + int ret;
> > +
> > + ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> > + if (ret)
> > + return ret;
> > +
> > + vmcb12 = map.hva;
> > +
> > + nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> > + nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> > +
> > + if (!nested_vmcb_check_save(vcpu) ||
> > + !nested_vmcb_check_controls(vcpu)) {
> > + vmcb12->control.exit_code = SVM_EXIT_ERR;
> > + vmcb12->control.exit_code_hi = 0;
> > + vmcb12->control.exit_info_1 = 0;
> > + vmcb12->control.exit_info_2 = 0;
> > + ret = -1;
>
> I don't love shoving the consistency checks in here. I get why you did it, but
> it's very surprising to see (and/or easy to miss) these consistency checks. The
> caller also ends up quite wonky:
>
> if (ret == -EINVAL) {
> kvm_inject_gp(vcpu, 0);
> return 1;
> } else if (ret) {
> return kvm_skip_emulated_instruction(vcpu);
> }
>
> ret = kvm_skip_emulated_instruction(vcpu);
>
> Ha! And it's buggy. __kvm_vcpu_map() can return -EFAULT if creating a host
> mapping fails. Eww, and blindly using '-1' as the "failed a consistency check"
> is equally cross, as it relies on kvm_vcpu_map() not returning -EPERM in a very
> weird way.
>
> Ugh, and there's also this nastiness in nested_vmcb_check_controls():
>
> * Make sure we did not enter guest mode yet, in which case
> * kvm_read_cr0() could return L2's CR0.
> */
> WARN_ON_ONCE(is_guest_mode(vcpu));
> return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
>
> nested_vmcb_check_save() and nested_vmcb_check_controls() really shouldn't exist.
> They just make it harder to see what KVM is checking in the "normal" flow.
>
> Aha! And I'm fairly certain there are at least two pre-existing bugs due to KVM
> doing "early" consistency checks in nested_svm_vmrun().
>
> 1. KVM doesn't clear GIF on the early #VMEXIT. In classic APM fashion, nothing
> _requires_ GIF=0 before VMRUN:
>
> It is assumed that VMM software cleared GIF some time before executing
> the VMRUN instruction, to ensure an atomic state switch.
>
> And so VMRUN with GIF=1 that hits an "early" consistency check #VMEXIT would
> incorrectly leave GIF=1.
>
>
> 2. svm_leave_smm() is missing consistency checks on the newly loaded guest state,
> because the checks aren't performed by enter_svm_guest_mode(). I don't see
> anything that would prevent vmcb12 from being modified by the guest bewteen
> SMI and RSM.
>
> Moving the consistency checks into enter_svm_guest_mode() would solve all three
> (four?) problems. And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
> kvm_vcpu_map_readonly().
>
> Compile tested only, but I think we can end up with delta like so:
>
> ---
> arch/x86/kvm/svm/nested.c | 67 ++++++++++++---------------------------
> 1 file changed, 20 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 7c86987fdaca..8a0df6c535b5 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -372,9 +372,9 @@ static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
> return true;
> }
>
> -static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> - struct vmcb_ctrl_area_cached *control,
> - unsigned long l1_cr0)
> +static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> + struct vmcb_ctrl_area_cached *control,
> + unsigned long l1_cr0)
> {
> if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
> return false;
> @@ -408,8 +408,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> }
>
> /* Common checks that apply to both L1 and L2 state. */
> -static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> - struct vmcb_save_area_cached *save)
> +static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> + struct vmcb_save_area_cached *save)
> {
> if (CC(!(save->efer & EFER_SVME)))
> return false;
> @@ -448,27 +448,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> return true;
> }
>
> -static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu)
> -{
> - struct vcpu_svm *svm = to_svm(vcpu);
> - struct vmcb_save_area_cached *save = &svm->nested.save;
> -
> - return __nested_vmcb_check_save(vcpu, save);
> -}
> -
> -static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
> -{
> - struct vcpu_svm *svm = to_svm(vcpu);
> - struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
> -
> - /*
> - * Make sure we did not enter guest mode yet, in which case
> - * kvm_read_cr0() could return L2's CR0.
> - */
> - WARN_ON_ONCE(is_guest_mode(vcpu));
> - return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> -}
> -
> static
> void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> struct vmcb_ctrl_area_cached *to,
> @@ -1004,6 +983,12 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>
> svm_switch_vmcb(svm, &svm->nested.vmcb02);
> +
> + if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
> + !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
> + svm->vmcb01.ptr->save.cr0))
> + return -EINVAL;
> +
> nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
> nested_vmcb02_prepare_save(svm);
>
Unfortunately this doesn't work, it breaks the newly introduced
nested_invalid_cr3_test. The problem is that we bail before we fully
initialize VMCB02, then nested_svm_vmrun() calls nested_svm_vmexit(),
which restores state from VMCB02 to VMCB12.
The test first tries to run L2 with a messed up CR3, which fails but
corrupts VMCB12 due to the above, then the second nested entry is
screwed.
There are two fixes, the easy one is just move the consistency checks
after nested_vmcb02_prepare_control() and nested_vmcb02_prepare_save()
(like the existing failure mode of nested_svm_load_cr3()). This works,
but the code doesn't make a lot of sense because we use VMCB12 to create
VMCB02 and THEN check that VMCB12 is valid.
The alternative is unfortunately a lot more involved. We only do a
partial restore or a "fast #VMEXIT" for failed VMRUNs. We'd need to:
1) Move nested_svm_load_cr3() above nested_vmcb02_prepare_control(),
which needs moving nested_svm_init_mmu_context() out of
nested_vmcb02_prepare_control() to remain before
nested_svm_load_cr3().
This makes sure a failed nested VMRUN always needs a "fast #VMEXIT"
2) Figure out which parts of nested_svm_vmexit() are needed in the
failed VMRUN case. We need to at least switch the VMCB, propagate the
error code, and do some cleanups. We can split this out into the
"fast #VMEXIT" path, and use it for failed VMRUNs.
Let me know which way you prefer.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
2025-12-10 23:05 ` Yosry Ahmed
@ 2025-12-11 0:55 ` Yosry Ahmed
2025-12-12 23:30 ` Sean Christopherson
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-11 0:55 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Wed, Dec 10, 2025 at 11:05:44PM +0000, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > All accesses to the VMCB12 in the guest memory are limited to
> > > nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
> > > the function execution. Unmapping right after the consistency checks is
> > > possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
> > > used after being unmapped.
> > >
> > > Move all accesses to the VMCB12 into a new helper,
> > > nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
> > > caches the needed fields, performs consistency checks, and unmaps it.
> > > This limits the scope of the VMCB12 mapping appropriately. It also
> > > slightly simplifies the cleanup path of nested_svm_vmrun().
> > >
> > > nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
> > > fail, maintaining the current behavior of skipping the instructions and
> > > unmapping the VMCB12 (although in the opposite order).
> > >
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > > arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
> > > 1 file changed, 34 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index ddcd545ec1c3c..a48668c36a191 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > > return 0;
> > > }
> > >
> > > +static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
> >
> > "read_vmcb12"() sounds like a generic helper to read a specific field. And if
> > the name is more specific, then I think we can drop the "vmrun" scoping. To
> > aligh with similar functions in VMX and __nested_copy_vmcb_save_to_cache(), how
> > about nested_svm_copy_vmcb12_to_cache()?
> >
> > > +{
> > > + struct vcpu_svm *svm = to_svm(vcpu);
> > > + struct kvm_host_map map;
> > > + struct vmcb *vmcb12;
> > > + int ret;
> > > +
> > > + ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + vmcb12 = map.hva;
> > > +
> > > + nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> > > + nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> > > +
> > > + if (!nested_vmcb_check_save(vcpu) ||
> > > + !nested_vmcb_check_controls(vcpu)) {
> > > + vmcb12->control.exit_code = SVM_EXIT_ERR;
> > > + vmcb12->control.exit_code_hi = 0;
> > > + vmcb12->control.exit_info_1 = 0;
> > > + vmcb12->control.exit_info_2 = 0;
> > > + ret = -1;
> >
> > I don't love shoving the consistency checks in here. I get why you did it, but
> > it's very surprising to see (and/or easy to miss) these consistency checks. The
> > caller also ends up quite wonky:
> >
> > if (ret == -EINVAL) {
> > kvm_inject_gp(vcpu, 0);
> > return 1;
> > } else if (ret) {
> > return kvm_skip_emulated_instruction(vcpu);
> > }
> >
> > ret = kvm_skip_emulated_instruction(vcpu);
> >
> > Ha! And it's buggy. __kvm_vcpu_map() can return -EFAULT if creating a host
> > mapping fails. Eww, and blindly using '-1' as the "failed a consistency check"
> > is equally cross, as it relies on kvm_vcpu_map() not returning -EPERM in a very
> > weird way.
> >
> > Ugh, and there's also this nastiness in nested_vmcb_check_controls():
> >
> > * Make sure we did not enter guest mode yet, in which case
> > * kvm_read_cr0() could return L2's CR0.
> > */
> > WARN_ON_ONCE(is_guest_mode(vcpu));
> > return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> >
> > nested_vmcb_check_save() and nested_vmcb_check_controls() really shouldn't exist.
> > They just make it harder to see what KVM is checking in the "normal" flow.
> >
> > Aha! And I'm fairly certain there are at least two pre-existing bugs due to KVM
> > doing "early" consistency checks in nested_svm_vmrun().
> >
> > 1. KVM doesn't clear GIF on the early #VMEXIT. In classic APM fashion, nothing
> > _requires_ GIF=0 before VMRUN:
> >
> > It is assumed that VMM software cleared GIF some time before executing
> > the VMRUN instruction, to ensure an atomic state switch.
> >
> > And so VMRUN with GIF=1 that hits an "early" consistency check #VMEXIT would
> > incorrectly leave GIF=1.
> >
> >
> > 2. svm_leave_smm() is missing consistency checks on the newly loaded guest state,
> > because the checks aren't performed by enter_svm_guest_mode(). I don't see
> > anything that would prevent vmcb12 from being modified by the guest bewteen
> > SMI and RSM.
> >
> > Moving the consistency checks into enter_svm_guest_mode() would solve all three
> > (four?) problems. And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
> > kvm_vcpu_map_readonly().
> >
> > Compile tested only, but I think we can end up with delta like so:
> >
> > ---
> > arch/x86/kvm/svm/nested.c | 67 ++++++++++++---------------------------
> > 1 file changed, 20 insertions(+), 47 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 7c86987fdaca..8a0df6c535b5 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -372,9 +372,9 @@ static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
> > return true;
> > }
> >
> > -static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> > - struct vmcb_ctrl_area_cached *control,
> > - unsigned long l1_cr0)
> > +static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> > + struct vmcb_ctrl_area_cached *control,
> > + unsigned long l1_cr0)
> > {
> > if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
> > return false;
> > @@ -408,8 +408,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> > }
> >
> > /* Common checks that apply to both L1 and L2 state. */
> > -static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> > - struct vmcb_save_area_cached *save)
> > +static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> > + struct vmcb_save_area_cached *save)
> > {
> > if (CC(!(save->efer & EFER_SVME)))
> > return false;
> > @@ -448,27 +448,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> > return true;
> > }
> >
> > -static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu)
> > -{
> > - struct vcpu_svm *svm = to_svm(vcpu);
> > - struct vmcb_save_area_cached *save = &svm->nested.save;
> > -
> > - return __nested_vmcb_check_save(vcpu, save);
> > -}
> > -
> > -static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
> > -{
> > - struct vcpu_svm *svm = to_svm(vcpu);
> > - struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
> > -
> > - /*
> > - * Make sure we did not enter guest mode yet, in which case
> > - * kvm_read_cr0() could return L2's CR0.
> > - */
> > - WARN_ON_ONCE(is_guest_mode(vcpu));
> > - return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> > -}
> > -
> > static
> > void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> > struct vmcb_ctrl_area_cached *to,
> > @@ -1004,6 +983,12 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
> >
> > svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > +
> > + if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
> > + !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
> > + svm->vmcb01.ptr->save.cr0))
> > + return -EINVAL;
> > +
> > nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
> > nested_vmcb02_prepare_save(svm);
> >
>
> Unfortunately this doesn't work, it breaks the newly introduced
> nested_invalid_cr3_test. The problem is that we bail before we fully
> initialize VMCB02, then nested_svm_vmrun() calls nested_svm_vmexit(),
> which restores state from VMCB02 to VMCB12.
>
> The test first tries to run L2 with a messed up CR3, which fails but
> corrupts VMCB12 due to the above, then the second nested entry is
> screwed.
>
> There are two fixes, the easy one is just move the consistency checks
> after nested_vmcb02_prepare_control() and nested_vmcb02_prepare_save()
> (like the existing failure mode of nested_svm_load_cr3()). This works,
> but the code doesn't make a lot of sense because we use VMCB12 to create
> VMCB02 and THEN check that VMCB12 is valid.
>
> The alternative is unfortunately a lot more involved. We only do a
> partial restore or a "fast #VMEXIT" for failed VMRUNs. We'd need to:
>
> 1) Move nested_svm_load_cr3() above nested_vmcb02_prepare_control(),
> which needs moving nested_svm_init_mmu_context() out of
> nested_vmcb02_prepare_control() to remain before
> nested_svm_load_cr3().
>
> This makes sure a failed nested VMRUN always needs a "fast #VMEXIT"
>
> 2) Figure out which parts of nested_svm_vmexit() are needed in the
> failed VMRUN case. We need to at least switch the VMCB, propagate the
> error code, and do some cleanups. We can split this out into the
> "fast #VMEXIT" path, and use it for failed VMRUNs.
>
> Let me know which way you prefer.
I think I prefer (2), the code looks cleaner and I like having a
separate code path for VMRUN failures. Unless there are objections, I
will do that in the next version.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-12-09 16:11 ` Sean Christopherson
2025-12-09 18:30 ` Yosry Ahmed
@ 2025-12-11 19:25 ` Yosry Ahmed
2025-12-11 20:13 ` Yosry Ahmed
2025-12-15 18:34 ` Yosry Ahmed
2 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-11 19:25 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025 at 08:11:41AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> > the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> > simplifies the flow of nested_svm_vmrun() and removes all jumps to
> > cleanup labels.
> >
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> > 1 file changed, 13 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index a48668c36a191..89830380cebc5 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> >
> > nested_svm_hv_update_vm_vp_ids(vcpu);
> >
> > + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
>
> This is silly, just do:
>
> if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
> nested_svm_merge_msrpm(vcpu)) {
> svm->nested.nested_run_pending = 0;
> svm->nmi_l1_to_l2 = false;
> svm->soft_int_injected = false;
>
> svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> svm->vmcb->control.exit_code_hi = -1u;
> svm->vmcb->control.exit_info_1 = 0;
> svm->vmcb->control.exit_info_2 = 0;
>
> nested_svm_vmexit(svm);
> }
Actually, if we go with the approach of making all VMRUN failures
happen before preparing the VMCB02 (as discussed in the other thread),
then we will want to call nested_svm_merge_msrpm() from within
enter_svm_guest_mode().
Otherwise, we either have a separate failure path for
nested_svm_merge_msrpm(), or we make all VMRUN failures happen after
preparing the VMCB02 and handled by nested_svm_vmexit().
I like having a separate exit path for VMRUN failures, and it makes more
sense to do the consistency checks on VMCB12 before preparing VMCB02.
But I understand if you prefer to keep things simple and move all
failures after VMCB02.
I already have it implemented with the separate VMRUN failure path, but
I don't wanna spam you with another series if you prefer it the other
way.
>
> > + return -1;
>
> Please stop returning -1, use a proper -errno.
>
> > +
> > return 0;
> > }
> >
> > @@ -1105,23 +1108,18 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
> >
> > svm->nested.nested_run_pending = 1;
> >
> > - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true))
> > - goto out_exit_err;
> > -
> > - if (nested_svm_merge_msrpm(vcpu))
> > - return ret;
> > -
> > -out_exit_err:
> > - svm->nested.nested_run_pending = 0;
> > - svm->nmi_l1_to_l2 = false;
> > - svm->soft_int_injected = false;
> > + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
> > + svm->nested.nested_run_pending = 0;
> > + svm->nmi_l1_to_l2 = false;
> > + svm->soft_int_injected = false;
> >
> > - svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > - svm->vmcb->control.exit_code_hi = 0;
> > - svm->vmcb->control.exit_info_1 = 0;
> > - svm->vmcb->control.exit_info_2 = 0;
> > + svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > + svm->vmcb->control.exit_code_hi = 0;
> > + svm->vmcb->control.exit_info_1 = 0;
> > + svm->vmcb->control.exit_info_2 = 0;
> >
> > - nested_svm_vmexit(svm);
> > + nested_svm_vmexit(svm);
>
> Note, there's a pre-existing bug in nested_svm_vmexit(). Lovely, and it's a
> user-triggerable WARN_ON() (and not even a WARN_ON_ONCE() at that).
>
> If nested_svm_vmexit() fails to map vmcb12, it (unbelievably stupidly) injects a
> #GP and hopes for the best. Oh FFS, it also has the asinine -EINVAL "logic".
> Anyways, it injects #GP (maybe), and bails early, which leaves
> KVM_REQ_GET_NESTED_STATE_PAGES set. KVM will then process that on the next
> vcpu_enter_guest() and trip the WARN_ON() in svm_get_nested_state_pages().
>
> Something like this to clean up the mess:
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index d4c872843a9d..96f8009a0d45 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1018,9 +1018,6 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
>
> nested_svm_hv_update_vm_vp_ids(vcpu);
>
> - if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
> - return -1;
> -
> return 0;
> }
>
> @@ -1094,7 +1091,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
>
> svm->nested.nested_run_pending = 1;
>
> - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
> + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
> + nested_svm_merge_msrpm(vcpu)) {
> svm->nested.nested_run_pending = 0;
> svm->nmi_l1_to_l2 = false;
> svm->soft_int_injected = false;
> @@ -1158,24 +1156,16 @@ void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
> int nested_svm_vmexit(struct vcpu_svm *svm)
> {
> struct kvm_vcpu *vcpu = &svm->vcpu;
> + gpa_t vmcb12_gpa = svm->nested.vmcb12_gpa;
> struct vmcb *vmcb01 = svm->vmcb01.ptr;
> struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
> struct vmcb *vmcb12;
> struct kvm_host_map map;
> - int rc;
> -
> - rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map);
> - if (rc) {
> - if (rc == -EINVAL)
> - kvm_inject_gp(vcpu, 0);
> - return 1;
> - }
>
> vmcb12 = map.hva;
>
> /* Exit Guest-Mode */
> leave_guest_mode(vcpu);
> - svm->nested.vmcb12_gpa = 0;
> WARN_ON_ONCE(svm->nested.nested_run_pending);
>
> kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> @@ -1183,6 +1173,13 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
> /* in case we halted in L2 */
> kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
>
> + svm->nested.vmcb12_gpa = 0;
> +
> + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) {
> + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> + return 1;
> + }
> +
> /* Give the current vmcb to the guest */
>
> vmcb12->save.es = vmcb02->save.es;
> @@ -1973,7 +1970,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>
> static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
> {
> - if (WARN_ON(!is_guest_mode(vcpu)))
> + if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
> return true;
>
> if (!vcpu->arch.pdptrs_from_userspace &&
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-12-11 19:25 ` Yosry Ahmed
@ 2025-12-11 20:13 ` Yosry Ahmed
2025-12-13 0:01 ` Sean Christopherson
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-11 20:13 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Thu, Dec 11, 2025 at 07:25:21PM +0000, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 08:11:41AM -0800, Sean Christopherson wrote:
> > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> > > the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> > > simplifies the flow of nested_svm_vmrun() and removes all jumps to
> > > cleanup labels.
> > >
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > > arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> > > 1 file changed, 13 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index a48668c36a191..89830380cebc5 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > >
> > > nested_svm_hv_update_vm_vp_ids(vcpu);
> > >
> > > + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
> >
> > This is silly, just do:
> >
> > if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
> > nested_svm_merge_msrpm(vcpu)) {
> > svm->nested.nested_run_pending = 0;
> > svm->nmi_l1_to_l2 = false;
> > svm->soft_int_injected = false;
> >
> > svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > svm->vmcb->control.exit_code_hi = -1u;
> > svm->vmcb->control.exit_info_1 = 0;
> > svm->vmcb->control.exit_info_2 = 0;
> >
> > nested_svm_vmexit(svm);
> > }
>
> Actually, if we go with the approach of making all VMRUN failures
> happen before preparing the VMCB02 (as discussed in the other thread),
> then we will want to call nested_svm_merge_msrpm() from within
> enter_svm_guest_mode().
We can also just call nested_svm_merge_msrpm() before
enter_svm_guest_mode(), which seems to work. Part of me still prefers to
keep all the potential failures bundled together in
enter_svm_guest_mode() though.
>
> Otherwise, we either have a separate failure path for
> nested_svm_merge_msrpm(), or we make all VMRUN failures happen after
> preparing the VMCB02 and handled by nested_svm_vmexit().
>
> I like having a separate exit path for VMRUN failures, and it makes more
> sense to do the consistency checks on VMCB12 before preparing VMCB02.
> But I understand if you prefer to keep things simple and move all
> failures after VMCB02.
>
> I already have it implemented with the separate VMRUN failure path, but
> I don't wanna spam you with another series if you prefer it the other
> way.
>
> >
> > > + return -1;
> >
> > Please stop returning -1, use a proper -errno.
> >
> > > +
> > > return 0;
> > > }
> > >
> > > @@ -1105,23 +1108,18 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
> > >
> > > svm->nested.nested_run_pending = 1;
> > >
> > > - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true))
> > > - goto out_exit_err;
> > > -
> > > - if (nested_svm_merge_msrpm(vcpu))
> > > - return ret;
> > > -
> > > -out_exit_err:
> > > - svm->nested.nested_run_pending = 0;
> > > - svm->nmi_l1_to_l2 = false;
> > > - svm->soft_int_injected = false;
> > > + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
> > > + svm->nested.nested_run_pending = 0;
> > > + svm->nmi_l1_to_l2 = false;
> > > + svm->soft_int_injected = false;
> > >
> > > - svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > > - svm->vmcb->control.exit_code_hi = 0;
> > > - svm->vmcb->control.exit_info_1 = 0;
> > > - svm->vmcb->control.exit_info_2 = 0;
> > > + svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > > + svm->vmcb->control.exit_code_hi = 0;
> > > + svm->vmcb->control.exit_info_1 = 0;
> > > + svm->vmcb->control.exit_info_2 = 0;
> > >
> > > - nested_svm_vmexit(svm);
> > > + nested_svm_vmexit(svm);
> >
> > Note, there's a pre-existing bug in nested_svm_vmexit(). Lovely, and it's a
> > user-triggerable WARN_ON() (and not even a WARN_ON_ONCE() at that).
> >
> > If nested_svm_vmexit() fails to map vmcb12, it (unbelievably stupidly) injects a
> > #GP and hopes for the best. Oh FFS, it also has the asinine -EINVAL "logic".
> > Anyways, it injects #GP (maybe), and bails early, which leaves
> > KVM_REQ_GET_NESTED_STATE_PAGES set. KVM will then process that on the next
> > vcpu_enter_guest() and trip the WARN_ON() in svm_get_nested_state_pages().
> >
> > Something like this to clean up the mess:
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index d4c872843a9d..96f8009a0d45 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1018,9 +1018,6 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> >
> > nested_svm_hv_update_vm_vp_ids(vcpu);
> >
> > - if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
> > - return -1;
> > -
> > return 0;
> > }
> >
> > @@ -1094,7 +1091,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
> >
> > svm->nested.nested_run_pending = 1;
> >
> > - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
> > + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
> > + nested_svm_merge_msrpm(vcpu)) {
> > svm->nested.nested_run_pending = 0;
> > svm->nmi_l1_to_l2 = false;
> > svm->soft_int_injected = false;
> > @@ -1158,24 +1156,16 @@ void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
> > int nested_svm_vmexit(struct vcpu_svm *svm)
> > {
> > struct kvm_vcpu *vcpu = &svm->vcpu;
> > + gpa_t vmcb12_gpa = svm->nested.vmcb12_gpa;
> > struct vmcb *vmcb01 = svm->vmcb01.ptr;
> > struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
> > struct vmcb *vmcb12;
> > struct kvm_host_map map;
> > - int rc;
> > -
> > - rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map);
> > - if (rc) {
> > - if (rc == -EINVAL)
> > - kvm_inject_gp(vcpu, 0);
> > - return 1;
> > - }
> >
> > vmcb12 = map.hva;
> >
> > /* Exit Guest-Mode */
> > leave_guest_mode(vcpu);
> > - svm->nested.vmcb12_gpa = 0;
> > WARN_ON_ONCE(svm->nested.nested_run_pending);
> >
> > kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > @@ -1183,6 +1173,13 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
> > /* in case we halted in L2 */
> > kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
> >
> > + svm->nested.vmcb12_gpa = 0;
> > +
> > + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) {
> > + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> > + return 1;
> > + }
> > +
> > /* Give the current vmcb to the guest */
> >
> > vmcb12->save.es = vmcb02->save.es;
> > @@ -1973,7 +1970,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> >
> > static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
> > {
> > - if (WARN_ON(!is_guest_mode(vcpu)))
> > + if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
> > return true;
> >
> > if (!vcpu->arch.pdptrs_from_userspace &&
> >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-12-09 20:02 ` Yosry Ahmed
@ 2025-12-12 18:32 ` Sean Christopherson
2025-12-12 18:38 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-12-12 18:32 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 10:42:21AM -0800, Sean Christopherson wrote:
> > On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > > On Tue, Dec 09, 2025 at 10:26:31AM -0800, Sean Christopherson wrote:
> > > > On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > > > > > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > > > > > index f6fb70ddf7272..3e805a43ffcdb 100644
> > > > > > > --- a/arch/x86/kvm/svm/svm.h
> > > > > > > +++ b/arch/x86/kvm/svm/svm.h
> > > > > > > @@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
> > > > > > >
> > > > > > > static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > > > > > > {
> > > > > > > - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > > > > > + return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
> > > > > > > + svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > > > >
> > > > > > I would rather rely on Kevin's patch to clear unsupported features.
> > > > >
> > > > > Not sure how Kevin's patch is relevant here, could you please clarify?
> > > >
> > > > Doh, Kevin's patch only touches intercepts. What I was trying to say is that I
> > > > would rather sanitize the snapshot (the approach Kevin's patch takes with the
> > > > intercepts), as opposed to guarding the accessor. That way we can't have bugs
> > > > where KVM checks svm->nested.ctl.nested_ctl directly and bypasses the caps check.
> > >
> > > I see, so clear SVM_NESTED_CTL_NP_ENABLE in
> > > __nested_copy_vmcb_control_to_cache() instead.
> > >
> > > If I drop the guest_cpu_cap_has() check here I will want to leave a
> > > comment so that it's obvious to readers that SVM_NESTED_CTL_NP_ENABLE is
> > > sanitized elsewhere if the guest cannot use NPTs. Alternatively, I can
> > > just keep the guest_cpu_cap_has() check as documentation and a second
> > > line of defense.
> > >
> > > Any preferences?
> >
> > Honestly, do nothing. I want to solidify sanitizing the cache as standard behavior,
> > at which point adding a comment implies that nested_npt_enabled() is somehow special,
> > i.e. that it _doesn't_ follow the standard.
>
> Does this apply to patch 12 as well? In that patch I int_vector,
I <something>?
> int_state, and event_inj when copying them to VMCB02 in
> nested_vmcb02_prepare_control(). Mainly because
> nested_vmcb02_prepare_control() already kinda filters what to copy from
> VMCB12 (e.g. int_ctl), so it seemed like a better fit.
>
> Do I keep that as-is, or do you prefer that I also sanitize these fields
> when copying to the cache in nested_copy_vmcb_control_to_cache()?
I don't think I follow. What would the sanitization look like? Note, I don't
think we need to completely sanitize _every_ field. The key fields are ones
where KVM consumes and/or acts on the field.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-12-12 18:32 ` Sean Christopherson
@ 2025-12-12 18:38 ` Yosry Ahmed
2025-12-13 1:07 ` Sean Christopherson
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-12 18:38 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Fri, Dec 12, 2025 at 10:32:23AM -0800, Sean Christopherson wrote:
> On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > On Tue, Dec 09, 2025 at 10:42:21AM -0800, Sean Christopherson wrote:
> > > On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > > > On Tue, Dec 09, 2025 at 10:26:31AM -0800, Sean Christopherson wrote:
> > > > > On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > > > > > > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > > > > > > index f6fb70ddf7272..3e805a43ffcdb 100644
> > > > > > > > --- a/arch/x86/kvm/svm/svm.h
> > > > > > > > +++ b/arch/x86/kvm/svm/svm.h
> > > > > > > > @@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
> > > > > > > >
> > > > > > > > static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> > > > > > > > {
> > > > > > > > - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > > > > > > + return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
> > > > > > > > + svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> > > > > > >
> > > > > > > I would rather rely on Kevin's patch to clear unsupported features.
> > > > > >
> > > > > > Not sure how Kevin's patch is relevant here, could you please clarify?
> > > > >
> > > > > Doh, Kevin's patch only touches intercepts. What I was trying to say is that I
> > > > > would rather sanitize the snapshot (the approach Kevin's patch takes with the
> > > > > intercepts), as opposed to guarding the accessor. That way we can't have bugs
> > > > > where KVM checks svm->nested.ctl.nested_ctl directly and bypasses the caps check.
> > > >
> > > > I see, so clear SVM_NESTED_CTL_NP_ENABLE in
> > > > __nested_copy_vmcb_control_to_cache() instead.
> > > >
> > > > If I drop the guest_cpu_cap_has() check here I will want to leave a
> > > > comment so that it's obvious to readers that SVM_NESTED_CTL_NP_ENABLE is
> > > > sanitized elsewhere if the guest cannot use NPTs. Alternatively, I can
> > > > just keep the guest_cpu_cap_has() check as documentation and a second
> > > > line of defense.
> > > >
> > > > Any preferences?
> > >
> > > Honestly, do nothing. I want to solidify sanitizing the cache as standard behavior,
> > > at which point adding a comment implies that nested_npt_enabled() is somehow special,
> > > i.e. that it _doesn't_ follow the standard.
> >
> > Does this apply to patch 12 as well? In that patch I int_vector,
>
> I <something>?
I "sanitize" int_vector..
Sorry :D
>
> > int_state, and event_inj when copying them to VMCB02 in
> > nested_vmcb02_prepare_control(). Mainly because
> > nested_vmcb02_prepare_control() already kinda filters what to copy from
> > VMCB12 (e.g. int_ctl), so it seemed like a better fit.
> >
> > Do I keep that as-is, or do you prefer that I also sanitize these fields
> > when copying to the cache in nested_copy_vmcb_control_to_cache()?
>
> I don't think I follow. What would the sanitization look like? Note, I don't
> think we need to completely sanitize _every_ field. The key fields are ones
> where KVM consumes and/or acts on the field.
Patch 12 currently sanitizes what is copied from VMCB12 to VMCB02 for
int_vector, int_state, and event_inj in nested_vmcb02_prepare_control():
@@ -890,9 +893,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
(svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) |
(vmcb01->control.int_ctl & int_ctl_vmcb01_bits);
- vmcb02->control.int_vector = svm->nested.ctl.int_vector;
- vmcb02->control.int_state = svm->nested.ctl.int_state;
- vmcb02->control.event_inj = svm->nested.ctl.event_inj;
+ vmcb02->control.int_vector = svm->nested.ctl.int_vector & SVM_INT_VECTOR_MASK;
+ vmcb02->control.int_state = svm->nested.ctl.int_state & SVM_INTERRUPT_SHADOW_MASK;
+ vmcb02->control.event_inj = svm->nested.ctl.event_inj & ~SVM_EVTINJ_RESERVED_BITS;
vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
My question was: given this:
> I want to solidify sanitizing the cache as standard behavior
Do you prefer that I move this sanitization when copying from L1's
VMCB12 to the cached VMCB12 in nested_copy_vmcb_control_to_cache()?
I initially made it part of nested_vmcb02_prepare_control() as it
already filters what to pick from the VMCB12 for some other related
fields like int_ctl based on what features are exposed to the guest.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-12-10 16:16 ` Yosry Ahmed
@ 2025-12-12 23:23 ` Sean Christopherson
0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-12-12 23:23 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Wed, Dec 10, 2025, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 11:09:26AM -0800, Sean Christopherson wrote:
> > On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > > On Tue, Dec 09, 2025 at 08:11:41AM -0800, Sean Christopherson wrote:
> > > > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > > > Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> > > > > the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> > > > > simplifies the flow of nested_svm_vmrun() and removes all jumps to
> > > > > cleanup labels.
> > > > >
> > > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > > ---
> > > > > arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> > > > > 1 file changed, 13 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > > > index a48668c36a191..89830380cebc5 100644
> > > > > --- a/arch/x86/kvm/svm/nested.c
> > > > > +++ b/arch/x86/kvm/svm/nested.c
> > > > > @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > > > >
> > > > > nested_svm_hv_update_vm_vp_ids(vcpu);
> > > > >
> > > > > + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
> > > >
> > > > This is silly, just do:
> > >
> > > Ack. Any objections to just dropping from_vmrun and moving
> > > kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES) to svm_leave_smm()? I
> > > like the consistency of completely relying on from_vmrun or not at all
> >
> > Zero objections. When I was initially going through this, I actually thought you
> > were _adding_ the flag and was going to yell at you :-)
>
> Ugh from_vmrun is also plumbed into nested_svm_load_cr3() as
> reload_pdptrs. Apparently we shouldn't do that in the call path from
> svm_leave_smm()? Anyway, seems like it'll be non-trivial to detangle (at
> least for me, I have 0 understanding of SMM), so I will leave it as-is.
Agreed, there's enough refactoring going on as it is, no need to turn the
snowball into an avalanche.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
2025-12-11 0:55 ` Yosry Ahmed
@ 2025-12-12 23:30 ` Sean Christopherson
0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-12-12 23:30 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Thu, Dec 11, 2025, Yosry Ahmed wrote:
> On Wed, Dec 10, 2025 at 11:05:44PM +0000, Yosry Ahmed wrote:
> > On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> > > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > Unfortunately this doesn't work, it breaks the newly introduced
> > nested_invalid_cr3_test. The problem is that we bail before we fully
> > initialize VMCB02, then nested_svm_vmrun() calls nested_svm_vmexit(),
> > which restores state from VMCB02 to VMCB12.
> >
> > The test first tries to run L2 with a messed up CR3, which fails but
> > corrupts VMCB12 due to the above, then the second nested entry is
> > screwed.
> >
> > There are two fixes, the easy one is just move the consistency checks
> > after nested_vmcb02_prepare_control() and nested_vmcb02_prepare_save()
> > (like the existing failure mode of nested_svm_load_cr3()). This works,
> > but the code doesn't make a lot of sense because we use VMCB12 to create
> > VMCB02 and THEN check that VMCB12 is valid.
> >
> > The alternative is unfortunately a lot more involved. We only do a
> > partial restore or a "fast #VMEXIT" for failed VMRUNs. We'd need to:
> >
> > 1) Move nested_svm_load_cr3() above nested_vmcb02_prepare_control(),
> > which needs moving nested_svm_init_mmu_context() out of
> > nested_vmcb02_prepare_control() to remain before
> > nested_svm_load_cr3().
> >
> > This makes sure a failed nested VMRUN always needs a "fast #VMEXIT"
> >
> > 2) Figure out which parts of nested_svm_vmexit() are needed in the
> > failed VMRUN case. We need to at least switch the VMCB, propagate the
> > error code, and do some cleanups. We can split this out into the
> > "fast #VMEXIT" path, and use it for failed VMRUNs.
> >
> > Let me know which way you prefer.
>
> I think I prefer (2), the code looks cleaner and I like having a
> separate code path for VMRUN failures. Unless there are objections, I
> will do that in the next version.
With the caveat that I haven't seen the code, that has my vote too. nVMX has a
similar flow, and logically this is equivalent, at least to me. We can probably
even use similar terminology, e.g. vmrun_fail_vmexit instead of vmentry_fail_vmext.
vmentry_fail_vmexit:
vmx_switch_vmcs(vcpu, &vmx->vmcs01);
if (!from_vmentry)
return NVMX_VMENTRY_VMEXIT;
load_vmcs12_host_state(vcpu, vmcs12);
vmcs12->vm_exit_reason = exit_reason.full;
if (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx))
vmx->nested.need_vmcs12_to_shadow_sync = true;
return NVMX_VMENTRY_VMEXIT;
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-12-11 20:13 ` Yosry Ahmed
@ 2025-12-13 0:01 ` Sean Christopherson
0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-12-13 0:01 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Thu, Dec 11, 2025, Yosry Ahmed wrote:
> On Thu, Dec 11, 2025 at 07:25:21PM +0000, Yosry Ahmed wrote:
> > On Tue, Dec 09, 2025 at 08:11:41AM -0800, Sean Christopherson wrote:
> > > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > > Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> > > > the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> > > > simplifies the flow of nested_svm_vmrun() and removes all jumps to
> > > > cleanup labels.
> > > >
> > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > ---
> > > > arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> > > > 1 file changed, 13 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > > index a48668c36a191..89830380cebc5 100644
> > > > --- a/arch/x86/kvm/svm/nested.c
> > > > +++ b/arch/x86/kvm/svm/nested.c
> > > > @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> > > >
> > > > nested_svm_hv_update_vm_vp_ids(vcpu);
> > > >
> > > > + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
> > >
> > > This is silly, just do:
> > >
> > > if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
> > > nested_svm_merge_msrpm(vcpu)) {
> > > svm->nested.nested_run_pending = 0;
> > > svm->nmi_l1_to_l2 = false;
> > > svm->soft_int_injected = false;
> > >
> > > svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > > svm->vmcb->control.exit_code_hi = -1u;
> > > svm->vmcb->control.exit_info_1 = 0;
> > > svm->vmcb->control.exit_info_2 = 0;
> > >
> > > nested_svm_vmexit(svm);
> > > }
> >
> > Actually, if we go with the approach of making all VMRUN failures
> > happen before preparing the VMCB02 (as discussed in the other thread),
> > then we will want to call nested_svm_merge_msrpm() from within
> > enter_svm_guest_mode().
>
> We can also just call nested_svm_merge_msrpm() before
> enter_svm_guest_mode(), which seems to work.
That's likely unsafe, nested_vmcb_check_controls() checks fields that are consumed
by nested_svm_merge_msrpm().
> > Otherwise, we either have a separate failure path for
> > nested_svm_merge_msrpm(), or we make all VMRUN failures happen after
> > preparing the VMCB02 and handled by nested_svm_vmexit().
> >
> > I like having a separate exit path for VMRUN failures, and it makes more
> > sense to do the consistency checks on VMCB12 before preparing VMCB02.
> > But I understand if you prefer to keep things simple and move all
> > failures after VMCB02.
> >
> > I already have it implemented with the separate VMRUN failure path, but
> > I don't wanna spam you with another series if you prefer it the other
> > way.
Spam away.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE
2025-12-12 18:38 ` Yosry Ahmed
@ 2025-12-13 1:07 ` Sean Christopherson
0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-12-13 1:07 UTC (permalink / raw)
To: Yosry Ahmed; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel, stable
On Fri, Dec 12, 2025, Yosry Ahmed wrote:
> On Fri, Dec 12, 2025 at 10:32:23AM -0800, Sean Christopherson wrote:
> > On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> > > Do I keep that as-is, or do you prefer that I also sanitize these fields
> > > when copying to the cache in nested_copy_vmcb_control_to_cache()?
> >
> > I don't think I follow. What would the sanitization look like? Note, I don't
> > think we need to completely sanitize _every_ field. The key fields are ones
> > where KVM consumes and/or acts on the field.
>
> Patch 12 currently sanitizes what is copied from VMCB12 to VMCB02 for
> int_vector, int_state, and event_inj in nested_vmcb02_prepare_control():
>
> @@ -890,9 +893,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> (svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) |
> (vmcb01->control.int_ctl & int_ctl_vmcb01_bits);
>
> - vmcb02->control.int_vector = svm->nested.ctl.int_vector;
> - vmcb02->control.int_state = svm->nested.ctl.int_state;
> - vmcb02->control.event_inj = svm->nested.ctl.event_inj;
> + vmcb02->control.int_vector = svm->nested.ctl.int_vector & SVM_INT_VECTOR_MASK;
> + vmcb02->control.int_state = svm->nested.ctl.int_state & SVM_INTERRUPT_SHADOW_MASK;
> + vmcb02->control.event_inj = svm->nested.ctl.event_inj & ~SVM_EVTINJ_RESERVED_BITS;
> vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
>
> My question was: given this:
>
> > I want to solidify sanitizing the cache as standard behavior
>
> Do you prefer that I move this sanitization when copying from L1's
> VMCB12 to the cached VMCB12 in nested_copy_vmcb_control_to_cache()?
Hmm, good question. Probably? If the main motivation for sanitizing is to
guard against effectively exposing new features unintentionally via vmcs12, then
it seems like the safest option is to ensure the "bad" bits are _never_ set in
KVM-controlled state.
> I initially made it part of nested_vmcb02_prepare_control() as it
> already filters what to pick from the VMCB12 for some other related
> fields like int_ctl based on what features are exposed to the guest.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun()
2025-12-09 16:11 ` Sean Christopherson
2025-12-09 18:30 ` Yosry Ahmed
2025-12-11 19:25 ` Yosry Ahmed
@ 2025-12-15 18:34 ` Yosry Ahmed
2 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-12-15 18:34 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, kvm, linux-kernel
On Tue, Dec 09, 2025 at 08:11:41AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > Call nested_svm_merge_msrpm() from enter_svm_guest_mode() if called from
> > the VMRUN path, instead of making the call in nested_svm_vmrun(). This
> > simplifies the flow of nested_svm_vmrun() and removes all jumps to
> > cleanup labels.
> >
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/kvm/svm/nested.c | 28 +++++++++++++---------------
> > 1 file changed, 13 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index a48668c36a191..89830380cebc5 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1020,6 +1020,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> >
> > nested_svm_hv_update_vm_vp_ids(vcpu);
> >
> > + if (from_vmrun && !nested_svm_merge_msrpm(vcpu))
>
> This is silly, just do:
>
> if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) ||
> nested_svm_merge_msrpm(vcpu)) {
> svm->nested.nested_run_pending = 0;
> svm->nmi_l1_to_l2 = false;
> svm->soft_int_injected = false;
>
> svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> svm->vmcb->control.exit_code_hi = -1u;
> svm->vmcb->control.exit_info_1 = 0;
> svm->vmcb->control.exit_info_2 = 0;
>
> nested_svm_vmexit(svm);
> }
>
> > + return -1;
>
> Please stop returning -1, use a proper -errno.
>
> > +
> > return 0;
> > }
> >
> > @@ -1105,23 +1108,18 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
> >
> > svm->nested.nested_run_pending = 1;
> >
> > - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true))
> > - goto out_exit_err;
> > -
> > - if (nested_svm_merge_msrpm(vcpu))
> > - return ret;
> > -
> > -out_exit_err:
> > - svm->nested.nested_run_pending = 0;
> > - svm->nmi_l1_to_l2 = false;
> > - svm->soft_int_injected = false;
> > + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) {
> > + svm->nested.nested_run_pending = 0;
> > + svm->nmi_l1_to_l2 = false;
> > + svm->soft_int_injected = false;
> >
> > - svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > - svm->vmcb->control.exit_code_hi = 0;
> > - svm->vmcb->control.exit_info_1 = 0;
> > - svm->vmcb->control.exit_info_2 = 0;
> > + svm->vmcb->control.exit_code = SVM_EXIT_ERR;
> > + svm->vmcb->control.exit_code_hi = 0;
> > + svm->vmcb->control.exit_info_1 = 0;
> > + svm->vmcb->control.exit_info_2 = 0;
> >
> > - nested_svm_vmexit(svm);
> > + nested_svm_vmexit(svm);
>
> Note, there's a pre-existing bug in nested_svm_vmexit(). Lovely, and it's a
> user-triggerable WARN_ON() (and not even a WARN_ON_ONCE() at that).
>
> If nested_svm_vmexit() fails to map vmcb12, it (unbelievably stupidly) injects a
> #GP and hopes for the best. Oh FFS, it also has the asinine -EINVAL "logic".
> Anyways, it injects #GP (maybe), and bails early, which leaves
> KVM_REQ_GET_NESTED_STATE_PAGES set. KVM will then process that on the next
> vcpu_enter_guest() and trip the WARN_ON() in svm_get_nested_state_pages().
FWIW, I don't think there will be a warning because when
nested_svm_vmexit() fails to map vmcb12 it also fails to leave guest
mode, so the WARN_ON() should not fire.
I still agree this is a bug and will include a fix/cleanup in the next
version that I will send out shortly.
^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2025-12-15 18:35 UTC | newest]
Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-10 22:29 [PATCH v2 00/13] Nested SVM fixes, cleanups, and hardening Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 01/13] KVM: SVM: Switch svm_copy_lbrs() to a macro Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 02/13] KVM: SVM: Add missing save/restore handling of LBR MSRs Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 03/13] KVM: selftests: Add a test for LBR save/restore (ft. nested) Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 04/13] KVM: nSVM: Fix consistency checks for NP_ENABLE Yosry Ahmed
2025-12-09 16:27 ` Sean Christopherson
2025-12-09 18:07 ` Yosry Ahmed
2025-12-09 18:26 ` Sean Christopherson
2025-12-09 18:35 ` Yosry Ahmed
2025-12-09 18:42 ` Sean Christopherson
2025-12-09 20:02 ` Yosry Ahmed
2025-12-12 18:32 ` Sean Christopherson
2025-12-12 18:38 ` Yosry Ahmed
2025-12-13 1:07 ` Sean Christopherson
2025-11-10 22:29 ` [PATCH v2 05/13] KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 06/13] KVM: nSVM: Add missing consistency check for event_inj Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 07/13] KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 08/13] KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2 Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 09/13] KVM: nSVM: Cache all used fields from VMCB12 Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN Yosry Ahmed
2025-12-09 16:03 ` Sean Christopherson
2025-12-09 18:24 ` Yosry Ahmed
2025-12-09 18:49 ` Sean Christopherson
2025-12-10 23:05 ` Yosry Ahmed
2025-12-11 0:55 ` Yosry Ahmed
2025-12-12 23:30 ` Sean Christopherson
2025-11-10 22:29 ` [PATCH v2 11/13] KVM: nSVM: Simplify nested_svm_vmrun() Yosry Ahmed
2025-12-09 16:11 ` Sean Christopherson
2025-12-09 18:30 ` Yosry Ahmed
2025-12-09 19:09 ` Sean Christopherson
2025-12-10 16:16 ` Yosry Ahmed
2025-12-12 23:23 ` Sean Christopherson
2025-12-11 19:25 ` Yosry Ahmed
2025-12-11 20:13 ` Yosry Ahmed
2025-12-13 0:01 ` Sean Christopherson
2025-12-15 18:34 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 12/13] KVM: nSVM: Sanitize control fields copied from VMCB12 Yosry Ahmed
2025-12-09 16:19 ` Sean Christopherson
2025-12-09 18:37 ` Yosry Ahmed
2025-11-10 22:29 ` [PATCH v2 13/13] KVM: nSVM: Only copy NP_ENABLE from VMCB01's misc_ctl Yosry Ahmed
2025-12-09 16:23 ` Sean Christopherson
2025-12-09 18:38 ` Yosry Ahmed
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).