public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults
@ 2026-02-27  1:13 Yosry Ahmed
  2026-02-27  1:13 ` [PATCH 1/3] KVM: x86: Move nested_run_pending to kvm_vcpu_arch Yosry Ahmed
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Yosry Ahmed @ 2026-02-27  1:13 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Yosry Ahmed

Fix a couple of bugs in triple fault handling: KVM injecting a triple
fault into an L2 that hasn't run yet, and KVM combining #DB/#BP from
KVM_SET_GUEST_DEBUG with existing exceptions, causing a triple fault
(or #DF).

Either of these bugs can result in a triple fault being injected with
nested_run_pending=1, triggering the warning in __nested_vmx_vmexit().

The following syzkaller reproducer should trigger it (although it was
manually modified and I cannot test it directly):

r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000180), 0x2, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
r3 = ioctl$KVM_GET_VCPU_MMAP_SIZE(r0, 0xae04)
mmap$KVM_VCPU(&(0x7f0000fe9000/0x3000)=nil, r3, 0x1000003, 0x13, r2, 0x0)
syz_kvm_setup_cpu$x86(r1, r2, &(0x7f0000fe5000/0x18000)=nil, &(0x7f00000000c0)=[@text64={0x40, 0x0}], 0x1, 0x41, 0x0, 0x0)
ioctl$KVM_RUN(r2, 0xae80, 0x0)
ioctl$KVM_SET_VCPU_EVENTS(r2, 0x4040aea0, &(0x7f00000006c0)=@x86={0xf7, 0x8, 0x29, 0x0, 0x5, 0x67, 0x1, 0x9, 0x9, 0xbd, 0x6, 0xff, 0x0, 0x5, 0x4, 0x3, 0x7, 0x7, 0xc, '\x00', 0x7, 0xb})
ioctl$KVM_SET_GUEST_DEBUG_x86(r2, 0x4048ae9b, &(0x7f0000000100)={0x1d0002, 0x0, {[0x8, 0x0, 0x7, 0x2, 0x87c8, 0x5, 0x5, 0x6]}})
ioctl$KVM_RUN(r2, 0xae80, 0x0)

Yosry Ahmed (3):
  KVM: x86: Move nested_run_pending to kvm_vcpu_arch
  KVM: x86: Do not inject triple faults into an L2 with a pending run
  KVM: x86: Check for injected exceptions before queuing a debug
    exception

 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/svm/nested.c       | 14 +++++-----
 arch/x86/kvm/svm/svm.c          | 12 ++++-----
 arch/x86/kvm/svm/svm.h          |  4 ---
 arch/x86/kvm/vmx/nested.c       | 46 ++++++++++++++++-----------------
 arch/x86/kvm/vmx/vmx.c          | 16 ++++++------
 arch/x86/kvm/vmx/vmx.h          |  3 ---
 arch/x86/kvm/x86.c              | 15 ++++++++++-
 8 files changed, 61 insertions(+), 52 deletions(-)


base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] KVM: x86: Move nested_run_pending to kvm_vcpu_arch
  2026-02-27  1:13 [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults Yosry Ahmed
@ 2026-02-27  1:13 ` Yosry Ahmed
  2026-02-27  1:13 ` [PATCH 2/3] KVM: x86: Do not inject triple faults into an L2 with a pending run Yosry Ahmed
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Yosry Ahmed @ 2026-02-27  1:13 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Yosry Ahmed

Move the nested_run_pending field, present in both svm_nested_state and
nested_vmx, to the common kvm_vcpu_arch. This allows common code to use
it without plumbing it through per-vendor helpers.

nested_run_pending remains zero-initialized, as the entire kvm_vcpu
struct is, and all accesses now go through vcpu->arch instead of
svm->nested or vmx->nested.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/svm/nested.c       | 14 +++++-----
 arch/x86/kvm/svm/svm.c          | 12 ++++-----
 arch/x86/kvm/svm/svm.h          |  4 ---
 arch/x86/kvm/vmx/nested.c       | 46 ++++++++++++++++-----------------
 arch/x86/kvm/vmx/vmx.c          | 16 ++++++------
 arch/x86/kvm/vmx/vmx.h          |  3 ---
 7 files changed, 47 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ff07c45e3c731..0d989f1b67657 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1098,6 +1098,9 @@ struct kvm_vcpu_arch {
 	 */
 	bool pdptrs_from_userspace;
 
+	/* Pending nested VM-Enter to L2, cannot synthesize a VM-Exit to L1 */
+	bool nested_run_pending;
+
 #if IS_ENABLED(CONFIG_HYPERV)
 	hpa_t hv_root_tdp;
 #endif
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd5..c2d4c9c63146e 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1049,7 +1049,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 	if (!npt_enabled)
 		vmcb01->save.cr3 = kvm_read_cr3(vcpu);
 
-	svm->nested.nested_run_pending = 1;
+	vcpu->arch.nested_run_pending = 1;
 
 	if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true))
 		goto out_exit_err;
@@ -1058,7 +1058,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 		goto out;
 
 out_exit_err:
-	svm->nested.nested_run_pending = 0;
+	vcpu->arch.nested_run_pending = 0;
 	svm->nmi_l1_to_l2 = false;
 	svm->soft_int_injected = false;
 
@@ -1138,7 +1138,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 	/* Exit Guest-Mode */
 	leave_guest_mode(vcpu);
 	svm->nested.vmcb12_gpa = 0;
-	WARN_ON_ONCE(svm->nested.nested_run_pending);
+	WARN_ON_ONCE(vcpu->arch.nested_run_pending);
 
 	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
 
@@ -1399,7 +1399,7 @@ void svm_leave_nested(struct kvm_vcpu *vcpu)
 	struct vcpu_svm *svm = to_svm(vcpu);
 
 	if (is_guest_mode(vcpu)) {
-		svm->nested.nested_run_pending = 0;
+		vcpu->arch.nested_run_pending = 0;
 		svm->nested.vmcb12_gpa = INVALID_GPA;
 
 		leave_guest_mode(vcpu);
@@ -1584,7 +1584,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
 	 * previously injected event, the pending exception occurred while said
 	 * event was being delivered and thus needs to be handled.
 	 */
-	bool block_nested_exceptions = svm->nested.nested_run_pending;
+	bool block_nested_exceptions = vcpu->arch.nested_run_pending;
 	/*
 	 * New events (not exceptions) are only recognized at instruction
 	 * boundaries.  If an event needs reinjection, then KVM is handling a
@@ -1761,7 +1761,7 @@ static int svm_get_nested_state(struct kvm_vcpu *vcpu,
 		kvm_state.size += KVM_STATE_NESTED_SVM_VMCB_SIZE;
 		kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE;
 
-		if (svm->nested.nested_run_pending)
+		if (vcpu->arch.nested_run_pending)
 			kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING;
 	}
 
@@ -1898,7 +1898,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 
 	svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET));
 
-	svm->nested.nested_run_pending =
+	vcpu->arch.nested_run_pending =
 		!!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING);
 
 	svm->nested.vmcb12_gpa = kvm_state->hdr.svm.vmcb_pa;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8f8bc863e2143..1d5f119ed2484 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3860,7 +3860,7 @@ bool svm_nmi_blocked(struct kvm_vcpu *vcpu)
 static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
-	if (svm->nested.nested_run_pending)
+	if (vcpu->arch.nested_run_pending)
 		return -EBUSY;
 
 	if (svm_nmi_blocked(vcpu))
@@ -3902,7 +3902,7 @@ static int svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	if (svm->nested.nested_run_pending)
+	if (vcpu->arch.nested_run_pending)
 		return -EBUSY;
 
 	if (svm_interrupt_blocked(vcpu))
@@ -4403,11 +4403,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 		nested_sync_control_from_vmcb02(svm);
 
 		/* Track VMRUNs that have made past consistency checking */
-		if (svm->nested.nested_run_pending &&
+		if (vcpu->arch.nested_run_pending &&
 		    !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
                         ++vcpu->stat.nested_run;
 
-		svm->nested.nested_run_pending = 0;
+		vcpu->arch.nested_run_pending = 0;
 	}
 
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
@@ -4766,7 +4766,7 @@ bool svm_smi_blocked(struct kvm_vcpu *vcpu)
 static int svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
-	if (svm->nested.nested_run_pending)
+	if (vcpu->arch.nested_run_pending)
 		return -EBUSY;
 
 	if (svm_smi_blocked(vcpu))
@@ -4884,7 +4884,7 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
 	if (ret)
 		goto unmap_save;
 
-	svm->nested.nested_run_pending = 1;
+	vcpu->arch.nested_run_pending = 1;
 
 unmap_save:
 	kvm_vcpu_unmap(vcpu, &map_save);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ebd7b36b1ceb9..e6ca25b1c7807 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -195,10 +195,6 @@ struct svm_nested_state {
 	 */
 	void *msrpm;
 
-	/* A VMRUN has started but has not yet been performed, so
-	 * we cannot inject a nested vmexit yet.  */
-	bool nested_run_pending;
-
 	/* cache for control fields of the guest */
 	struct vmcb_ctrl_area_cached ctl;
 
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 248635da67661..031075467a6dc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2273,7 +2273,7 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu,
 
 static u64 nested_vmx_calc_efer(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 {
-	if (vmx->nested.nested_run_pending &&
+	if (vmx->vcpu.arch.nested_run_pending &&
 	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER))
 		return vmcs12->guest_ia32_efer;
 	else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -2513,7 +2513,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 	/*
 	 * Interrupt/Exception Fields
 	 */
-	if (vmx->nested.nested_run_pending) {
+	if (vmx->vcpu.arch.nested_run_pending) {
 		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
 			     vmcs12->vm_entry_intr_info_field);
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
@@ -2621,7 +2621,7 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 			vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
 		}
 
-		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
+		if (kvm_mpx_supported() && vmx->vcpu.arch.nested_run_pending &&
 		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
 	}
@@ -2718,7 +2718,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 			!(evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1);
 	}
 
-	if (vmx->nested.nested_run_pending &&
+	if (vcpu->arch.nested_run_pending &&
 	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
 		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
 		vmx_guest_debugctl_write(vcpu, vmcs12->guest_ia32_debugctl &
@@ -2728,13 +2728,13 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 		vmx_guest_debugctl_write(vcpu, vmx->nested.pre_vmenter_debugctl);
 	}
 
-	if (!vmx->nested.nested_run_pending ||
+	if (!vcpu->arch.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE))
 		vmcs_write_cet_state(vcpu, vmx->nested.pre_vmenter_s_cet,
 				     vmx->nested.pre_vmenter_ssp,
 				     vmx->nested.pre_vmenter_ssp_tbl);
 
-	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
+	if (kvm_mpx_supported() && (!vcpu->arch.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
 		vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs);
 	vmx_set_rflags(vcpu, vmcs12->guest_rflags);
@@ -2747,7 +2747,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
-	if (vmx->nested.nested_run_pending &&
+	if (vcpu->arch.nested_run_pending &&
 	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)) {
 		vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat);
 		vcpu->arch.pat = vmcs12->guest_ia32_pat;
@@ -3335,7 +3335,7 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
 	 *   to bit 8 (LME) if bit 31 in the CR0 field (corresponding to
 	 *   CR0.PG) is 1.
 	 */
-	if (to_vmx(vcpu)->nested.nested_run_pending &&
+	if (vcpu->arch.nested_run_pending &&
 	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)) {
 		if (CC(!kvm_valid_efer(vcpu, vmcs12->guest_ia32_efer)) ||
 		    CC(ia32e != !!(vmcs12->guest_ia32_efer & EFER_LMA)) ||
@@ -3613,15 +3613,15 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 
 	kvm_service_local_tlb_flush_requests(vcpu);
 
-	if (!vmx->nested.nested_run_pending ||
+	if (!vcpu->arch.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
 		vmx->nested.pre_vmenter_debugctl = vmx_guest_debugctl_read();
 	if (kvm_mpx_supported() &&
-	    (!vmx->nested.nested_run_pending ||
+	    (!vcpu->arch.nested_run_pending ||
 	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
 		vmx->nested.pre_vmenter_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
 
-	if (!vmx->nested.nested_run_pending ||
+	if (!vcpu->arch.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE))
 		vmcs_read_cet_state(vcpu, &vmx->nested.pre_vmenter_s_cet,
 				    &vmx->nested.pre_vmenter_ssp,
@@ -3830,7 +3830,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	 * We're finally done with prerequisite checking, and can start with
 	 * the nested entry.
 	 */
-	vmx->nested.nested_run_pending = 1;
+	vcpu->arch.nested_run_pending = 1;
 	vmx->nested.has_preemption_timer_deadline = false;
 	status = nested_vmx_enter_non_root_mode(vcpu, true);
 	if (unlikely(status != NVMX_VMENTRY_SUCCESS))
@@ -3862,12 +3862,12 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 		    !nested_cpu_has(vmcs12, CPU_BASED_NMI_WINDOW_EXITING) &&
 		    !(nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING) &&
 		      (vmcs12->guest_rflags & X86_EFLAGS_IF))) {
-			vmx->nested.nested_run_pending = 0;
+			vcpu->arch.nested_run_pending = 0;
 			return kvm_emulate_halt_noskip(vcpu);
 		}
 		break;
 	case GUEST_ACTIVITY_WAIT_SIPI:
-		vmx->nested.nested_run_pending = 0;
+		vcpu->arch.nested_run_pending = 0;
 		kvm_set_mp_state(vcpu, KVM_MP_STATE_INIT_RECEIVED);
 		break;
 	default:
@@ -3877,7 +3877,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	return 1;
 
 vmentry_failed:
-	vmx->nested.nested_run_pending = 0;
+	vcpu->arch.nested_run_pending = 0;
 	if (status == NVMX_VMENTRY_KVM_INTERNAL_ERROR)
 		return 0;
 	if (status == NVMX_VMENTRY_VMEXIT)
@@ -4274,7 +4274,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	 * previously injected event, the pending exception occurred while said
 	 * event was being delivered and thus needs to be handled.
 	 */
-	bool block_nested_exceptions = vmx->nested.nested_run_pending;
+	bool block_nested_exceptions = vcpu->arch.nested_run_pending;
 	/*
 	 * Events that don't require injection, i.e. that are virtualized by
 	 * hardware, aren't blocked by a pending VM-Enter as KVM doesn't need
@@ -4643,7 +4643,7 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 
 	if (nested_cpu_has_preemption_timer(vmcs12) &&
 	    vmcs12->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER &&
-	    !vmx->nested.nested_run_pending)
+	    !vcpu->arch.nested_run_pending)
 		vmcs12->vmx_preemption_timer_value =
 			vmx_get_preemption_timer_value(vcpu);
 
@@ -5042,7 +5042,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	vmx->nested.mtf_pending = false;
 
 	/* trying to cancel vmlaunch/vmresume is a bug */
-	WARN_ON_ONCE(vmx->nested.nested_run_pending);
+	WARN_ON_ONCE(vcpu->arch.nested_run_pending);
 
 #ifdef CONFIG_KVM_HYPERV
 	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
@@ -6665,7 +6665,7 @@ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu)
 	unsigned long exit_qual;
 	u32 exit_intr_info;
 
-	WARN_ON_ONCE(vmx->nested.nested_run_pending);
+	WARN_ON_ONCE(vcpu->arch.nested_run_pending);
 
 	/*
 	 * Late nested VM-Fail shares the same flow as nested VM-Exit since KVM
@@ -6761,7 +6761,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
 		if (is_guest_mode(vcpu)) {
 			kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE;
 
-			if (vmx->nested.nested_run_pending)
+			if (vcpu->arch.nested_run_pending)
 				kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING;
 
 			if (vmx->nested.mtf_pending)
@@ -6836,7 +6836,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
 void vmx_leave_nested(struct kvm_vcpu *vcpu)
 {
 	if (is_guest_mode(vcpu)) {
-		to_vmx(vcpu)->nested.nested_run_pending = 0;
+		vcpu->arch.nested_run_pending = 0;
 		nested_vmx_vmexit(vcpu, -1, 0, 0);
 	}
 	free_nested(vcpu);
@@ -6973,7 +6973,7 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
 	if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE))
 		return 0;
 
-	vmx->nested.nested_run_pending =
+	vcpu->arch.nested_run_pending =
 		!!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING);
 
 	vmx->nested.mtf_pending =
@@ -7025,7 +7025,7 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
 	return 0;
 
 error_guest_mode:
-	vmx->nested.nested_run_pending = 0;
+	vcpu->arch.nested_run_pending = 0;
 	return ret;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 967b58a8ab9d0..9ef3fb04403d2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5279,7 +5279,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
 
 int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
-	if (to_vmx(vcpu)->nested.nested_run_pending)
+	if (vcpu->arch.nested_run_pending)
 		return -EBUSY;
 
 	/* An NMI must not be injected into L2 if it's supposed to VM-Exit.  */
@@ -5306,7 +5306,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
 
 int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
-	if (to_vmx(vcpu)->nested.nested_run_pending)
+	if (vcpu->arch.nested_run_pending)
 		return -EBUSY;
 
 	/*
@@ -6118,7 +6118,7 @@ static bool vmx_unhandleable_emulation_required(struct kvm_vcpu *vcpu)
 	 * only reachable if userspace modifies L2 guest state after KVM has
 	 * performed the nested VM-Enter consistency checks.
 	 */
-	if (vmx->nested.nested_run_pending)
+	if (vcpu->arch.nested_run_pending)
 		return true;
 
 	/*
@@ -6802,7 +6802,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	 * invalid guest state should never happen as that means KVM knowingly
 	 * allowed a nested VM-Enter with an invalid vmcs12.  More below.
 	 */
-	if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm))
+	if (KVM_BUG_ON(vcpu->arch.nested_run_pending, vcpu->kvm))
 		return -EIO;
 
 	if (is_guest_mode(vcpu)) {
@@ -7730,11 +7730,11 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 		 * Track VMLAUNCH/VMRESUME that have made past guest state
 		 * checking.
 		 */
-		if (vmx->nested.nested_run_pending &&
+		if (vcpu->arch.nested_run_pending &&
 		    !vmx_get_exit_reason(vcpu).failed_vmentry)
 			++vcpu->stat.nested_run;
 
-		vmx->nested.nested_run_pending = 0;
+		vcpu->arch.nested_run_pending = 0;
 	}
 
 	if (unlikely(vmx->fail))
@@ -8491,7 +8491,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu)
 int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	/* we need a nested vmexit to enter SMM, postpone if run is pending */
-	if (to_vmx(vcpu)->nested.nested_run_pending)
+	if (vcpu->arch.nested_run_pending)
 		return -EBUSY;
 	return !is_smm(vcpu);
 }
@@ -8532,7 +8532,7 @@ int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
 		if (ret)
 			return ret;
 
-		vmx->nested.nested_run_pending = 1;
+		vcpu->arch.nested_run_pending = 1;
 		vmx->nested.smm.guest_mode = false;
 	}
 	return 0;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 70bfe81dea540..db84e8001da58 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -138,9 +138,6 @@ struct nested_vmx {
 	 */
 	bool enlightened_vmcs_enabled;
 
-	/* L2 must run next, and mustn't decide to exit to L1. */
-	bool nested_run_pending;
-
 	/* Pending MTF VM-exit into L1.  */
 	bool mtf_pending;
 
-- 
2.53.0.473.g4a7958ca14-goog



* [PATCH 2/3] KVM: x86: Do not inject triple faults into an L2 with a pending run
  2026-02-27  1:13 [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults Yosry Ahmed
  2026-02-27  1:13 ` [PATCH 1/3] KVM: x86: Move nested_run_pending to kvm_vcpu_arch Yosry Ahmed
@ 2026-02-27  1:13 ` Yosry Ahmed
  2026-02-27  1:13 ` [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception Yosry Ahmed
  2026-03-05 17:26 ` [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults Sean Christopherson
  3 siblings, 0 replies; 13+ messages in thread
From: Yosry Ahmed @ 2026-02-27  1:13 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Yosry Ahmed

If a triple fault is injected while the vCPU is in guest mode, but
before L2 has actually run, inject it into L1 instead of synthesizing a
SHUTDOWN VM-Exit to L1, as synthesizing a VM-Exit is not allowed before
the nested VM-Enter completes.

This can happen if KVM exits to userspace with nested_run_pending=1 and
userspace injects a triple fault with KVM_SET_VCPU_EVENTS, triggering
WARN_ON_ONCE(vcpu->arch.nested_run_pending) in __nested_vmx_vmexit().

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
---
 arch/x86/kvm/x86.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d94..e39c5faf94230 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11914,6 +11914,19 @@ static int kvm_x86_vcpu_pre_run(struct kvm_vcpu *vcpu)
 	    !kvm_apic_init_sipi_allowed(vcpu))
 		return -EINVAL;
 
+	/*
+	 * If a triple fault was injected in guest mode (e.g. through
+	 * KVM_SET_VCPU_EVENTS), but before L2 actually ran, inject it into L1
+	 * instead of synthesizing a SHUTDOWN VM-Exit to L1, as synthesizing a
+	 * VM-Exit is not allowed before completing the nested VM-Enter.
+	 */
+	if (is_guest_mode(vcpu) && vcpu->arch.nested_run_pending &&
+	    kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu)) {
+		vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+		vcpu->mmio_needed = 0;
+		return 0;
+	}
+
 	return kvm_x86_call(vcpu_pre_run)(vcpu);
 }
 
-- 
2.53.0.473.g4a7958ca14-goog



* [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-02-27  1:13 [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults Yosry Ahmed
  2026-02-27  1:13 ` [PATCH 1/3] KVM: x86: Move nested_run_pending to kvm_vcpu_arch Yosry Ahmed
  2026-02-27  1:13 ` [PATCH 2/3] KVM: x86: Do not inject triple faults into an L2 with a pending run Yosry Ahmed
@ 2026-02-27  1:13 ` Yosry Ahmed
  2026-02-27 16:06   ` Sean Christopherson
  2026-03-05 17:26 ` [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults Sean Christopherson
  3 siblings, 1 reply; 13+ messages in thread
From: Yosry Ahmed @ 2026-02-27  1:13 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Yosry Ahmed

On KVM_SET_GUEST_DEBUG, if a #DB or #BP is injected with
KVM_GUESTDBG_INJECT_DB or KVM_GUESTDBG_INJECT_BP, KVM fails with -EBUSY
if there is an existing pending exception. This was introduced in
commit 4f926bf29186 ("KVM: x86: Polish exception injection via
KVM_SET_GUEST_DEBUG") to avoid a warning in kvm_queue_exception(),
presumably to avoid overriding a pending exception.

This also added another (arguably nice) property: if there's a pending
exception, KVM_SET_GUEST_DEBUG cannot cause a #DF or triple fault.
However, if an exception has already been injected, KVM_SET_GUEST_DEBUG
can still cause a #DF or triple fault in the guest, as
kvm_multiple_exception() combines the two.

Check for both pending and injected exceptions when handling
KVM_GUESTDBG_INJECT_DB and KVM_GUESTDBG_INJECT_BP, to avoid
accidentally causing a #DF or triple fault.

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e39c5faf94230..0c8aacf1fa67f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12543,7 +12543,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 
 	if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
 		r = -EBUSY;
-		if (kvm_is_exception_pending(vcpu))
+		if (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected)
 			goto out;
 		if (dbg->control & KVM_GUESTDBG_INJECT_DB)
 			kvm_queue_exception(vcpu, DB_VECTOR);
-- 
2.53.0.473.g4a7958ca14-goog



* Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-02-27  1:13 ` [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception Yosry Ahmed
@ 2026-02-27 16:06   ` Sean Christopherson
  2026-02-27 16:34     ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2026-02-27 16:06 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> On KVM_SET_GUEST_DEBUG, if a #DB or #BP is injected with
> KVM_GUESTDBG_INJECT_DB or KVM_GUESTDBG_INJECT_BP, KVM fails with -EBUSY
> if there is an existing pending exception. This was introduced in
> commit 4f926bf29186 ("KVM: x86: Polish exception injection via
> KVM_SET_GUEST_DEBUG") to avoid a warning in kvm_queue_exception(),
> presumably to avoid overriding a pending exception.
> 
> This added another (arguably nice) property, if there's a pending
> exception, KVM_SET_GUEST_DEBUG cannot cause a #DF or triple fault.
> However, if an exception is injected, KVM_SET_GUEST_DEBUG will cause
> a #DF or triple fault in the guest, as kvm_multiple_exception() combines
> them.

First off, this patch looks good irrespective of nested crud.  Disallowing injection
of #DB/#BP while there's already an injected exception aligns with architectural
behavior; KVM needs to finish delivering the exception and thus "complete" the
instruction before queueing a new exception.

As for nested, I _was_ going to say that, assuming the original motivation for
this patch is to avoid triggering a nested VM-Exit while nested_run_pending=1,
this is incomplete.  E.g. if the #BP or #DB itself is being intercepted by L1,
then queueing the exception will incorrectly deliver the nested VM-Exit before
nested VM-Enter completes.  I.e. I _thought_ we would also need:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..fdbd272027ed 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12526,6 +12526,9 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
        if (vcpu->arch.guest_state_protected)
                return -EINVAL;
 
+       if (vcpu->arch.nested_run_pending)
+               return -EBUSY;
+
        vcpu_load(vcpu);
 
        if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {

But that isn't actually the case, because {svm,vmx}_check_nested_events() blocks
exceptions and VM-Exits while nested_run_pending=1.  Off-list, I had rejected
modifying nested_vmx_triple_fault() to drop the triple fault like so:

@@ -5191,6 +5191,9 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 
 static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
 {
+       if (to_vmx(vcpu)->nested.nested_run_pending)
+               return;
+
        kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
        nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
 }

because "That would largely defeat the entire purpose of the WARN. The whole point
is to detect bugs where KVM synthesizes a VM-Exit before completing nested VM-Enter.
It should be impossible for nested_{vmx,svm}_triple_fault() to be called with
nested_run_pending=true."

Given the above, that argument doesn't hold up at first glance, because blocking
the triple fault if nested_run_pending=1 would be consistent with how other "synchronous"
events are handled by nVMX and nSVM.  But after staring at this a bit, unless I'm
forgetting something (entirely possible), I actually think {svm,vmx}_check_nested_events()
are wrong, because I stand behind my quoted statement: it should be impossible for
KVM to synthesize a synchronous event before completing nested VM-Enter.

Blocking pending and injected exceptions was added by bfcf83b1444d ("KVM: nVMX:
Fix trying to cancel vmlauch/vmresume"), and unfortunately neither the changelog
nor Lore[1] provides any details as to what exactly prompted the fix.  I _suspect_
it was either syzkaller induced, or the manifestation of other bugs in KVM's
exception handling.

Finally getting to the point, in theory, I _think_ KVM should actually WARN on
all cases where it temporarily blocks a synchronous event due to a pending nested
VM-Enter (see below diff for the basic gist).  However, that would open the
floodgates to syzkaller, because KVM_SET_VCPU_EVENTS can obviously stuff events,
KVM_X86_SET_MCE can queue a #MC, etc.

I _think_ we might be able to get away with rejecting KVM_SET_VCPU_EVENTS if
nested_run_pending=1, without breaking userspace?  Google's VMM is insane and does
KVM_SET_VCPU_EVENTS before KVM_SET_NESTED_STATE, but in real usage, i.e. outside
of selftests, (I hope) no VMM will restore into a "live" vCPU.

So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
or so, that is set with nested_run_pending, but cleared on an exit to userspace,
and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.

I think I'm leaning toward (b)?  Except for KVM_SET_GUEST_DEBUG, where userspace
is trying to interpose on the guest, restricting ioctls doesn't really add any
value in practice.  Yeah, in theory it could _maybe_ prevent userspace from shooting
itself in the foot, but practically speaking, if userspace is restoring state into
a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
completely broken.

My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
something stupid with nested_run_pending=1" case.  So from that perspective, part
of me wants to get greedy and try for (a).

[1] https://lore.kernel.org/all/?q=%22KVM:%20nVMX:%20Fix%20trying%20to%20cancel%20vmlauch%22

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd..d624e4db704a 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1605,14 +1605,14 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
        }
 
        if (vcpu->arch.exception_vmexit.pending) {
-               if (block_nested_exceptions)
+               if (WARN_ON_ONCE(block_nested_exceptions))
                         return -EBUSY;
                nested_svm_inject_exception_vmexit(vcpu);
                return 0;
        }
 
        if (vcpu->arch.exception.pending) {
-               if (block_nested_exceptions)
+               if (WARN_ON_ONCE(block_nested_exceptions))
                        return -EBUSY;
                return 0;
        }
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 248635da6766..a223c5e86188 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4336,7 +4336,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
         */
        if (vcpu->arch.exception_vmexit.pending &&
            !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
-               if (block_nested_exceptions)
+               if (WARN_ON_ONCE(block_nested_exceptions))
                        return -EBUSY;
 
                nested_vmx_inject_exception_vmexit(vcpu);
@@ -4345,13 +4345,13 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 
        if (vcpu->arch.exception.pending &&
            !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
-               if (block_nested_exceptions)
+               if (WARN_ON_ONCE(block_nested_exceptions))
                        return -EBUSY;
                goto no_vmexit;
        }
 
        if (vmx->nested.mtf_pending) {
-               if (block_nested_events)
+               if (WARN_ON_ONCE(block_nested_events))
                        return -EBUSY;
                nested_vmx_update_pending_dbg(vcpu);
                nested_vmx_vmexit(vcpu, EXIT_REASON_MONITOR_TRAP_FLAG, 0, 0);
@@ -4359,7 +4359,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
        }
 
        if (vcpu->arch.exception_vmexit.pending) {
-               if (block_nested_exceptions)
+               if (WARN_ON_ONCE(block_nested_exceptions))
                        return -EBUSY;
 
                nested_vmx_inject_exception_vmexit(vcpu);
@@ -4367,7 +4367,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
        }
 
        if (vcpu->arch.exception.pending) {
-               if (block_nested_exceptions)
+               if (WARN_ON_ONCE(block_nested_exceptions))
                        return -EBUSY;
                goto no_vmexit;
        }


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-02-27 16:06   ` Sean Christopherson
@ 2026-02-27 16:34     ` Sean Christopherson
  2026-02-27 17:31       ` Yosry Ahmed
  0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2026-02-27 16:34 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 27, 2026, Sean Christopherson wrote:
> So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
> KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
> with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
> or so, that is set with nested_run_pending, but cleared on an exit to userspace,
> and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
> the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.
> 
> I think I'm leaning toward (b)?  Except for KVM_SET_GUEST_DEBUG, where userspace
> is trying to interpose on the guest, restricting ioctls doesn't really add any
> value in practice.  Yeah, in theory it could _maybe_ prevent userspace from shooting
> itself in the foot, but practically speaking, if userspace is restoring state into
> a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
> completely broken.
> 
> My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
> vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
> something stupid with nested_run_pending=1" case.  So from that perspective, part
> of me wants to get greedy and try for (a).

On second (fifth?) thought, I don't think (a) is a good idea.  In addition to
potentially breaking userspace, it also risks preventing genuinely useful sequences.
E.g. even if no VMM does so today, it's entirely plausible that a VMM could want
to asynchronously inject an #MC to mimic a broadcast, and that the injection could
collide with a pending nested VM-Enter.

I'll send a separate (maybe RFC?) series for (b) using patch 1 as a starting point.
I want to fiddle around with some ideas, and it'll be faster to sketch things out
in code versus trying to describe things in text.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-02-27 16:34     ` Sean Christopherson
@ 2026-02-27 17:31       ` Yosry Ahmed
  2026-02-27 18:18         ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Yosry Ahmed @ 2026-02-27 17:31 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 27, 2026 at 8:34 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Feb 27, 2026, Sean Christopherson wrote:
> > So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
> > KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
> > with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
> > or so, that is set with nested_run_pending, but cleared on an exit to userspace,
> > and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
> > the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.

I like the idea of the WARN there, although something in the back of
my mind tells me I went through this code before with an exception in
mind that could be injected with nested_run_pending=1, but I can't
remember it. Maybe it was injected by userspace and all is good.

That being said, I hate nested_run_in_progress. It's too close to
nested_run_pending and I am pretty sure they will be mixed up.

exception_from_userspace's name made me think this is something we
could key off to WARN, but it's meant to morph queued exceptions from
userspace into an "exception_vmexit" if needed. The field name is
generic but its functionality isn't, maybe it should have been called
exception_check_vmexit or something. Anyway..

That gave me an idea though, can we add a field to
kvm_queued_exception to identify the origin of the exception
(userspace vs. KVM)? Then we can key the warning off of that.

We can potentially also avoid adding the field and just plumb the
argument through to kvm_multiple_exception(), and WARN there if
nested_run_pending is set and the origin is not userspace?

> >
> > I think I'm leaning toward (b)?  Except for KVM_SET_GUEST_DEBUG, where userspace
> > is trying to interpose on the guest, restricting ioctls doesn't really add any
> > value in practice.  Yeah, in theory it could _maybe_ prevent userspace from shooting
> > itself in the foot, but practically speaking, if userspace is restoring state into
> > a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
> > completely broken.
> >
> > My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
> > vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
> > something stupid with nested_run_pending=1" case.  So from that perspective, part
> > of me wants to get greedy and try for (a).
>
> On second (fifth?) thought, I don't think (a) is a good idea.  In addition to
> potentially breaking userspace, it also risks preventing genuinely useful sequences.
> E.g. even if no VMM does so today, it's entirely plausible that a VMM could want
> to asynchronously inject an #MC to mimic a broadcast, and that the injection could
> collide with a pending nested VM-Enter.
>
> I'll send a separate (maybe RFC?) series for (b) using patch 1 as a starting point.
> I want to fiddle around with some ideas, and it'll be faster to sketch things out
> in code versus trying to describe things in text.

So you'll apply patch 3 as-is, drop patch 2, and (potentially) take
patch 1 and build another series on top of it?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-02-27 17:31       ` Yosry Ahmed
@ 2026-02-27 18:18         ` Sean Christopherson
  2026-02-27 18:34           ` Yosry Ahmed
  0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2026-02-27 18:18 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> On Fri, Feb 27, 2026 at 8:34 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Feb 27, 2026, Sean Christopherson wrote:
> > > So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
> > > KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
> > > with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
> > > or so, that is set with nested_run_pending, but cleared on an exit to userspace,
> > > and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
> > > the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.
> 
> I like the idea of the WARN there, although something in the back of
> my mind tells me I went through this code before with an exception in
> mind that could be injected with nested_run_pending=1, but I can't
> remember it. Maybe it was injected by userspace and all is good.

If there is such a flow, it's likely a bug, i.e. we'd want the WARN.  AFAIK,
every single time the WARN has been hit in the last ~2-3 years has been due to
syzkaller.

> That being said, I hate nested_run_in_progress. It's too close to
> nested_run_pending and I am pretty sure they will be mixed up.

Agreed, though the fact that the name is _too_ close means that, aside from the
potential for disaster (minor detail), it's accurate.

One thought is to hide nested_run_in_progress behind a KConfig, so that attempts
to use it for anything but the sanity check(s) would fail the build.  I don't
really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
I think we want this enabled in production.

I'll chew on this a bit...

> exception_from_userspace's name made me think this is something we
> could key off to WARN, but it's meant to morph queued exceptions from
> userspace into an "exception_vmexit" if needed. The field name is
> generic but its functionality isn't, maybe it should have been called
> exception_check_vmexit or something. Anyway..

No?  It's not a "check", it's literally a pending exception that has been morphed
to a VM-Exit.

Hmm, though looking at all that code again, I bet we can dedup a _lot_ of code by
adding kvm_queued_exception.is_vmexit instead of tracking a completely separate
exception.  The only potential hiccup I can think of is if there's some wrinkle
with the interaction with already pending/injected exceptions.  Pending should be
fine, as the VM-Exit has priority.

Ah, scratch that idea, injected exceptions need to be tracked separately, e.g. see
vmcs12_save_pending_event().  It's correct for vmx_check_nested_events() to
deliver a VM-Exit even if there is an already-injected exception, e.g. if an EPT
Violation in L1's purview triggers when vectoring an injected exception, but in
that case, KVM needs to save the injected exception information into vmc{b,s}12.

> That gave me an idea though, can we add a field to
> kvm_queued_exception to identify the origin of the exception
> (userspace vs. KVM)? Then we can key the warning off of that.

That would incur non-trivial maintenance costs, and it would be tricky to get the
broader protection of the existing WARNing "right".  E.g. how would KVM know that
the VM-Exit was originally induced by an exception that was queued by userspace?

> We can potentially also avoid adding the field and just plumb the
> argument through to kvm_multiple_exception(), and WARN there if
> nested_run_pending is set and the origin is not userspace?

Not really, because kvm_vcpu_ioctl_x86_set_vcpu_events() doesn't use
kvm_queue_exception(), it stuffs things directly.

That said, if you want to try and code it up, I say go for it.  Worst case scenario
you'll have wasted a bit of time.

> > > I think I'm leaning toward (b)?  Except for KVM_SET_GUEST_DEBUG, where userspace
> > > is trying to interpose on the guest, restricting ioctls doesn't really add any
> > > value in practice.  Yeah, in theory it could _maybe_ prevent userspace from shooting
> > > itself in the foot, but practically speaking, if userspace is restoring state into
> > > a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
> > > completely broken.
> > >
> > > My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
> > > vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
> > > something stupid with nested_run_pending=1" case.  So from that perspective, part
> > > of me wants to get greedy and try for (a).
> >
> > On second (fifth?) thought, I don't think (a) is a good idea.  In addition to
> > potentially breaking userspace, it also risks preventing genuinely useful sequences.
> > E.g. even if no VMM does so today, it's entirely plausible that a VMM could want
> > to asynchronously inject an #MC to mimic a broadcast, and that the injection could
> > collide with a pending nested VM-Enter.
> >
> > I'll send a separate (maybe RFC?) series for (b) using patch 1 as a starting point.
> > I want to fiddle around with some ideas, and it'll be faster to sketch things out
> > in code versus trying to describe things in text.
> 
> So you'll apply patch 3 as-is, drop patch 2, and (potentially) take
> patch 1 and build another series on top of it?

Yeah, that's where I'm trending.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-02-27 18:18         ` Sean Christopherson
@ 2026-02-27 18:34           ` Yosry Ahmed
  2026-03-02 23:22             ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Yosry Ahmed @ 2026-02-27 18:34 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

> > That being said, I hate nested_run_in_progress. It's too close to
> > nested_run_pending and I am pretty sure they will be mixed up.
>
> Agreed, though the fact that the name is _too_ close means that, aside from the
> potential for disaster (minor detail), it's accurate.
>
> One thought is to hide nested_run_in_progress behind a KConfig, so that attempts
> to use it for anything but the sanity check(s) would fail the build.  I don't
> really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
> I think we want this enabled in production.
>
> I'll chew on this a bit...

Maybe (if we go this direction) name it very explicitly
warn_on_nested_exception if it's only intended to be used for the
sanity checks?

>
> > exception_from_userspace's name made me think this is something we
> > could key off to WARN, but it's meant to morph queued exceptions from
> > userspace into an "exception_vmexit" if needed. The field name is
> > generic but its functionality isn't, maybe it should have been called
> > exception_check_vmexit or something. Anyway..
>
> No?  It's not a "check", it's literally a pending exception that has been morphed
> to a VM-Exit.

I meant that the exception_from_userspace flag means "KVM should check
if this exception should be morphed to a VM-Exit". It doesn't mean
that the exception has already morphed or will necessarily morph. So
exception_check_vmexit makes sense to me, if it's set, KVM checks if
we need to morph the exception to a VM-Exit.

> > That gave me an idea though, can we add a field to
> > kvm_queued_exception to identify the origin of the exception
> > (userspace vs. KVM)? Then we can key the warning off of that.
>
> That would incur non-trivial maintenance costs, and it would be tricky to get the
> broader protection of the existing WARNing "right".  E.g. how would KVM know that
> the VM-Exit was originally induced by an exception that was queued by userspace?

It should have the info when morphing a pending exception to a
VM-Exit, assuming whoever is queuing the exception is passing it in,
but yeah I see how this can be a burden.

>
> > We can potentially also avoid adding the field and just plumb the
> > argument through to kvm_multiple_exception(), and WARN there if
> > nested_run_pending is set and the origin is not userspace?
>
> Not really, because kvm_vcpu_ioctl_x86_set_vcpu_events() doesn't use
> kvm_queue_exception(), it stuffs things directly.

Right, what I had in mind was that by default exceptions are assumed
to be queued by KVM, so kvm_vcpu_ioctl_x86_set_vcpu_events() doesn't
need to change. Basically, if a code path is queuing an exception from
userspace, it should use a new variant of kvm_queue_exception() (e.g.
kvm_queue_exception_u()). If it's stuffing things directly, nothing to
do. I think the code paths queuing exceptions from userspace should be
limited, so it should be fine to do this.

That being said, this still WARNs on the queuing side, not on the
checking side, so if you think that's not the right thing to do in
general then scratch this too.

>
> That said, if you want to try and code it up, I say go for it.  Worst case scenario
> you'll have wasted a bit of time.

I meant something like this (completely untested); it just WARNs if
we try to queue the exception, it doesn't really stop it:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c8aacf1fa67f..6f4148eae08be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -836,11 +836,14 @@ static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vector

 static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
                                   bool has_error, u32 error_code,
-                                  bool has_payload, unsigned long payload)
+                                  bool has_payload, unsigned long payload,
+                                  bool from_userspace)
 {
        u32 prev_nr;
        int class1, class2;

+       WARN_ON_ONCE(!from_userspace && vcpu->arch.nested_run_pending);
+
        kvm_make_request(KVM_REQ_EVENT, vcpu);

        /*
@@ -899,7 +902,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,

 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
 {
-       kvm_multiple_exception(vcpu, nr, false, 0, false, 0);
+       kvm_multiple_exception(vcpu, nr, false, 0, false, 0, false);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_queue_exception);

@@ -907,14 +910,19 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_queue_exception);
 void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr,
                           unsigned long payload)
 {
-       kvm_multiple_exception(vcpu, nr, false, 0, true, payload);
+       kvm_multiple_exception(vcpu, nr, false, 0, true, payload, false);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_queue_exception_p);

 static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
                                    u32 error_code, unsigned long payload)
 {
-       kvm_multiple_exception(vcpu, nr, true, error_code, true, payload);
+       kvm_multiple_exception(vcpu, nr, true, error_code, true, payload, false);
+}
+
+static void kvm_queue_exception_u(struct kvm_vcpu *vcpu, unsigned nr)
+{
+       kvm_multiple_exception(vcpu, nr, false, 0, false, 0, true);
 }

 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
@@ -1015,7 +1023,7 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu)

 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
 {
-       kvm_multiple_exception(vcpu, nr, true, error_code, false, 0);
+       kvm_multiple_exception(vcpu, nr, true, error_code, false, 0, false);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_queue_exception_e);

@@ -5519,7 +5527,7 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
                banks[3] = mce->misc;
                vcpu->arch.mcg_status = mce->mcg_status;
                banks[1] = mce->status;
-               kvm_queue_exception(vcpu, MC_VECTOR);
+               kvm_queue_exception_u(vcpu, MC_VECTOR);
        } else if (!(banks[1] & MCI_STATUS_VAL)
                   || !(banks[1] & MCI_STATUS_UC)) {
                if (banks[1] & MCI_STATUS_VAL)
@@ -12546,9 +12554,9 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
                if (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected)
                        goto out;
                if (dbg->control & KVM_GUESTDBG_INJECT_DB)
-                       kvm_queue_exception(vcpu, DB_VECTOR);
+                       kvm_queue_exception_u(vcpu, DB_VECTOR);
                else
-                       kvm_queue_exception(vcpu, BP_VECTOR);
+                       kvm_queue_exception_u(vcpu, BP_VECTOR);
        }

        /*

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-02-27 18:34           ` Yosry Ahmed
@ 2026-03-02 23:22             ` Sean Christopherson
  2026-03-02 23:36               ` Yosry Ahmed
  0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2026-03-02 23:22 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> > > That being said, I hate nested_run_in_progress. It's too close to
> > > nested_run_pending and I am pretty sure they will be mixed up.
> >
> > Agreed, though the fact that the name is _too_ close means that, aside from the
> > potential for disaster (minor detail), it's accurate.
> >
> > One thought is to hide nested_run_in_progress behind a KConfig, so that attempts
> > to use it for anything but the sanity check(s) would fail the build.  I don't
> > really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
> > I think we want this enabled in production.
> >
> > I'll chew on this a bit...
> 
> Maybe (if we go this direction) name it very explicitly
> warn_on_nested_exception if it's only intended to be used for the
> sanity checks?

It's not just about exceptions though.  That's the case that has caused a rash
of recent problems, but the rule isn't specific to exceptions, it's very broadly
Thou Shalt Not Cancel VMRUN.

I think that's where there's some disconnect.  We can't make the nested_run_pending
warnings go away by adding more sanity checks, and I am dead set against removing
those warnings.

Aha!  Idea.  What if we turn nested_run_pending into a u8, and use a magic value
of '2' to indicate that userspace gained control of the CPU since nested_run_pending
was set, and then only WARN on nested_run_pending==1?  That way we don't have to
come up with a new name, and there's zero chance of nested_run_pending and something
like nested_run_in_progress getting out of sync.

---
 arch/x86/include/asm/kvm_host.h |  6 +++++-
 arch/x86/kvm/svm/nested.c       |  3 ++-
 arch/x86/kvm/vmx/nested.c       |  4 ++--
 arch/x86/kvm/x86.c              |  7 +++++++
 arch/x86/kvm/x86.h              | 10 ++++++++++
 5 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 19b3790e5e99..a8d39b3aff6a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1104,8 +1104,12 @@ struct kvm_vcpu_arch {
 	 * can only occur at instruction boundaries.  The only exception is
 	 * VMX's "notify" exits, which exist in large part to break the CPU out
 	 * of infinite ucode loops, but can corrupt vCPU state in the process!
+	 *
+	 * For all intents and purposes, this is a boolean, but it's tracked as
+	 * a u8 so that KVM can detect when userspace may have stuffed vCPU
+	 * state and generated an architecturally-impossible VM-Exit.
 	 */
-	bool nested_run_pending;
+	u8 nested_run_pending;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	hpa_t hv_root_tdp;
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index c2d4c9c63146..77ff9ead957c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1138,7 +1138,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 	/* Exit Guest-Mode */
 	leave_guest_mode(vcpu);
 	svm->nested.vmcb12_gpa = 0;
-	WARN_ON_ONCE(vcpu->arch.nested_run_pending);
+
+	kvm_warn_on_nested_run_pending(vcpu);
 
 	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
 
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 031075467a6d..5659545360dc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5042,7 +5042,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	vmx->nested.mtf_pending = false;
 
 	/* trying to cancel vmlaunch/vmresume is a bug */
-	WARN_ON_ONCE(vcpu->arch.nested_run_pending);
+	kvm_warn_on_nested_run_pending(vcpu);
 
 #ifdef CONFIG_KVM_HYPERV
 	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
@@ -6665,7 +6665,7 @@ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu)
 	unsigned long exit_qual;
 	u32 exit_intr_info;
 
-	WARN_ON_ONCE(vcpu->arch.nested_run_pending);
+	kvm_warn_on_nested_run_pending(vcpu);
 
 	/*
 	 * Late nested VM-Fail shares the same flow as nested VM-Exit since KVM
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..30ff5a755572 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12023,6 +12023,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	if (r <= 0)
 		goto out;
 
+	/*
+	 * If userspace may have modified vCPU state, mark nested_run_pending
+	 * as "untrusted" to avoid triggering false-positive WARNs.
+	 */
+	if (vcpu->arch.nested_run_pending == 1)
+		vcpu->arch.nested_run_pending = 2;
+
 	r = vcpu_run(vcpu);
 
 out:
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 94d4f07aaaa0..d3003c8be961 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -188,6 +188,16 @@ static inline bool kvm_can_set_cpuid_and_feature_msrs(struct kvm_vcpu *vcpu)
 	return vcpu->arch.last_vmentry_cpu == -1 && !is_guest_mode(vcpu);
 }
 
+/*
+ * WARN if a nested VM-Enter is pending completion, and userspace hasn't gained
+ * control since the nested VM-Enter was initiated (in which case, userspace
+ * may have modified vCPU state to induce an architecturally invalid VM-Exit).
+ */
+static inline void kvm_warn_on_nested_run_pending(struct kvm_vcpu *vcpu)
+{
+	WARN_ON_ONCE(vcpu->arch.nested_run_pending == 1);
+}
+
 static inline void kvm_set_mp_state(struct kvm_vcpu *vcpu, int mp_state)
 {
 	vcpu->arch.mp_state = mp_state;

base-commit: a68a4bbc5b9ce5b722473399f05cb05217abaee8
--

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-03-02 23:22             ` Sean Christopherson
@ 2026-03-02 23:36               ` Yosry Ahmed
  2026-03-02 23:47                 ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Yosry Ahmed @ 2026-03-02 23:36 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Mar 2, 2026 at 3:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> > > > That being said, I hate nested_run_in_progress. It's too close to
> > > > nested_run_pending and I am pretty sure they will be mixed up.
> > >
> > > Agreed, though the fact that the name is _too_ close means that, aside from the
> > > potential for disaster (minor detail), it's accurate.
> > >
> > > One thought is to hide nested_run_in_progress behind a KConfig, so that attempts
> > > to use it for anything but the sanity check(s) would fail the build.  I don't
> > > really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
> > > I think we want this enabled in production.
> > >
> > > I'll chew on this a bit...
> >
> > Maybe (if we go this direction) name it very explicitly
> > warn_on_nested_exception if it's only intended to be used for the
> > sanity checks?
>
> It's not just about exceptions though.  That's the case that has caused a rash
> of recent problems, but the rule isn't specific to exceptions, it's very broadly
> Thou Shalt Not Cancel VMRUN.
>
> I think that's where there's some disconnect.  We can't make the nested_run_pending
> warnings go away by adding more sanity checks, and I am dead set against removing
> those warnings.
>
> Aha!  Idea.  What if we turn nested_run_pending into a u8, and use a magic value
> of '2' to indicate that userspace gained control of the CPU since nested_run_pending
> was set, and then only WARN on nested_run_pending==1?  That way we don't have to
> come up with a new name, and there's zero chance of nested_run_pending and something
> like nested_run_in_progress getting out of sync.

Yeah, this should work; the only thing I would change is using macros
instead of 1 and 2 for readability.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
  2026-03-02 23:36               ` Yosry Ahmed
@ 2026-03-02 23:47                 ` Sean Christopherson
  0 siblings, 0 replies; 13+ messages in thread
From: Sean Christopherson @ 2026-03-02 23:47 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Mar 02, 2026, Yosry Ahmed wrote:
> On Mon, Mar 2, 2026 at 3:22 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> > > > > That being said, I hate nested_run_in_progress. It's too close to
> > > > > nested_run_pending and I am pretty sure they will be mixed up.
> > > >
> > > > Agreed, though the fact that the name is _too_ close means that, aside from the
> > > > potential for disaster (minor detail), it's accurate.
> > > >
> > > > One thought is to hide nested_run_in_progress behind a KConfig, so that attempts
> > > > to use it for anything but the sanity check(s) would fail the build.  I don't
> > > > really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
> > > > I think we want this enabled in production.
> > > >
> > > > I'll chew on this a bit...
> > >
> > > Maybe (if we go this direction) name it very explicitly
> > > warn_on_nested_exception if it's only intended to be used for the
> > > sanity checks?
> >
> > It's not just about exceptions though.  That's the case that has caused a rash
> > of recent problems, but the rule isn't specific to exceptions, it's very broadly
> > Thou Shalt Not Cancel VMRUN.
> >
> > I think that's where there's some disconnect.  We can't make the nested_run_pending
> > warnings go away by adding more sanity checks, and I am dead set against removing
> > those warnings.
> >
> > Aha!  Idea.  What if we turn nested_run_pending into a u8, and use a magic value
> > of '2' to indicate that userspace gained control of the CPU since nested_run_pending
> > was set, and then only WARN on nested_run_pending==1?  That way we don't have to
> > come up with a new name, and there's zero chance of nested_run_pending and something
> > like nested_run_in_progress getting out of sync.
> 
> Yeah this should work, the only thing I would change is using macros
> instead of 1 and 2 for readability.

I was "this" close to using an enum or #define, but I couldn't figure out a clean
solution to this code:

	vcpu->arch.nested_run_pending =
		!!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING);

as I didn't want to end up with effectively:

	if (true)
		x = 1;
	else
		x = 0;

But thinking more on it, that code is inherently untrusted, so it can be this:

	if (kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING)
		vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING_UNTRUSTED;
	else
		vcpu->arch.nested_run_pending = 0;

which is pretty much the same, but at least it's a bit more readable than a
convoluted cast from a bool to an int.

FWIW, I verified this makes the C reproducer happy.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults
  2026-02-27  1:13 [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults Yosry Ahmed
                   ` (2 preceding siblings ...)
  2026-02-27  1:13 ` [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception Yosry Ahmed
@ 2026-03-05 17:26 ` Sean Christopherson
  3 siblings, 0 replies; 13+ messages in thread
From: Sean Christopherson @ 2026-03-05 17:26 UTC (permalink / raw)
  To: Sean Christopherson, Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Fri, 27 Feb 2026 01:13:03 +0000, Yosry Ahmed wrote:
> Fix a couple of bugs related to handling triple faults, namely KVM
> injecting a triple fault into an L2 that hasn't run yet, or KVM
> combining #DB/#BP from KVM_SET_GUEST_DEBUG with existing exceptions
> causing a triple fault (or #DF).
> 
> Either of these bugs can result in a triple fault being injected with
> nested_run_pending=1, leading to triggering the warning in
> __nested_vmx_vmexit().
> 
> [...]

Applied patch 3 to kvm-x86 misc, thanks!

[1/3] KVM: x86: Move nested_run_pending to kvm_vcpu_arch
      (coming separately)
[2/3] KVM: x86: Do not inject triple faults into an L2 with a pending run
      (DROP)
[3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
      https://github.com/kvm-x86/linux/commit/e907b4e72488

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-03-05 17:26 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-27  1:13 [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults Yosry Ahmed
2026-02-27  1:13 ` [PATCH 1/3] KVM: x86: Move nested_run_pending to kvm_vcpu_arch Yosry Ahmed
2026-02-27  1:13 ` [PATCH 2/3] KVM: x86: Do not inject triple faults into an L2 with a pending run Yosry Ahmed
2026-02-27  1:13 ` [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception Yosry Ahmed
2026-02-27 16:06   ` Sean Christopherson
2026-02-27 16:34     ` Sean Christopherson
2026-02-27 17:31       ` Yosry Ahmed
2026-02-27 18:18         ` Sean Christopherson
2026-02-27 18:34           ` Yosry Ahmed
2026-03-02 23:22             ` Sean Christopherson
2026-03-02 23:36               ` Yosry Ahmed
2026-03-02 23:47                 ` Sean Christopherson
2026-03-05 17:26 ` [PATCH 0/3] KVM: x86: Fix incorrect handling of triple faults Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox