* [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT @ 2026-02-03 1:13 Yosry Ahmed 2026-02-03 15:33 ` Yosry Ahmed 2026-02-03 16:12 ` Sean Christopherson 0 siblings, 2 replies; 10+ messages in thread From: Yosry Ahmed @ 2026-02-03 1:13 UTC (permalink / raw) To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Yosry Ahmed KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on nested #VMEXIT. Use the value from vcpu->arch.cr2 instead. The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are only out-of-sync in two cases: after migration, and after a #PF is injected into L2. After migration, the value of CR2 in vmcb02 is uninitialized (i.e. zero), as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using vcpu->arch.cr2 to update vmcb12 is the right thing to do. The #PF injection case is more nuanced. It occurs if KVM injects a #PF into L2, then exits to L1 before it actually runs L2. Although the APM is a bit unclear about when CR2 is written during a #PF, the SDM is more clear: Processors update CR2 whenever a page fault is detected. If a second page fault occurs while an earlier page fault is being delivered, the faulting linear address of the second fault will overwrite the contents of CR2 (replacing the previous address). These updates to CR2 occur even if the page fault results in a double fault or occurs during the delivery of a double fault. KVM injecting the exception surely counts as the #PF being "detected". More importantly, when an exception is injected into L2 at the time of a synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly, such that an L1 hypervisor can re-inject the exception. If CR2 is not written at that point, the L1 hypervisor have no way of correctly re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right thing to write in vmcb12 in this case. Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for L2, only when it is injected. The distinction is important, because only injected exceptions are propagated to L1 through exit_int_info. It would be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would perceive an updated CR2 value with no #PF. Update the comment in kvm_deliver_exception_payload() to clarify this. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> --- arch/x86/kvm/svm/nested.c | 2 +- arch/x86/kvm/x86.c | 7 +++++++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index de90b104a0dd5..9031746ce2db1 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) vmcb12->save.efer = svm->vcpu.arch.efer; vmcb12->save.cr0 = kvm_read_cr0(vcpu); vmcb12->save.cr3 = kvm_read_cr3(vcpu); - vmcb12->save.cr2 = vmcb02->save.cr2; + vmcb12->save.cr2 = vcpu->arch.cr2; vmcb12->save.cr4 = svm->vcpu.arch.cr4; vmcb12->save.rflags = kvm_get_rflags(vcpu); vmcb12->save.rip = kvm_rip_read(vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index db3f393192d94..1015522d0fbd7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, vcpu->arch.exception.error_code = error_code; vcpu->arch.exception.has_payload = has_payload; vcpu->arch.exception.payload = payload; + /* + * Only injected exceptions are propagated to L1 in + * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the + * exception payload for L2 until the exception is injected. + * Otherwise, L1 would perceive the updated payload without a + * corresponding exception. + */ if (!is_guest_mode(vcpu)) kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception); -- 2.53.0.rc2.204.g2597b5adb4-goog ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-03 1:13 [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT Yosry Ahmed @ 2026-02-03 15:33 ` Yosry Ahmed 2026-02-03 16:12 ` Sean Christopherson 1 sibling, 0 replies; 10+ messages in thread From: Yosry Ahmed @ 2026-02-03 15:33 UTC (permalink / raw) To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel On Tue, Feb 03, 2026 at 01:13:20AM +0000, Yosry Ahmed wrote: > KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on > nested #VMEXIT. Use the value from vcpu->arch.cr2 instead. > > The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN > of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are > only out-of-sync in two cases: after migration, and after a #PF is > injected into L2. > > After migration, the value of CR2 in vmcb02 is uninitialized (i.e. > zero), as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using > vcpu->arch.cr2 to update vmcb12 is the right thing to do. > > The #PF injection case is more nuanced. It occurs if KVM injects a #PF > into L2, then exits to L1 before it actually runs L2. Although the APM > is a bit unclear about when CR2 is written during a #PF, the SDM is more > clear: > > Processors update CR2 whenever a page fault is detected. If a > second page fault occurs while an earlier page fault is being > delivered, the faulting linear address of the second fault will > overwrite the contents of CR2 (replacing the previous address). > These updates to CR2 occur even if the page fault results in a > double fault or occurs during the delivery of a double fault. > > KVM injecting the exception surely counts as the #PF being "detected". > More importantly, when an exception is injected into L2 at the time of a > synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly, > such that an L1 hypervisor can re-inject the exception. If CR2 is not > written at that point, the L1 hypervisor have no way of correctly > re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right > thing to write in vmcb12 in this case. > > Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for > L2, only when it is injected. The distinction is important, because only > injected exceptions are propagated to L1 through exit_int_info. It would > be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would > perceive an updated CR2 value with no #PF. Update the comment in > kvm_deliver_exception_payload() to clarify this. I forgot the best part: If a synthesized #VMEXIT to L1 writes the wrong CR2 (e.g. right after migration), and L2 is handling a #PF, it could read a corrupted CR2. This could manifest as segmentation faults in L2, or potentially data corruption. Cc: stable@vger.kernel.org > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> > --- > arch/x86/kvm/svm/nested.c | 2 +- > arch/x86/kvm/x86.c | 7 +++++++ > 2 files changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c > index de90b104a0dd5..9031746ce2db1 100644 > --- a/arch/x86/kvm/svm/nested.c > +++ b/arch/x86/kvm/svm/nested.c > @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) > vmcb12->save.efer = svm->vcpu.arch.efer; > vmcb12->save.cr0 = kvm_read_cr0(vcpu); > vmcb12->save.cr3 = kvm_read_cr3(vcpu); > - vmcb12->save.cr2 = vmcb02->save.cr2; > + vmcb12->save.cr2 = vcpu->arch.cr2; > vmcb12->save.cr4 = svm->vcpu.arch.cr4; > vmcb12->save.rflags = kvm_get_rflags(vcpu); > vmcb12->save.rip = kvm_rip_read(vcpu); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index db3f393192d94..1015522d0fbd7 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > vcpu->arch.exception.error_code = error_code; > vcpu->arch.exception.has_payload = has_payload; > vcpu->arch.exception.payload = payload; > + /* > + * Only injected exceptions are propagated to L1 in > + * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the > + * exception payload for L2 until the exception is injected. > + * Otherwise, L1 would perceive the updated payload without a > + * corresponding exception. > + */ > if (!is_guest_mode(vcpu)) > kvm_deliver_exception_payload(vcpu, > &vcpu->arch.exception); > -- > 2.53.0.rc2.204.g2597b5adb4-goog > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-03 1:13 [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT Yosry Ahmed 2026-02-03 15:33 ` Yosry Ahmed @ 2026-02-03 16:12 ` Sean Christopherson 2026-02-03 16:51 ` Yosry Ahmed 1 sibling, 1 reply; 10+ messages in thread From: Sean Christopherson @ 2026-02-03 16:12 UTC (permalink / raw) To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel On Tue, Feb 03, 2026, Yosry Ahmed wrote: > KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on > nested #VMEXIT. Use the value from vcpu->arch.cr2 instead. > > The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN > of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are > only out-of-sync in two cases: after migration, and after a #PF is Nit, instead of "migration", talk about state save/restore, and then list live migration as an example. Most of the time it's fairly obvious that "migration" in KVM means "live migration", but not always. E.g. task migration often comes into play, as does page migration. And more importantly, the above statement is wrong when state is saved/restored for something other than migration. E.g. if userspace is restoring a snapshot for reasons other than migrating a VM. > injected into L2. > > After migration, the value of CR2 in vmcb02 is uninitialized (i.e. save/restore > zero), This isn't guaranteed. E.g. if state is restored into a vCPU that was already running, vmcb02.save.cr2 could hold any number of things. > as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using > vcpu->arch.cr2 to update vmcb12 is the right thing to do. > > The #PF injection case is more nuanced. It occurs if KVM injects a #PF > into L2, then exits to L1 before it actually runs L2. Although the APM > is a bit unclear about when CR2 is written during a #PF, the SDM is more > clear: > > Processors update CR2 whenever a page fault is detected. If a > second page fault occurs while an earlier page fault is being > delivered, the faulting linear address of the second fault will > overwrite the contents of CR2 (replacing the previous address). > These updates to CR2 occur even if the page fault results in a > double fault or occurs during the delivery of a double fault. > > KVM injecting the exception surely counts as the #PF being "detected". Heh, "detected" is definitely poor wording in the SDM. > More importantly, when an exception is injected into L2 at the time of a > synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly, > such that an L1 hypervisor can re-inject the exception. If CR2 is not > written at that point, the L1 hypervisor have no way of correctly > re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right > thing to write in vmcb12 in this case. > > Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for > L2, only when it is injected. The distinction is important, because only > injected exceptions are propagated to L1 through exit_int_info. It would > be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would > perceive an updated CR2 value with no #PF. Update the comment in > kvm_deliver_exception_payload() to clarify this. > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> > --- > arch/x86/kvm/svm/nested.c | 2 +- > arch/x86/kvm/x86.c | 7 +++++++ > 2 files changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c > index de90b104a0dd5..9031746ce2db1 100644 > --- a/arch/x86/kvm/svm/nested.c > +++ b/arch/x86/kvm/svm/nested.c > @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) > vmcb12->save.efer = svm->vcpu.arch.efer; > vmcb12->save.cr0 = kvm_read_cr0(vcpu); > vmcb12->save.cr3 = kvm_read_cr3(vcpu); > - vmcb12->save.cr2 = vmcb02->save.cr2; > + vmcb12->save.cr2 = vcpu->arch.cr2; > vmcb12->save.cr4 = svm->vcpu.arch.cr4; > vmcb12->save.rflags = kvm_get_rflags(vcpu); > vmcb12->save.rip = kvm_rip_read(vcpu); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index db3f393192d94..1015522d0fbd7 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > vcpu->arch.exception.error_code = error_code; > vcpu->arch.exception.has_payload = has_payload; > vcpu->arch.exception.payload = payload; > + /* > + * Only injected exceptions are propagated to L1 in > + * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the Nit, #VMEXIT is SVM specific terminology. VM-Exit is more vendor agnostic. > + * exception payload for L2 until the exception is injected. > + * Otherwise, L1 would perceive the updated payload without a > + * corresponding exception. Huh. I'm fairly certain this code is now at best unnecessary, and at worst actively harmful. Because the more architectural way to document this code is: /* * If L2 is active, defer delivery of the payload until the * exception is actually injected to avoid clobbering state if * L1 wants to intercept the exception (the architectural state * is NOT updated if the exeption is morphed to a VM-Exit). */ But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a VM-Exit. Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that commit. Because invoking kvm_deliver_exception_payload() when the exception was morphed to a VM-Exit is wrong. Oh, wait, this is the !exception_payload_enabled case. So never mind, that's simply an unfixable bug, as the second comment alludes to. /* * KVM's ABI only allows for one exception to be migrated. Luckily, * the only time there can be two queued exceptions is if there's a * non-exiting _injected_ exception, and a pending exiting exception. * In that case, ignore the VM-Exiting exception as it's an extension * of the injected exception. */ if (vcpu->arch.exception_vmexit.pending && !vcpu->arch.exception.pending && !vcpu->arch.exception.injected) ex = &vcpu->arch.exception_vmexit; else ex = &vcpu->arch.exception; /* * In guest mode, payload delivery should be deferred if the exception * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not * propagate the payload and so it cannot be safely deferred. Deliver * the payload if the capability hasn't been requested. */ if (!vcpu->kvm->arch.exception_payload_enabled && ex->pending && ex->has_payload) kvm_deliver_exception_payload(vcpu, ex); So yeah, I _think_ we could drop the is_guest_mode() check. However, even better would be to drop this call *entirely*, i.e. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b0112c515584..00a39c95631d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, vcpu->arch.exception.error_code = error_code; vcpu->arch.exception.has_payload = has_payload; vcpu->arch.exception.payload = payload; - if (!is_guest_mode(vcpu)) - kvm_deliver_exception_payload(vcpu, - &vcpu->arch.exception); return; } Because KVM really shouldn't update CR2 until the excpetion is actually injected (or the state is at risk of being lost because exception_payload_enabled==false). E.g. in the (extremely) unlikely (and probably impossible to configure reliably) scenario that userspace deliberately drops a pending exception, arch state shouldn't be updated. > + */ > if (!is_guest_mode(vcpu)) > kvm_deliver_exception_payload(vcpu, > &vcpu->arch.exception); > -- > 2.53.0.rc2.204.g2597b5adb4-goog > ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-03 16:12 ` Sean Christopherson @ 2026-02-03 16:51 ` Yosry Ahmed 2026-02-03 18:03 ` Sean Christopherson 0 siblings, 1 reply; 10+ messages in thread From: Yosry Ahmed @ 2026-02-03 16:51 UTC (permalink / raw) To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote: > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on > > nested #VMEXIT. Use the value from vcpu->arch.cr2 instead. > > > > The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN > > of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are > > only out-of-sync in two cases: after migration, and after a #PF is > > Nit, instead of "migration", talk about state save/restore, and then list live > migration as an example. Most of the time it's fairly obvious that "migration" > in KVM means "live migration", but not always. E.g. task migration often comes > into play, as does page migration. > > And more importantly, the above statement is wrong when state is saved/restored > for something other than migration. E.g. if userspace is restoring a snapshot > for reasons other than migrating a VM. Makes sense, will fix. > > > injected into L2. > > > > After migration, the value of CR2 in vmcb02 is uninitialized (i.e. > > save/restore > > > zero), > > This isn't guaranteed. E.g. if state is restored into a vCPU that was already > running, vmcb02.save.cr2 could hold any number of things. Good point, I missed that case. > > > as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using > > vcpu->arch.cr2 to update vmcb12 is the right thing to do. > > > > The #PF injection case is more nuanced. It occurs if KVM injects a #PF > > into L2, then exits to L1 before it actually runs L2. Although the APM > > is a bit unclear about when CR2 is written during a #PF, the SDM is more > > clear: > > > > Processors update CR2 whenever a page fault is detected. If a > > second page fault occurs while an earlier page fault is being > > delivered, the faulting linear address of the second fault will > > overwrite the contents of CR2 (replacing the previous address). > > These updates to CR2 occur even if the page fault results in a > > double fault or occurs during the delivery of a double fault. > > > > KVM injecting the exception surely counts as the #PF being "detected". > > Heh, "detected" is definitely poor wording in the SDM. > > > More importantly, when an exception is injected into L2 at the time of a > > synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly, > > such that an L1 hypervisor can re-inject the exception. If CR2 is not > > written at that point, the L1 hypervisor have no way of correctly > > re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right > > thing to write in vmcb12 in this case. > > > > Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for > > L2, only when it is injected. The distinction is important, because only > > injected exceptions are propagated to L1 through exit_int_info. It would > > be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would > > perceive an updated CR2 value with no #PF. Update the comment in > > kvm_deliver_exception_payload() to clarify this. > > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> > > --- > > arch/x86/kvm/svm/nested.c | 2 +- > > arch/x86/kvm/x86.c | 7 +++++++ > > 2 files changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c > > index de90b104a0dd5..9031746ce2db1 100644 > > --- a/arch/x86/kvm/svm/nested.c > > +++ b/arch/x86/kvm/svm/nested.c > > @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) > > vmcb12->save.efer = svm->vcpu.arch.efer; > > vmcb12->save.cr0 = kvm_read_cr0(vcpu); > > vmcb12->save.cr3 = kvm_read_cr3(vcpu); > > - vmcb12->save.cr2 = vmcb02->save.cr2; > > + vmcb12->save.cr2 = vcpu->arch.cr2; > > vmcb12->save.cr4 = svm->vcpu.arch.cr4; > > vmcb12->save.rflags = kvm_get_rflags(vcpu); > > vmcb12->save.rip = kvm_rip_read(vcpu); > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index db3f393192d94..1015522d0fbd7 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > > vcpu->arch.exception.error_code = error_code; > > vcpu->arch.exception.has_payload = has_payload; > > vcpu->arch.exception.payload = payload; > > + /* > > + * Only injected exceptions are propagated to L1 in > > + * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the > > Nit, #VMEXIT is SVM specific terminology. VM-Exit is more vendor agnostic. Will do. > > > + * exception payload for L2 until the exception is injected. > > + * Otherwise, L1 would perceive the updated payload without a > > + * corresponding exception. > > Huh. I'm fairly certain this code is now at best unnecessary, and at worst > actively harmful. Because the more architectural way to document this code is: > > /* > * If L2 is active, defer delivery of the payload until the > * exception is actually injected to avoid clobbering state if > * L1 wants to intercept the exception (the architectural state > * is NOT updated if the exeption is morphed to a VM-Exit). > */ It's not only about exceptions being morphed to a VM-Exit though, is it? KVM should not update the payload (e.g. CR2) if a #PF is pending but was not injected, because from L1's perspective CR2 was updated but exit_int_info won't reflect a #PF. Right? > > But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending > VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a > VM-Exit. > > Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that > commit. Because invoking kvm_deliver_exception_payload() when the exception was > morphed to a VM-Exit is wrong. Oh, wait, this is the !exception_payload_enabled > case. So never mind, that's simply an unfixable bug, as the second comment alludes > to. Hmm for the #PF case I think delivering the payload is always wrong if it was morphed to a VM-Exit, regardless of exception_payload_enabled, because the payload should never reach CR2, right? Spoiler alert, there's another problem there. Even if the exception did not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the delivered payload is lost :/ > > /* > * KVM's ABI only allows for one exception to be migrated. Luckily, > * the only time there can be two queued exceptions is if there's a > * non-exiting _injected_ exception, and a pending exiting exception. > * In that case, ignore the VM-Exiting exception as it's an extension > * of the injected exception. > */ > if (vcpu->arch.exception_vmexit.pending && > !vcpu->arch.exception.pending && > !vcpu->arch.exception.injected) > ex = &vcpu->arch.exception_vmexit; > else > ex = &vcpu->arch.exception; > > /* > * In guest mode, payload delivery should be deferred if the exception > * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 > * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, > * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not > * propagate the payload and so it cannot be safely deferred. Deliver > * the payload if the capability hasn't been requested. > */ > if (!vcpu->kvm->arch.exception_payload_enabled && > ex->pending && ex->has_payload) > kvm_deliver_exception_payload(vcpu, ex); > > So yeah, I _think_ we could drop the is_guest_mode() check. However, even better > would be to drop this call *entirely*, i.e. Hmm I don't think so, because as I mentioned above, KVM shouldn't update CR2 until the exception is actually injected, right? It would actually be great to drop the is_guest_mode() check here but leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS and KVM_SET_SREGS goes away, and I *think* we can drop the kvm_deliver_exception_payload() call in kvm_vcpu_ioctl_x86_get_vcpu_events(). The only problem would be CR2 getting updated without a fault being reflected in the vmcb12's exit_int_info AFAICT. > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index b0112c515584..00a39c95631d 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > vcpu->arch.exception.error_code = error_code; > vcpu->arch.exception.has_payload = has_payload; > vcpu->arch.exception.payload = payload; > - if (!is_guest_mode(vcpu)) > - kvm_deliver_exception_payload(vcpu, > - &vcpu->arch.exception); > return; > } > > Because KVM really shouldn't update CR2 until the excpetion is actually injected > (or the state is at risk of being lost because exception_payload_enabled==false). > E.g. in the (extremely) unlikely (and probably impossible to configure reliably) > scenario that userspace deliberately drops a pending exception, arch state shouldn't > be updated. I think if we drop it there might be a problem. With exception_payload_enabled==false, pending exceptions becomes injected after save/restore. If the payload is not delivered here, then after restore we have an injected event with no payload. I guess the kvm_deliver_exception_payload() call in kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS. Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT. > > > + */ > > if (!is_guest_mode(vcpu)) > > kvm_deliver_exception_payload(vcpu, > > &vcpu->arch.exception); > > -- > > 2.53.0.rc2.204.g2597b5adb4-goog > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-03 16:51 ` Yosry Ahmed @ 2026-02-03 18:03 ` Sean Christopherson 2026-02-03 19:08 ` Yosry Ahmed 2026-02-11 20:40 ` Yosry Ahmed 0 siblings, 2 replies; 10+ messages in thread From: Sean Christopherson @ 2026-02-03 18:03 UTC (permalink / raw) To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel On Tue, Feb 03, 2026, Yosry Ahmed wrote: > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote: > > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > /* > > * If L2 is active, defer delivery of the payload until the > > * exception is actually injected to avoid clobbering state if > > * L1 wants to intercept the exception (the architectural state > > * is NOT updated if the exeption is morphed to a VM-Exit). > > */ > > It's not only about exceptions being morphed to a VM-Exit though, is it? > KVM should not update the payload (e.g. CR2) if a #PF is pending but was > not injected, because from L1's perspective CR2 was updated but > exit_int_info won't reflect a #PF. Right? Right, but that's got nothing to do with L2 being active. Take nested completely out of the picture, and the above statement holds true as well. "If a #PF is pending but was not injected, then the guest shouldn't see a change in CR2". > > But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending > > VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a > > VM-Exit. > > > > Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that > > commit. Because invoking kvm_deliver_exception_payload() when the exception was > > morphed to a VM-Exit is wrong. Oh, wait, this is the !exception_payload_enabled > > case. So never mind, that's simply an unfixable bug, as the second comment alludes > > to. > > Hmm for the #PF case I think delivering the payload is always wrong if > it was morphed to a VM-Exit, regardless of exception_payload_enabled, > because the payload should never reach CR2, right? Right. > Spoiler alert, there's another problem there. Even if the exception did > not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the > delivered payload is lost :/ > > > > > /* > > * KVM's ABI only allows for one exception to be migrated. Luckily, > > * the only time there can be two queued exceptions is if there's a > > * non-exiting _injected_ exception, and a pending exiting exception. > > * In that case, ignore the VM-Exiting exception as it's an extension > > * of the injected exception. > > */ > > if (vcpu->arch.exception_vmexit.pending && > > !vcpu->arch.exception.pending && > > !vcpu->arch.exception.injected) > > ex = &vcpu->arch.exception_vmexit; > > else > > ex = &vcpu->arch.exception; > > > > /* > > * In guest mode, payload delivery should be deferred if the exception > > * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 > > * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, > > * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not > > * propagate the payload and so it cannot be safely deferred. Deliver > > * the payload if the capability hasn't been requested. > > */ > > if (!vcpu->kvm->arch.exception_payload_enabled && > > ex->pending && ex->has_payload) > > kvm_deliver_exception_payload(vcpu, ex); > > > > So yeah, I _think_ we could drop the is_guest_mode() check. However, even better > > would be to drop this call *entirely*, i.e. > > Hmm I don't think so, because as I mentioned above, KVM shouldn't update > CR2 until the exception is actually injected, right? Me confused. What you're saying is what I'm suggesting: don't update CR2 until KVM actually stuffs the #PF into the VMCS/VMCB. Oh, or are you talking about the first sentence? If so, strike that from the record, I was wrong (see below). > It would actually be great to drop the is_guest_mode() check here but > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS > and KVM_SET_SREGS goes away, and I *think* we can drop the > kvm_deliver_exception_payload() call in > kvm_vcpu_ioctl_x86_get_vcpu_events(). > > The only problem would be CR2 getting updated without a fault being > reflected in the vmcb12's exit_int_info AFAICT. No, that particular case is a non-issue, because the code immediately above has already verified that KVM will *not* morph the #PF to a nested VM-Exit. Note, the "queue:" path is just for non-contributory exceptions and doesn't change the VM-Exit change anyways. /* * If the exception is destined for L2, morph it to a VM-Exit if L1 * wants to intercept the exception. */ if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) { <==== kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code, has_payload, payload); return; } if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) { queue: vcpu->arch.exception.pending = true; vcpu->arch.exception.injected = false; vcpu->arch.exception.has_error_code = has_error; vcpu->arch.exception.vector = nr; vcpu->arch.exception.error_code = error_code; vcpu->arch.exception.has_payload = has_payload; vcpu->arch.exception.payload = payload; if (!is_guest_mode(vcpu)) kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception); return; } > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index b0112c515584..00a39c95631d 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > > vcpu->arch.exception.error_code = error_code; > > vcpu->arch.exception.has_payload = has_payload; > > vcpu->arch.exception.payload = payload; > > - if (!is_guest_mode(vcpu)) > > - kvm_deliver_exception_payload(vcpu, > > - &vcpu->arch.exception); > > return; > > } > > > > Because KVM really shouldn't update CR2 until the excpetion is actually injected > > (or the state is at risk of being lost because exception_payload_enabled==false). > > E.g. in the (extremely) unlikely (and probably impossible to configure reliably) > > scenario that userspace deliberately drops a pending exception, arch state shouldn't > > be updated. > > I think if we drop it there might be a problem. With > exception_payload_enabled==false, pending exceptions becomes injected > after save/restore. If the payload is not delivered here, then after > restore we have an injected event with no payload. > > I guess the kvm_deliver_exception_payload() call in > kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it > only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS. > Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT. Drat, QEMU does KVM_GET_VCPU_EVENTS before KVM_GET_SREGS{,2}, so I was hopeful we wouldn't actually break anyone, but Firecracker at least gets sregs before events. Of course Firecracker is already broken due to not enabling KVM_CAP_EXCEPTION_PAYLOAD... Ugh, and it's not just KVM_GET_SREGS, it's also KVM_GET_DEBUGREGS thanks to DR6. If we're being _super_ pedantic, then delivering the payload anywhere but injection or KVM_GET_VCPU_EVENTS is "wrong" (well, "more wrong", since any behavior without KVM_CAP_EXCEPTION_PAYLOAD is inherently wrong). E.g. very hypothetically, userspace that saves/restores sregs but not vCPU events would see an unexpected CR2 change. *sigh* Ewwwwwwww. And we *definitely* don't want to drop the is_guest_mode() check, because nested_vmx_update_pending_dbg() relies on the payload to still be valid. Ah, right, and that's also why commit a06230b62b89 ("KVM: x86: Deliver exception payload on KVM_GET_VCPU_EVENTS") from MTF series deferred the update: KVM needs to keep the #DB pending so that a VM-Exit that occurs before the #DB is injected gets recorded in vmcs12. So, with all of that in mind, I believe the best we can do is fully defer delivery of the exception until it's actually injected, and then apply the quirk to the relevant GET APIs. --- arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++----------------- 1 file changed, 39 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b0112c515584..e000521dfc8b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, vcpu->arch.exception.error_code = error_code; vcpu->arch.exception.has_payload = has_payload; vcpu->arch.exception.payload = payload; - if (!is_guest_mode(vcpu)) - kvm_deliver_exception_payload(vcpu, - &vcpu->arch.exception); return; } @@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu, return 0; } -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, - struct kvm_vcpu_events *events) +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu) { - struct kvm_queued_exception *ex; - - process_nmi(vcpu); - -#ifdef CONFIG_KVM_SMM - if (kvm_check_request(KVM_REQ_SMI, vcpu)) - process_smi(vcpu); -#endif - /* * KVM's ABI only allows for one exception to be migrated. Luckily, * the only time there can be two queued exceptions is if there's a @@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, if (vcpu->arch.exception_vmexit.pending && !vcpu->arch.exception.pending && !vcpu->arch.exception.injected) - ex = &vcpu->arch.exception_vmexit; - else - ex = &vcpu->arch.exception; + return &vcpu->arch.exception_vmexit; + + return &vcpu->arch.exception; +} + +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu) +{ + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); /* - * In guest mode, payload delivery should be deferred if the exception - * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 - * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, - * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not - * propagate the payload and so it cannot be safely deferred. Deliver - * the payload if the capability hasn't been requested. + * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver + * the pending exception payload when userspace saves *any* vCPU state + * that interacts with exception payloads to avoid breaking userspace. + * + * Architecturally, KVM must not deliver an exception payload until the + * exception is actually injected, e.g. to avoid losing pending #DB + * information (which VMX tracks in the VMCS), and to avoid clobbering + * state if the exception is never injected for whatever reason. But + * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or + * may not propagate the payload across save+restore, and so KVM can't + * safely defer delivery of the payload. */ if (!vcpu->kvm->arch.exception_payload_enabled && ex->pending && ex->has_payload) kvm_deliver_exception_payload(vcpu, ex); +} + +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, + struct kvm_vcpu_events *events) +{ + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); + + process_nmi(vcpu); + +#ifdef CONFIG_KVM_SMM + if (kvm_check_request(KVM_REQ_SMI, vcpu)) + process_smi(vcpu); +#endif + + kvm_handle_exception_payload_quirk(vcpu); memset(events, 0, sizeof(*events)); @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, vcpu->arch.guest_state_protected) return -EINVAL; + kvm_handle_exception_payload_quirk(vcpu); + memset(dbgregs, 0, sizeof(*dbgregs)); BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); @@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) if (vcpu->arch.guest_state_protected) goto skip_protected_regs; + kvm_handle_exception_payload_quirk(vcpu); + kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS); kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS); kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES); base-commit: 55671237401edd1ec59276b852b9361cc170915b -- ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-03 18:03 ` Sean Christopherson @ 2026-02-03 19:08 ` Yosry Ahmed 2026-02-03 19:42 ` Sean Christopherson 2026-02-11 20:40 ` Yosry Ahmed 1 sibling, 1 reply; 10+ messages in thread From: Yosry Ahmed @ 2026-02-03 19:08 UTC (permalink / raw) To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel On Tue, Feb 03, 2026 at 10:03:35AM -0800, Sean Christopherson wrote: > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote: > > > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > > /* > > > * If L2 is active, defer delivery of the payload until the > > > * exception is actually injected to avoid clobbering state if > > > * L1 wants to intercept the exception (the architectural state > > > * is NOT updated if the exeption is morphed to a VM-Exit). > > > */ > > > > It's not only about exceptions being morphed to a VM-Exit though, is it? > > KVM should not update the payload (e.g. CR2) if a #PF is pending but was > > not injected, because from L1's perspective CR2 was updated but > > exit_int_info won't reflect a #PF. Right? > > Right, but that's got nothing to do with L2 being active. Take nested completely > out of the picture, and the above statement holds true as well. "If a #PF is > pending but was not injected, then the guest shouldn't see a change in CR2". Right, but it is still related to nested in a way. Ignore the exception morphing to a VM-Exit, the case I am refering to is specifically exit_int_info on SVM. IIUC, if there's an injected (but not intercepted) exception when doing a nested VM-Exit, we have to propagate that to L1 (in nested_save_pending_event_to_vmcb12()), such that it can re-inject that exception. So what I was referring to is, if we write CR2 for a pending exception to L2, and then exit to L1, L1 would perceive a chance in CR2 without an ongoing #PF in exit_int_info. I believe the equivalent VMX function is vmcs12_save_pending_event(). All that to say, we should not deliver the payload of an exception to L2 before it's actually injected. > > > > But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending > > > VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a > > > VM-Exit. > > > > > > Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that > > > commit. Because invoking kvm_deliver_exception_payload() when the exception was > > > morphed to a VM-Exit is wrong. Oh, wait, this is the !exception_payload_enabled > > > case. So never mind, that's simply an unfixable bug, as the second comment alludes > > > to. > > > > Hmm for the #PF case I think delivering the payload is always wrong if > > it was morphed to a VM-Exit, regardless of exception_payload_enabled, > > because the payload should never reach CR2, right? > > Right. > > > Spoiler alert, there's another problem there. Even if the exception did > > not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the > > delivered payload is lost :/ > > > > > > > > /* > > > * KVM's ABI only allows for one exception to be migrated. Luckily, > > > * the only time there can be two queued exceptions is if there's a > > > * non-exiting _injected_ exception, and a pending exiting exception. > > > * In that case, ignore the VM-Exiting exception as it's an extension > > > * of the injected exception. > > > */ > > > if (vcpu->arch.exception_vmexit.pending && > > > !vcpu->arch.exception.pending && > > > !vcpu->arch.exception.injected) > > > ex = &vcpu->arch.exception_vmexit; > > > else > > > ex = &vcpu->arch.exception; > > > > > > /* > > > * In guest mode, payload delivery should be deferred if the exception > > > * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 > > > * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, > > > * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not > > > * propagate the payload and so it cannot be safely deferred. Deliver > > > * the payload if the capability hasn't been requested. > > > */ > > > if (!vcpu->kvm->arch.exception_payload_enabled && > > > ex->pending && ex->has_payload) > > > kvm_deliver_exception_payload(vcpu, ex); > > > > > > So yeah, I _think_ we could drop the is_guest_mode() check. However, even better > > > would be to drop this call *entirely*, i.e. > > > > Hmm I don't think so, because as I mentioned above, KVM shouldn't update > > CR2 until the exception is actually injected, right? > > Me confused. What you're saying is what I'm suggesting: don't update CR2 until > KVM actually stuffs the #PF into the VMCS/VMCB. Oh, or are you talking about the > first sentence? If so, strike that from the record, I was wrong (see below). Yup, first sentence. > > > It would actually be great to drop the is_guest_mode() check here but > > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS > > and KVM_SET_SREGS goes away, and I *think* we can drop the > > kvm_deliver_exception_payload() call in > > kvm_vcpu_ioctl_x86_get_vcpu_events(). > > > > The only problem would be CR2 getting updated without a fault being > > reflected in the vmcb12's exit_int_info AFAICT. > > No, that particular case is a non-issue, because the code immediately above has > already verified that KVM will *not* morph the #PF to a nested VM-Exit. Note, > the "queue:" path is just for non-contributory exceptions and doesn't change the > VM-Exit change anyways. What I meant was not stuffing the #PF into the VMCB/VMCS because it's intercepted, but the #PF being stuffed into exit_int_info or idt_vectoring_info. If we drop the guest mode check here, we could end up with CR2 updated and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming #PF is not intercepted). > > /* > * If the exception is destined for L2, morph it to a VM-Exit if L1 > * wants to intercept the exception. > */ > if (is_guest_mode(vcpu) && > kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) { <==== > kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code, > has_payload, payload); > return; > } > > if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) { > queue: > vcpu->arch.exception.pending = true; > vcpu->arch.exception.injected = false; > > vcpu->arch.exception.has_error_code = has_error; > vcpu->arch.exception.vector = nr; > vcpu->arch.exception.error_code = error_code; > vcpu->arch.exception.has_payload = has_payload; > vcpu->arch.exception.payload = payload; > if (!is_guest_mode(vcpu)) > kvm_deliver_exception_payload(vcpu, > &vcpu->arch.exception); > return; > } > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index b0112c515584..00a39c95631d 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > > > vcpu->arch.exception.error_code = error_code; > > > vcpu->arch.exception.has_payload = has_payload; > > > vcpu->arch.exception.payload = payload; > > > - if (!is_guest_mode(vcpu)) > > > - kvm_deliver_exception_payload(vcpu, > > > - &vcpu->arch.exception); > > > return; > > > } > > > > > > Because KVM really shouldn't update CR2 until the excpetion is actually injected > > > (or the state is at risk of being lost because exception_payload_enabled==false). > > > E.g. in the (extremely) unlikely (and probably impossible to configure reliably) > > > scenario that userspace deliberately drops a pending exception, arch state shouldn't > > > be updated. > > > > I think if we drop it there might be a problem. With > > exception_payload_enabled==false, pending exceptions becomes injected > > after save/restore. If the payload is not delivered here, then after > > restore we have an injected event with no payload. > > > > I guess the kvm_deliver_exception_payload() call in > > kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it > > only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS. > > Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT. > > Drat, QEMU does KVM_GET_VCPU_EVENTS before KVM_GET_SREGS{,2}, so I was hopeful > we wouldn't actually break anyone, but Firecracker at least gets sregs before > events. Of course Firecracker is already broken due to not enabling > KVM_CAP_EXCEPTION_PAYLOAD... > > Ugh, and it's not just KVM_GET_SREGS, it's also KVM_GET_DEBUGREGS thanks to DR6. > > If we're being _super_ pedantic, then delivering the payload anywhere but injection > or KVM_GET_VCPU_EVENTS is "wrong" (well, "more wrong", since any behavior without > KVM_CAP_EXCEPTION_PAYLOAD is inherently wrong). E.g. very hypothetically, userspace > that saves/restores sregs but not vCPU events would see an unexpected CR2 change. > > *sigh* > > Ewwwwwwww. And we *definitely* don't want to drop the is_guest_mode() check, > because nested_vmx_update_pending_dbg() relies on the payload to still be valid. > > Ah, right, and that's also why commit a06230b62b89 ("KVM: x86: Deliver exception > payload on KVM_GET_VCPU_EVENTS") from MTF series deferred the update: KVM needs > to keep the #DB pending so that a VM-Exit that occurs before the #DB is injected > gets recorded in vmcs12. > > So, with all of that in mind, I believe the best we can do is fully defer delivery > of the exception until it's actually injected, and then apply the quirk to the > relevant GET APIs. I think this should work. I can test it for the nested case, the way I could reproduce the problem (with a VMM that does KVM_GET_SREGS before KVM_GET_VCPU_EVENTS, but does not use KVM_CAP_EXCEPTION_PAYLOAD) is by intercepting and re-injecting all #PFs from L2, and then repeatedly doing save+restore while L2 is doing some heavy lifting (building GCC). This generates a lot of #PF exceptions to be saved+restored, and we eventually get a segfault because of corrupted CR2 in L2. Removing the is_guest_mode() check in kvm_multiple_exception() fixes it by prematurely delivering the payload when it's queued. I think your fix will also work by prematurely delivering the payload at save time. This is actually more corect because at restore time the exception will become injected and treated as such (e.g. shows up in exit_int_info). Do you intend to send a patch? Or should I send it out (separate from the current one) with you as the author? > > --- > arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++----------------- > 1 file changed, 39 insertions(+), 23 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index b0112c515584..e000521dfc8b 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > vcpu->arch.exception.error_code = error_code; > vcpu->arch.exception.has_payload = has_payload; > vcpu->arch.exception.payload = payload; > - if (!is_guest_mode(vcpu)) > - kvm_deliver_exception_payload(vcpu, > - &vcpu->arch.exception); > return; > } > > @@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu, > return 0; > } > > -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, > - struct kvm_vcpu_events *events) > +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu) > { > - struct kvm_queued_exception *ex; > - > - process_nmi(vcpu); > - > -#ifdef CONFIG_KVM_SMM > - if (kvm_check_request(KVM_REQ_SMI, vcpu)) > - process_smi(vcpu); > -#endif > - > /* > * KVM's ABI only allows for one exception to be migrated. Luckily, > * the only time there can be two queued exceptions is if there's a > @@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, > if (vcpu->arch.exception_vmexit.pending && > !vcpu->arch.exception.pending && > !vcpu->arch.exception.injected) > - ex = &vcpu->arch.exception_vmexit; > - else > - ex = &vcpu->arch.exception; > + return &vcpu->arch.exception_vmexit; > + > + return &vcpu->arch.exception; > +} > + > +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu) > +{ > + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); > > /* > - * In guest mode, payload delivery should be deferred if the exception > - * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 > - * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, > - * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not > - * propagate the payload and so it cannot be safely deferred. Deliver > - * the payload if the capability hasn't been requested. > + * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver > + * the pending exception payload when userspace saves *any* vCPU state > + * that interacts with exception payloads to avoid breaking userspace. > + * > + * Architecturally, KVM must not deliver an exception payload until the > + * exception is actually injected, e.g. to avoid losing pending #DB > + * information (which VMX tracks in the VMCS), and to avoid clobbering > + * state if the exception is never injected for whatever reason. But > + * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or > + * may not propagate the payload across save+restore, and so KVM can't > + * safely defer delivery of the payload. > */ > if (!vcpu->kvm->arch.exception_payload_enabled && > ex->pending && ex->has_payload) > kvm_deliver_exception_payload(vcpu, ex); > +} > + > +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, > + struct kvm_vcpu_events *events) > +{ > + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); > + > + process_nmi(vcpu); > + > +#ifdef CONFIG_KVM_SMM > + if (kvm_check_request(KVM_REQ_SMI, vcpu)) > + process_smi(vcpu); > +#endif > + > + kvm_handle_exception_payload_quirk(vcpu); > > memset(events, 0, sizeof(*events)); > > @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, > vcpu->arch.guest_state_protected) > return -EINVAL; > > + kvm_handle_exception_payload_quirk(vcpu); > + > memset(dbgregs, 0, sizeof(*dbgregs)); > > BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); > @@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) > if (vcpu->arch.guest_state_protected) > goto skip_protected_regs; > > + kvm_handle_exception_payload_quirk(vcpu); > + > kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS); > kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS); > kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES); > > base-commit: 55671237401edd1ec59276b852b9361cc170915b > -- ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-03 19:08 ` Yosry Ahmed @ 2026-02-03 19:42 ` Sean Christopherson 2026-02-03 19:54 ` Yosry Ahmed 0 siblings, 1 reply; 10+ messages in thread From: Sean Christopherson @ 2026-02-03 19:42 UTC (permalink / raw) To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel On Tue, Feb 03, 2026, Yosry Ahmed wrote: > On Tue, Feb 03, 2026 at 10:03:35AM -0800, Sean Christopherson wrote: > > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote: > > > > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > > > /* > > > > * If L2 is active, defer delivery of the payload until the > > > > * exception is actually injected to avoid clobbering state if > > > > * L1 wants to intercept the exception (the architectural state > > > > * is NOT updated if the exeption is morphed to a VM-Exit). > > > > */ > > > > > > It's not only about exceptions being morphed to a VM-Exit though, is it? > > > KVM should not update the payload (e.g. CR2) if a #PF is pending but was > > > not injected, because from L1's perspective CR2 was updated but > > > exit_int_info won't reflect a #PF. Right? > > > > Right, but that's got nothing to do with L2 being active. Take nested completely > > out of the picture, and the above statement holds true as well. "If a #PF is > > pending but was not injected, then the guest shouldn't see a change in CR2". > > Right, but it is still related to nested in a way. Ignore the exception > morphing to a VM-Exit, the case I am refering to is specifically > exit_int_info on SVM. IIUC, if there's an injected (but not intercepted) > exception when doing a nested VM-Exit, we have to propagate that to L1 > (in nested_save_pending_event_to_vmcb12()), such that it can re-inject > that exception. Ugh, that's a poor choice of name for nested_save_pending_event_to_vmcb12(). As defined by kvm_queued_exception, that's not a *pending* event, it's an *injected* event. In that case, the payload *should* have been delivered (to CR2 or DR6) because that exception has already occurred (been "detected" in the SDM's weird wording). The VM-Exit is not happening *before* the #PF, it's happening after the #PF is "detected", while the #PF is being vectored. From a virtualization perspective, any other implementation is basically unworkable, as it would require the host to gain control after an exception is successfully vectored. I.e. the absense of any mechanisms to support that effectively confirms that the CPU writes CR2 before attempting to deliver the exception to software. > So what I was referring to is, if we write CR2 for a pending exception > to L2, and then exit to L1, L1 would perceive a chance in CR2 without an > ongoing #PF in exit_int_info. I believe the equivalent VMX function is > vmcs12_save_pending_event(). Also poorly named :-/ > All that to say, we should not deliver the payload of an exception to L2 > before it's actually injected. As above, those helpers deal with exceptions that have already been injected by KVM. > > > It would actually be great to drop the is_guest_mode() check here but > > > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS > > > and KVM_SET_SREGS goes away, and I *think* we can drop the > > > kvm_deliver_exception_payload() call in > > > kvm_vcpu_ioctl_x86_get_vcpu_events(). > > > > > > The only problem would be CR2 getting updated without a fault being > > > reflected in the vmcb12's exit_int_info AFAICT. > > > > No, that particular case is a non-issue, because the code immediately above has > > already verified that KVM will *not* morph the #PF to a nested VM-Exit. Note, > > the "queue:" path is just for non-contributory exceptions and doesn't change the > > VM-Exit change anyways. > > What I meant was not stuffing the #PF into the VMCB/VMCS because it's > intercepted, but the #PF being stuffed into exit_int_info or > idt_vectoring_info. > > If we drop the guest mode check here, we could end up with CR2 updated > and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming > #PF is not intercepted). No, because once {svm,vmx}_inject_exception() have been reach, KVM has fully committed to delivering the exception to the guest. If KVM cancels KVM_RUN, e.g. because of a pending signal from userspace to initiate save/restore, KVM calls kvm_x86_ops.cancel_injection() so that vendor code can move the to-be-injected exception from the VMCS/VMCB back to vcpu->arch.exception. Note that kvm_requeue_exception() (a) sets injected=true and (b) deliberately doesn't track any payload, because the payload has already been delivered. If VM-Enter is executed and a non-nested VM-Exit occurs, then hardware saves the in-progress exception in VMCB.exit_int_info/VMCS.idt_vectoring_info, and KVM moves the exception back to vcpu->arch.exception via vmx_complete_interrupts() and svm_complete_interrupts() (which are also used for cancelling injection, because the logic is identical, only the VMCS/VMCB source differs). For nested VM-Exit, KVM needs to emulate that behavior. The exception has already been "detected" by KVM, and the payload has already been delivered, but a VM-Exit was encountered while vectoring the exception to software. E.g. if a guest #PF occurs while the guest stack is at the bottom of a page, such that the first N pushes will hit page X, and the last M pushes will hit page X-1, and the write to page X-1 hits a #NPF / EPT Violation, then L1 will (and should!) see an updated CR2, with the first N pushes to vector the exception resident in page X. > > So, with all of that in mind, I believe the best we can do is fully defer delivery > > of the exception until it's actually injected, and then apply the quirk to the > > relevant GET APIs. > > I think this should work. I can test it for the nested case, the way I > could reproduce the problem (with a VMM that does KVM_GET_SREGS before > KVM_GET_VCPU_EVENTS, but does not use KVM_CAP_EXCEPTION_PAYLOAD) is by > intercepting and re-injecting all #PFs from L2, and then repeatedly > doing save+restore while L2 is doing some heavy lifting (building GCC). > This generates a lot of #PF exceptions to be saved+restored, and we > eventually get a segfault because of corrupted CR2 in L2. > > Removing the is_guest_mode() check in kvm_multiple_exception() fixes it > by prematurely delivering the payload when it's queued. I think your fix > will also work by prematurely delivering the payload at save time. This > is actually more corect because at restore time the exception will > become injected and treated as such (e.g. shows up in exit_int_info). > > Do you intend to send a patch? Or should I send it out (separate from > the current one) with you as the author? I'll send a patch for this, there's a lot of historical information I want to capture. Can you send a v2 for _this_ patch, without the comment change? ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-03 19:42 ` Sean Christopherson @ 2026-02-03 19:54 ` Yosry Ahmed 0 siblings, 0 replies; 10+ messages in thread From: Yosry Ahmed @ 2026-02-03 19:54 UTC (permalink / raw) To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel On Tue, Feb 03, 2026 at 11:42:01AM -0800, Sean Christopherson wrote: > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > On Tue, Feb 03, 2026 at 10:03:35AM -0800, Sean Christopherson wrote: > > > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > > > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote: > > > > > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > > > > /* > > > > > * If L2 is active, defer delivery of the payload until the > > > > > * exception is actually injected to avoid clobbering state if > > > > > * L1 wants to intercept the exception (the architectural state > > > > > * is NOT updated if the exeption is morphed to a VM-Exit). > > > > > */ > > > > > > > > It's not only about exceptions being morphed to a VM-Exit though, is it? > > > > KVM should not update the payload (e.g. CR2) if a #PF is pending but was > > > > not injected, because from L1's perspective CR2 was updated but > > > > exit_int_info won't reflect a #PF. Right? > > > > > > Right, but that's got nothing to do with L2 being active. Take nested completely > > > out of the picture, and the above statement holds true as well. "If a #PF is > > > pending but was not injected, then the guest shouldn't see a change in CR2". > > > > Right, but it is still related to nested in a way. Ignore the exception > > morphing to a VM-Exit, the case I am refering to is specifically > > exit_int_info on SVM. IIUC, if there's an injected (but not intercepted) > > exception when doing a nested VM-Exit, we have to propagate that to L1 > > (in nested_save_pending_event_to_vmcb12()), such that it can re-inject > > that exception. > > Ugh, that's a poor choice of name for nested_save_pending_event_to_vmcb12(). > > As defined by kvm_queued_exception, that's not a *pending* event, it's an > *injected* event. In that case, the payload *should* have been delivered (to CR2 > or DR6) because that exception has already occurred (been "detected" in the SDM's > weird wording). The VM-Exit is not happening *before* the #PF, it's happening > after the #PF is "detected", while the #PF is being vectored. > > From a virtualization perspective, any other implementation is basically unworkable, > as it would require the host to gain control after an exception is successfully > vectored. I.e. the absense of any mechanisms to support that effectively confirms > that the CPU writes CR2 before attempting to deliver the exception to software. > > > So what I was referring to is, if we write CR2 for a pending exception > > to L2, and then exit to L1, L1 would perceive a chance in CR2 without an > > ongoing #PF in exit_int_info. I believe the equivalent VMX function is > > vmcs12_save_pending_event(). > > Also poorly named :-/ > > > All that to say, we should not deliver the payload of an exception to L2 > > before it's actually injected. > > As above, those helpers deal with exceptions that have already been injected by > KVM. > > > > > It would actually be great to drop the is_guest_mode() check here but > > > > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS > > > > and KVM_SET_SREGS goes away, and I *think* we can drop the > > > > kvm_deliver_exception_payload() call in > > > > kvm_vcpu_ioctl_x86_get_vcpu_events(). > > > > > > > > The only problem would be CR2 getting updated without a fault being > > > > reflected in the vmcb12's exit_int_info AFAICT. > > > > > > No, that particular case is a non-issue, because the code immediately above has > > > already verified that KVM will *not* morph the #PF to a nested VM-Exit. Note, > > > the "queue:" path is just for non-contributory exceptions and doesn't change the > > > VM-Exit change anyways. > > > > What I meant was not stuffing the #PF into the VMCB/VMCS because it's > > intercepted, but the #PF being stuffed into exit_int_info or > > idt_vectoring_info. > > > > If we drop the guest mode check here, we could end up with CR2 updated > > and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming > > #PF is not intercepted). > > No, because once {svm,vmx}_inject_exception() have been reach, KVM has fully > committed to delivering the exception to the guest. If KVM cancels KVM_RUN, e.g. > because of a pending signal from userspace to initiate save/restore, KVM calls > kvm_x86_ops.cancel_injection() so that vendor code can move the to-be-injected > exception from the VMCS/VMCB back to vcpu->arch.exception. Note that > kvm_requeue_exception() (a) sets injected=true and (b) deliberately doesn't > track any payload, because the payload has already been delivered. > > If VM-Enter is executed and a non-nested VM-Exit occurs, then hardware saves the > in-progress exception in VMCB.exit_int_info/VMCS.idt_vectoring_info, and KVM > moves the exception back to vcpu->arch.exception via vmx_complete_interrupts() > and svm_complete_interrupts() (which are also used for cancelling injection, > because the logic is identical, only the VMCS/VMCB source differs). > > For nested VM-Exit, KVM needs to emulate that behavior. The exception has already > been "detected" by KVM, and the payload has already been delivered, but a VM-Exit > was encountered while vectoring the exception to software. > > E.g. if a guest #PF occurs while the guest stack is at the bottom of a page, such > that the first N pushes will hit page X, and the last M pushes will hit page X-1, > and the write to page X-1 hits a #NPF / EPT Violation, then L1 will (and should!) > see an updated CR2, with the first N pushes to vector the exception resident in > page X. We are not disagreeing, I think I may not have been clear. I agree with all what you said. What I was saying is basically delivering the payload for *pending* exceptions for L2 is wrong (i.e. just removing the is_guest_mode() check before calling kvm_deliver_exception_payload() in kvm_multiple_exception()). Exactly because of everything you said, the exception is not yet *injected* and is *not* reflected to L1 in exit_int_info/idt_vectoring_info if it's injection fails. i.e what I said above: If we drop the guest mode check here, we could end up with CR2 updated and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming #PF is not intercepted). We already established that removing the is_guest_mode() would be wrong, so all is good here, I am just clarifying that what I meant initially is that it would be wrong because we would update CR2 without actually reflecting the #PF in exit_int_info/idt_vectoring_info, not because we could update CR2 before a #PF intercept (as you mentioned, this part is handled). > > > > So, with all of that in mind, I believe the best we can do is fully defer delivery > > > of the exception until it's actually injected, and then apply the quirk to the > > > relevant GET APIs. > > > > I think this should work. I can test it for the nested case, the way I > > could reproduce the problem (with a VMM that does KVM_GET_SREGS before > > KVM_GET_VCPU_EVENTS, but does not use KVM_CAP_EXCEPTION_PAYLOAD) is by > > intercepting and re-injecting all #PFs from L2, and then repeatedly > > doing save+restore while L2 is doing some heavy lifting (building GCC). > > This generates a lot of #PF exceptions to be saved+restored, and we > > eventually get a segfault because of corrupted CR2 in L2. > > > > Removing the is_guest_mode() check in kvm_multiple_exception() fixes it > > by prematurely delivering the payload when it's queued. I think your fix > > will also work by prematurely delivering the payload at save time. This > > is actually more corect because at restore time the exception will > > become injected and treated as such (e.g. shows up in exit_int_info). > > > > Do you intend to send a patch? Or should I send it out (separate from > > the current one) with you as the author? > > I'll send a patch for this, there's a lot of historical information I want to > capture. Sounds good, I will test that it fixes the problem of losing the payload of a nested exception when KVM_GET_SREGS preceeds KVM_GET_VCPU_EVENTS (without KVM_CAP_EXCEPTION_PAYLOAD). Please CC me when you send it. > > Can you send a v2 for _this_ patch, without the comment change? Will do so shortly. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-03 18:03 ` Sean Christopherson 2026-02-03 19:08 ` Yosry Ahmed @ 2026-02-11 20:40 ` Yosry Ahmed 2026-02-12 18:14 ` Sean Christopherson 1 sibling, 1 reply; 10+ messages in thread From: Yosry Ahmed @ 2026-02-11 20:40 UTC (permalink / raw) To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel > So, with all of that in mind, I believe the best we can do is fully defer delivery > of the exception until it's actually injected, and then apply the quirk to the > relevant GET APIs. > > --- > arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++----------------- > 1 file changed, 39 insertions(+), 23 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index b0112c515584..e000521dfc8b 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > vcpu->arch.exception.error_code = error_code; > vcpu->arch.exception.has_payload = has_payload; > vcpu->arch.exception.payload = payload; > - if (!is_guest_mode(vcpu)) > - kvm_deliver_exception_payload(vcpu, > - &vcpu->arch.exception); > return; > } > > @@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu, > return 0; > } > > -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, > - struct kvm_vcpu_events *events) > +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu) > { > - struct kvm_queued_exception *ex; > - > - process_nmi(vcpu); > - > -#ifdef CONFIG_KVM_SMM > - if (kvm_check_request(KVM_REQ_SMI, vcpu)) > - process_smi(vcpu); > -#endif > - > /* > * KVM's ABI only allows for one exception to be migrated. Luckily, > * the only time there can be two queued exceptions is if there's a > @@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, > if (vcpu->arch.exception_vmexit.pending && > !vcpu->arch.exception.pending && > !vcpu->arch.exception.injected) > - ex = &vcpu->arch.exception_vmexit; > - else > - ex = &vcpu->arch.exception; > + return &vcpu->arch.exception_vmexit; > + > + return &vcpu->arch.exception; > +} > + > +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu) > +{ > + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); > > /* > - * In guest mode, payload delivery should be deferred if the exception > - * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 > - * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, > - * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not > - * propagate the payload and so it cannot be safely deferred. Deliver > - * the payload if the capability hasn't been requested. > + * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver > + * the pending exception payload when userspace saves *any* vCPU state > + * that interacts with exception payloads to avoid breaking userspace. > + * > + * Architecturally, KVM must not deliver an exception payload until the > + * exception is actually injected, e.g. to avoid losing pending #DB > + * information (which VMX tracks in the VMCS), and to avoid clobbering > + * state if the exception is never injected for whatever reason. But > + * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or > + * may not propagate the payload across save+restore, and so KVM can't > + * safely defer delivery of the payload. > */ > if (!vcpu->kvm->arch.exception_payload_enabled && > ex->pending && ex->has_payload) > kvm_deliver_exception_payload(vcpu, ex); > +} > + > +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, > + struct kvm_vcpu_events *events) > +{ > + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); > + > + process_nmi(vcpu); > + > +#ifdef CONFIG_KVM_SMM > + if (kvm_check_request(KVM_REQ_SMI, vcpu)) > + process_smi(vcpu); > +#endif > + > + kvm_handle_exception_payload_quirk(vcpu); > > memset(events, 0, sizeof(*events)); > > @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, > vcpu->arch.guest_state_protected) > return -EINVAL; > > + kvm_handle_exception_payload_quirk(vcpu); > + > memset(dbgregs, 0, sizeof(*dbgregs)); > > BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); > @@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) > if (vcpu->arch.guest_state_protected) > goto skip_protected_regs; > > + kvm_handle_exception_payload_quirk(vcpu); > + Hmm looking at this again, I realized it also affects the code path from store_regs(), I think we don't want to prematurely deliver exception payloads in that path. So maybe it's best to move this to kvm_arch_vcpu_ioctl_get_sregs() and kvm_arch_vcpu_ioctl()? The other option is to plumb a boolean that is only set to true in the ioctl code path. > kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS); > kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS); > kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES); > > base-commit: 55671237401edd1ec59276b852b9361cc170915b > -- ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT 2026-02-11 20:40 ` Yosry Ahmed @ 2026-02-12 18:14 ` Sean Christopherson 0 siblings, 0 replies; 10+ messages in thread From: Sean Christopherson @ 2026-02-12 18:14 UTC (permalink / raw) To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel On Wed, Feb 11, 2026, Yosry Ahmed wrote: > > > So, with all of that in mind, I believe the best we can do is fully defer delivery > > of the exception until it's actually injected, and then apply the quirk to the > > relevant GET APIs. > > @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, > > vcpu->arch.guest_state_protected) > > return -EINVAL; > > > > + kvm_handle_exception_payload_quirk(vcpu); > > + > > memset(dbgregs, 0, sizeof(*dbgregs)); > > > > BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); > > @@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) > > if (vcpu->arch.guest_state_protected) > > goto skip_protected_regs; > > > > + kvm_handle_exception_payload_quirk(vcpu); > > + > > Hmm looking at this again, I realized it also affects the code path from > store_regs(), I think we don't want to prematurely deliver exception > payloads in that path. Hrm, I actually think delivering the payload in store_regs() is the least awful option. E.g. a VMM that saves sregs on exit to userspace could elide KVM_GET_SREGS when doing a save/restore. In practice, it's all moot, because AFAICT nothing uses KVM_SYNC_X86_SREGS. > So maybe it's best to move this to > kvm_arch_vcpu_ioctl_get_sregs() and kvm_arch_vcpu_ioctl()? > > The other option is to plumb a boolean that is only set to true in the > ioctl code path. > > > kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS); > > kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS); > > kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES); > > > > base-commit: 55671237401edd1ec59276b852b9361cc170915b > > -- ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-02-12 18:14 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-02-03 1:13 [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT Yosry Ahmed 2026-02-03 15:33 ` Yosry Ahmed 2026-02-03 16:12 ` Sean Christopherson 2026-02-03 16:51 ` Yosry Ahmed 2026-02-03 18:03 ` Sean Christopherson 2026-02-03 19:08 ` Yosry Ahmed 2026-02-03 19:42 ` Sean Christopherson 2026-02-03 19:54 ` Yosry Ahmed 2026-02-11 20:40 ` Yosry Ahmed 2026-02-12 18:14 ` Sean Christopherson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox