[PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
@ 2026-02-03  1:13 Yosry Ahmed
  2026-02-03 15:33 ` Yosry Ahmed
  2026-02-03 16:12 ` Sean Christopherson
  0 siblings, 2 replies; 10+ messages in thread
From: Yosry Ahmed @ 2026-02-03  1:13 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Yosry Ahmed

KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on
nested #VMEXIT. Use the value from vcpu->arch.cr2 instead.

The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN
of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are
only out-of-sync in two cases: after migration, and after a #PF is
injected into L2.

After migration, the value of CR2 in vmcb02 is uninitialized (i.e.
zero), as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using
vcpu->arch.cr2 to update vmcb12 is the right thing to do.

The #PF injection case is more nuanced. It occurs if KVM injects a #PF
into L2, then exits to L1 before it actually runs L2. Although the APM
is a bit unclear about when CR2 is written during a #PF, the SDM is more
clear:

	Processors update CR2 whenever a page fault is detected. If a
	second page fault occurs while an earlier page fault is being
	delivered, the faulting linear address of the second fault will
	overwrite the contents of CR2 (replacing the previous address).
	These updates to CR2 occur even if the page fault results in a
	double fault or occurs during the delivery of a double fault.

KVM injecting the exception surely counts as the #PF being "detected".
More importantly, when an exception is injected into L2 at the time of a
synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly,
such that an L1 hypervisor can re-inject the exception. If CR2 is not
written at that point, the L1 hypervisor have no way of correctly
re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right
thing to write in vmcb12 in this case.

Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for
L2, only when it is injected. The distinction is important, because only
injected exceptions are propagated to L1 through exit_int_info. It would
be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would
perceive an updated CR2 value with no #PF. Update the comment in
kvm_deliver_exception_payload() to clarify this.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 2 +-
 arch/x86/kvm/x86.c        | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd5..9031746ce2db1 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 	vmcb12->save.efer   = svm->vcpu.arch.efer;
 	vmcb12->save.cr0    = kvm_read_cr0(vcpu);
 	vmcb12->save.cr3    = kvm_read_cr3(vcpu);
-	vmcb12->save.cr2    = vmcb02->save.cr2;
+	vmcb12->save.cr2    = vcpu->arch.cr2;
 	vmcb12->save.cr4    = svm->vcpu.arch.cr4;
 	vmcb12->save.rflags = kvm_get_rflags(vcpu);
 	vmcb12->save.rip    = kvm_rip_read(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d94..1015522d0fbd7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
 		vcpu->arch.exception.error_code = error_code;
 		vcpu->arch.exception.has_payload = has_payload;
 		vcpu->arch.exception.payload = payload;
+		/*
+		 * Only injected exceptions are propagated to L1 in
+		 * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the
+		 * exception payload for L2 until the exception is injected.
+		 * Otherwise, L1 would perceive the updated payload without a
+		 * corresponding exception.
+		 */
 		if (!is_guest_mode(vcpu))
 			kvm_deliver_exception_payload(vcpu,
 						      &vcpu->arch.exception);
-- 
2.53.0.rc2.204.g2597b5adb4-goog

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-03  1:13 [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT Yosry Ahmed
@ 2026-02-03 15:33 ` Yosry Ahmed
  2026-02-03 16:12 ` Sean Christopherson
  1 sibling, 0 replies; 10+ messages in thread
From: Yosry Ahmed @ 2026-02-03 15:33 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Feb 03, 2026 at 01:13:20AM +0000, Yosry Ahmed wrote:
> KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on
> nested #VMEXIT. Use the value from vcpu->arch.cr2 instead.
> 
> The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN
> of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are
> only out-of-sync in two cases: after migration, and after a #PF is
> injected into L2.
> 
> After migration, the value of CR2 in vmcb02 is uninitialized (i.e.
> zero), as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using
> vcpu->arch.cr2 to update vmcb12 is the right thing to do.
> 
> The #PF injection case is more nuanced. It occurs if KVM injects a #PF
> into L2, then exits to L1 before it actually runs L2. Although the APM
> is a bit unclear about when CR2 is written during a #PF, the SDM is more
> clear:
> 
> 	Processors update CR2 whenever a page fault is detected. If a
> 	second page fault occurs while an earlier page fault is being
> 	delivered, the faulting linear address of the second fault will
> 	overwrite the contents of CR2 (replacing the previous address).
> 	These updates to CR2 occur even if the page fault results in a
> 	double fault or occurs during the delivery of a double fault.
> 
> KVM injecting the exception surely counts as the #PF being "detected".
> More importantly, when an exception is injected into L2 at the time of a
> synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly,
> such that an L1 hypervisor can re-inject the exception. If CR2 is not
> written at that point, the L1 hypervisor have no way of correctly
> re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right
> thing to write in vmcb12 in this case.
> 
> Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for
> L2, only when it is injected. The distinction is important, because only
> injected exceptions are propagated to L1 through exit_int_info. It would
> be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would
> perceive an updated CR2 value with no #PF. Update the comment in
> kvm_deliver_exception_payload() to clarify this.

I forgot the best part:

If a synthesized #VMEXIT to L1 writes the wrong CR2 (e.g. right after
migration), and L2 is handling a #PF, it could read a corrupted CR2.
This could manifest as segmentation faults in L2, or potentially data
corruption.

Cc: stable@vger.kernel.org

> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 2 +-
>  arch/x86/kvm/x86.c        | 7 +++++++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index de90b104a0dd5..9031746ce2db1 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>  	vmcb12->save.efer   = svm->vcpu.arch.efer;
>  	vmcb12->save.cr0    = kvm_read_cr0(vcpu);
>  	vmcb12->save.cr3    = kvm_read_cr3(vcpu);
> -	vmcb12->save.cr2    = vmcb02->save.cr2;
> +	vmcb12->save.cr2    = vcpu->arch.cr2;
>  	vmcb12->save.cr4    = svm->vcpu.arch.cr4;
>  	vmcb12->save.rflags = kvm_get_rflags(vcpu);
>  	vmcb12->save.rip    = kvm_rip_read(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index db3f393192d94..1015522d0fbd7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
>  		vcpu->arch.exception.error_code = error_code;
>  		vcpu->arch.exception.has_payload = has_payload;
>  		vcpu->arch.exception.payload = payload;
> +		/*
> +		 * Only injected exceptions are propagated to L1 in
> +		 * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the
> +		 * exception payload for L2 until the exception is injected.
> +		 * Otherwise, L1 would perceive the updated payload without a
> +		 * corresponding exception.
> +		 */
>  		if (!is_guest_mode(vcpu))
>  			kvm_deliver_exception_payload(vcpu,
>  						      &vcpu->arch.exception);
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-03  1:13 [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT Yosry Ahmed
  2026-02-03 15:33 ` Yosry Ahmed
@ 2026-02-03 16:12 ` Sean Christopherson
  2026-02-03 16:51   ` Yosry Ahmed
  1 sibling, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-02-03 16:12 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on
> nested #VMEXIT. Use the value from vcpu->arch.cr2 instead.
> 
> The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN
> of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are
> only out-of-sync in two cases: after migration, and after a #PF is

Nit, instead of "migration", talk about state save/restore, and then list live
migration as an example.  Most of the time it's fairly obvious that "migration"
in KVM means "live migration", but not always.  E.g. task migration often comes
into play, as does page migration.

And more importantly, the above statement is wrong when state is saved/restored
for something other than migration.  E.g. if userspace is restoring a snapshot
for reasons other than migrating a VM.

> injected into L2.
> 
> After migration, the value of CR2 in vmcb02 is uninitialized (i.e.

save/restore

> zero),

This isn't guaranteed.  E.g. if state is restored into a vCPU that was already
running, vmcb02.save.cr2 could hold any number of things.

> as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using
> vcpu->arch.cr2 to update vmcb12 is the right thing to do.
> 
> The #PF injection case is more nuanced. It occurs if KVM injects a #PF
> into L2, then exits to L1 before it actually runs L2. Although the APM
> is a bit unclear about when CR2 is written during a #PF, the SDM is more
> clear:
> 
> 	Processors update CR2 whenever a page fault is detected. If a
> 	second page fault occurs while an earlier page fault is being
> 	delivered, the faulting linear address of the second fault will
> 	overwrite the contents of CR2 (replacing the previous address).
> 	These updates to CR2 occur even if the page fault results in a
> 	double fault or occurs during the delivery of a double fault.
> 
> KVM injecting the exception surely counts as the #PF being "detected".

Heh, "detected" is definitely poor wording in the SDM.

> More importantly, when an exception is injected into L2 at the time of a
> synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly,
> such that an L1 hypervisor can re-inject the exception. If CR2 is not
> written at that point, the L1 hypervisor have no way of correctly
> re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right
> thing to write in vmcb12 in this case.
> 
> Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for
> L2, only when it is injected. The distinction is important, because only
> injected exceptions are propagated to L1 through exit_int_info. It would
> be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would
> perceive an updated CR2 value with no #PF. Update the comment in
> kvm_deliver_exception_payload() to clarify this.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 2 +-
>  arch/x86/kvm/x86.c        | 7 +++++++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index de90b104a0dd5..9031746ce2db1 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>  	vmcb12->save.efer   = svm->vcpu.arch.efer;
>  	vmcb12->save.cr0    = kvm_read_cr0(vcpu);
>  	vmcb12->save.cr3    = kvm_read_cr3(vcpu);
> -	vmcb12->save.cr2    = vmcb02->save.cr2;
> +	vmcb12->save.cr2    = vcpu->arch.cr2;
>  	vmcb12->save.cr4    = svm->vcpu.arch.cr4;
>  	vmcb12->save.rflags = kvm_get_rflags(vcpu);
>  	vmcb12->save.rip    = kvm_rip_read(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index db3f393192d94..1015522d0fbd7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
>  		vcpu->arch.exception.error_code = error_code;
>  		vcpu->arch.exception.has_payload = has_payload;
>  		vcpu->arch.exception.payload = payload;
> +		/*
> +		 * Only injected exceptions are propagated to L1 in
> +		 * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the

Nit, #VMEXIT is SVM specific terminology.  VM-Exit is more vendor agnostic.

> +		 * exception payload for L2 until the exception is injected.
> +		 * Otherwise, L1 would perceive the updated payload without a
> +		 * corresponding exception.

Huh.  I'm fairly certain this code is now at best unnecessary, and at worst
actively harmful.  Because the more architectural way to document this code is:

		/*
		 * If L2 is active, defer delivery of the payload until the
		 * exception is actually injected to avoid clobbering state if
		 * L1 wants to intercept the exception (the architectural state
		 * is NOT updated if the exeption is morphed to a VM-Exit).
		 */

But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending
VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a
VM-Exit.

Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that
commit.  Because invoking kvm_deliver_exception_payload() when the exception was
morphed to a VM-Exit is wrong.  Oh, wait, this is the !exception_payload_enabled
case.  So never mind, that's simply an unfixable bug, as the second comment alludes
to.

	/*
	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
	 * the only time there can be two queued exceptions is if there's a
	 * non-exiting _injected_ exception, and a pending exiting exception.
	 * In that case, ignore the VM-Exiting exception as it's an extension
	 * of the injected exception.
	 */
	if (vcpu->arch.exception_vmexit.pending &&
	    !vcpu->arch.exception.pending &&
	    !vcpu->arch.exception.injected)
		ex = &vcpu->arch.exception_vmexit;
	else
		ex = &vcpu->arch.exception;

	/*
	 * In guest mode, payload delivery should be deferred if the exception
	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
	 * propagate the payload and so it cannot be safely deferred.  Deliver
	 * the payload if the capability hasn't been requested.
	 */
	if (!vcpu->kvm->arch.exception_payload_enabled &&
	    ex->pending && ex->has_payload)
		kvm_deliver_exception_payload(vcpu, ex);

So yeah, I _think_ we could drop the is_guest_mode() check.  However, even better
would be to drop this call *entirely*, i.e.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0112c515584..00a39c95631d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
                vcpu->arch.exception.error_code = error_code;
                vcpu->arch.exception.has_payload = has_payload;
                vcpu->arch.exception.payload = payload;
-               if (!is_guest_mode(vcpu))
-                       kvm_deliver_exception_payload(vcpu,
-                                                     &vcpu->arch.exception);
                return;
        }
 
Because KVM really shouldn't update CR2 until the excpetion is actually injected
(or the state is at risk of being lost because exception_payload_enabled==false).
E.g. in the (extremely) unlikely (and probably impossible to configure reliably)
scenario that userspace deliberately drops a pending exception, arch state shouldn't
be updated.

> +		 */
>  		if (!is_guest_mode(vcpu))
>  			kvm_deliver_exception_payload(vcpu,
>  						      &vcpu->arch.exception);
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
> 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-03 16:12 ` Sean Christopherson
@ 2026-02-03 16:51   ` Yosry Ahmed
  2026-02-03 18:03     ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2026-02-03 16:51 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on
> > nested #VMEXIT. Use the value from vcpu->arch.cr2 instead.
> > 
> > The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN
> > of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are
> > only out-of-sync in two cases: after migration, and after a #PF is
> 
> Nit, instead of "migration", talk about state save/restore, and then list live
> migration as an example.  Most of the time it's fairly obvious that "migration"
> in KVM means "live migration", but not always.  E.g. task migration often comes
> into play, as does page migration.
> 
> And more importantly, the above statement is wrong when state is saved/restored
> for something other than migration.  E.g. if userspace is restoring a snapshot
> for reasons other than migrating a VM.

Makes sense, will fix.

> 
> > injected into L2.
> > 
> > After migration, the value of CR2 in vmcb02 is uninitialized (i.e.
> 
> save/restore
> 
> > zero),
> 
> This isn't guaranteed.  E.g. if state is restored into a vCPU that was already
> running, vmcb02.save.cr2 could hold any number of things.

Good point, I missed that case.

> 
> > as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using
> > vcpu->arch.cr2 to update vmcb12 is the right thing to do.
> > 
> > The #PF injection case is more nuanced. It occurs if KVM injects a #PF
> > into L2, then exits to L1 before it actually runs L2. Although the APM
> > is a bit unclear about when CR2 is written during a #PF, the SDM is more
> > clear:
> > 
> > 	Processors update CR2 whenever a page fault is detected. If a
> > 	second page fault occurs while an earlier page fault is being
> > 	delivered, the faulting linear address of the second fault will
> > 	overwrite the contents of CR2 (replacing the previous address).
> > 	These updates to CR2 occur even if the page fault results in a
> > 	double fault or occurs during the delivery of a double fault.
> > 
> > KVM injecting the exception surely counts as the #PF being "detected".
> 
> Heh, "detected" is definitely poor wording in the SDM.
> 
> > More importantly, when an exception is injected into L2 at the time of a
> > synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly,
> > such that an L1 hypervisor can re-inject the exception. If CR2 is not
> > written at that point, the L1 hypervisor have no way of correctly
> > re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right
> > thing to write in vmcb12 in this case.
> > 
> > Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for
> > L2, only when it is injected. The distinction is important, because only
> > injected exceptions are propagated to L1 through exit_int_info. It would
> > be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would
> > perceive an updated CR2 value with no #PF. Update the comment in
> > kvm_deliver_exception_payload() to clarify this.
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/nested.c | 2 +-
> >  arch/x86/kvm/x86.c        | 7 +++++++
> >  2 files changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index de90b104a0dd5..9031746ce2db1 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
> >  	vmcb12->save.efer   = svm->vcpu.arch.efer;
> >  	vmcb12->save.cr0    = kvm_read_cr0(vcpu);
> >  	vmcb12->save.cr3    = kvm_read_cr3(vcpu);
> > -	vmcb12->save.cr2    = vmcb02->save.cr2;
> > +	vmcb12->save.cr2    = vcpu->arch.cr2;
> >  	vmcb12->save.cr4    = svm->vcpu.arch.cr4;
> >  	vmcb12->save.rflags = kvm_get_rflags(vcpu);
> >  	vmcb12->save.rip    = kvm_rip_read(vcpu);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index db3f393192d94..1015522d0fbd7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
> >  		vcpu->arch.exception.error_code = error_code;
> >  		vcpu->arch.exception.has_payload = has_payload;
> >  		vcpu->arch.exception.payload = payload;
> > +		/*
> > +		 * Only injected exceptions are propagated to L1 in
> > +		 * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the
> 
> Nit, #VMEXIT is SVM specific terminology.  VM-Exit is more vendor agnostic.

Will do.

> 
> > +		 * exception payload for L2 until the exception is injected.
> > +		 * Otherwise, L1 would perceive the updated payload without a
> > +		 * corresponding exception.
> 
> Huh.  I'm fairly certain this code is now at best unnecessary, and at worst
> actively harmful.  Because the more architectural way to document this code is:
> 
> 		/*
> 		 * If L2 is active, defer delivery of the payload until the
> 		 * exception is actually injected to avoid clobbering state if
> 		 * L1 wants to intercept the exception (the architectural state
> 		 * is NOT updated if the exeption is morphed to a VM-Exit).
> 		 */

It's not only about exceptions being morphed to a VM-Exit though, is it?
KVM should not update the payload (e.g. CR2) if a #PF is pending but was
not injected, because from L1's perspective CR2 was updated but
exit_int_info won't reflect a #PF. Right?

> 
> But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending
> VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a
> VM-Exit.
> 
> Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that
> commit.  Because invoking kvm_deliver_exception_payload() when the exception was
> morphed to a VM-Exit is wrong.  Oh, wait, this is the !exception_payload_enabled
> case.  So never mind, that's simply an unfixable bug, as the second comment alludes
> to.

Hmm for the #PF case I think delivering the payload is always wrong if
it was morphed to a VM-Exit, regardless of exception_payload_enabled,
because the payload should never reach CR2, right?

Spoiler alert, there's another problem there. Even if the exception did
not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the
delivered payload is lost :/

> 
> 	/*
> 	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
> 	 * the only time there can be two queued exceptions is if there's a
> 	 * non-exiting _injected_ exception, and a pending exiting exception.
> 	 * In that case, ignore the VM-Exiting exception as it's an extension
> 	 * of the injected exception.
> 	 */
> 	if (vcpu->arch.exception_vmexit.pending &&
> 	    !vcpu->arch.exception.pending &&
> 	    !vcpu->arch.exception.injected)
> 		ex = &vcpu->arch.exception_vmexit;
> 	else
> 		ex = &vcpu->arch.exception;
> 
> 	/*
> 	 * In guest mode, payload delivery should be deferred if the exception
> 	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> 	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> 	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> 	 * propagate the payload and so it cannot be safely deferred.  Deliver
> 	 * the payload if the capability hasn't been requested.
> 	 */
> 	if (!vcpu->kvm->arch.exception_payload_enabled &&
> 	    ex->pending && ex->has_payload)
> 		kvm_deliver_exception_payload(vcpu, ex);
> 
> So yeah, I _think_ we could drop the is_guest_mode() check.  However, even better
> would be to drop this call *entirely*, i.e.

Hmm I don't think so, because as I mentioned above, KVM shouldn't update
CR2 until the exception is actually injected, right?

It would actually be great to drop the is_guest_mode() check here but
leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
and KVM_SET_SREGS goes away, and I *think* we can drop the
kvm_deliver_exception_payload() call in
kvm_vcpu_ioctl_x86_get_vcpu_events().

The only problem would be CR2 getting updated without a fault being
reflected in the vmcb12's exit_int_info AFAICT.

> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b0112c515584..00a39c95631d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
>                 vcpu->arch.exception.error_code = error_code;
>                 vcpu->arch.exception.has_payload = has_payload;
>                 vcpu->arch.exception.payload = payload;
> -               if (!is_guest_mode(vcpu))
> -                       kvm_deliver_exception_payload(vcpu,
> -                                                     &vcpu->arch.exception);
>                 return;
>         }
>  
> Because KVM really shouldn't update CR2 until the excpetion is actually injected
> (or the state is at risk of being lost because exception_payload_enabled==false).
> E.g. in the (extremely) unlikely (and probably impossible to configure reliably)
> scenario that userspace deliberately drops a pending exception, arch state shouldn't
> be updated.

I think if we drop it there might be a problem. With
exception_payload_enabled==false, pending exceptions becomes injected
after save/restore. If the payload is not delivered here, then after
restore we have an injected event with no payload.

I guess the kvm_deliver_exception_payload() call in
kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it
only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS.
Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT.

> 
> > +		 */
> >  		if (!is_guest_mode(vcpu))
> >  			kvm_deliver_exception_payload(vcpu,
> >  						      &vcpu->arch.exception);
> > -- 
> > 2.53.0.rc2.204.g2597b5adb4-goog
> > 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-03 16:51   ` Yosry Ahmed
@ 2026-02-03 18:03     ` Sean Christopherson
  2026-02-03 19:08       ` Yosry Ahmed
  2026-02-11 20:40       ` Yosry Ahmed
  0 siblings, 2 replies; 10+ messages in thread
From: Sean Christopherson @ 2026-02-03 18:03 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > 		/*
> > 		 * If L2 is active, defer delivery of the payload until the
> > 		 * exception is actually injected to avoid clobbering state if
> > 		 * L1 wants to intercept the exception (the architectural state
> > 		 * is NOT updated if the exeption is morphed to a VM-Exit).
> > 		 */
> 
> It's not only about exceptions being morphed to a VM-Exit though, is it?
> KVM should not update the payload (e.g. CR2) if a #PF is pending but was
> not injected, because from L1's perspective CR2 was updated but
> exit_int_info won't reflect a #PF. Right?

Right, but that's got nothing to do with L2 being active.  Take nested completely
out of the picture, and the above statement holds true as well.  "If a #PF is
pending but was not injected, then the guest shouldn't see a change in CR2".

> > But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending
> > VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a
> > VM-Exit.
> > 
> > Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that
> > commit.  Because invoking kvm_deliver_exception_payload() when the exception was
> > morphed to a VM-Exit is wrong.  Oh, wait, this is the !exception_payload_enabled
> > case.  So never mind, that's simply an unfixable bug, as the second comment alludes
> > to.
> 
> Hmm for the #PF case I think delivering the payload is always wrong if
> it was morphed to a VM-Exit, regardless of exception_payload_enabled,
> because the payload should never reach CR2, right?

Right.

> Spoiler alert, there's another problem there. Even if the exception did
> not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the
> delivered payload is lost :/
> 
> > 
> > 	/*
> > 	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
> > 	 * the only time there can be two queued exceptions is if there's a
> > 	 * non-exiting _injected_ exception, and a pending exiting exception.
> > 	 * In that case, ignore the VM-Exiting exception as it's an extension
> > 	 * of the injected exception.
> > 	 */
> > 	if (vcpu->arch.exception_vmexit.pending &&
> > 	    !vcpu->arch.exception.pending &&
> > 	    !vcpu->arch.exception.injected)
> > 		ex = &vcpu->arch.exception_vmexit;
> > 	else
> > 		ex = &vcpu->arch.exception;
> > 
> > 	/*
> > 	 * In guest mode, payload delivery should be deferred if the exception
> > 	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> > 	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> > 	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> > 	 * propagate the payload and so it cannot be safely deferred.  Deliver
> > 	 * the payload if the capability hasn't been requested.
> > 	 */
> > 	if (!vcpu->kvm->arch.exception_payload_enabled &&
> > 	    ex->pending && ex->has_payload)
> > 		kvm_deliver_exception_payload(vcpu, ex);
> > 
> > So yeah, I _think_ we could drop the is_guest_mode() check.  However, even better
> > would be to drop this call *entirely*, i.e.
> 
> Hmm I don't think so, because as I mentioned above, KVM shouldn't update
> CR2 until the exception is actually injected, right?

Me confused.  What you're saying is what I'm suggesting: don't update CR2 until
KVM actually stuffs the #PF into the VMCS/VMCB.  Oh, or are you talking about the
first sentence?  If so, strike that from the record, I was wrong (see below).

> It would actually be great to drop the is_guest_mode() check here but
> leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
> and KVM_SET_SREGS goes away, and I *think* we can drop the
> kvm_deliver_exception_payload() call in
> kvm_vcpu_ioctl_x86_get_vcpu_events().
>
> The only problem would be CR2 getting updated without a fault being
> reflected in the vmcb12's exit_int_info AFAICT.

No, that particular case is a non-issue, because the code immediately above has
already verified that KVM will *not* morph the #PF to a nested VM-Exit.  Note,
the "queue:" path is just for non-contributory exceptions and doesn't change the
VM-Exit change anyways.

	/*
	 * If the exception is destined for L2, morph it to a VM-Exit if L1
	 * wants to intercept the exception.
	 */
	if (is_guest_mode(vcpu) &&
	    kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {  <====
		kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
					   has_payload, payload);
		return;
	}

	if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
	queue:
		vcpu->arch.exception.pending = true;
		vcpu->arch.exception.injected = false;

		vcpu->arch.exception.has_error_code = has_error;
		vcpu->arch.exception.vector = nr;
		vcpu->arch.exception.error_code = error_code;
		vcpu->arch.exception.has_payload = has_payload;
		vcpu->arch.exception.payload = payload;
		if (!is_guest_mode(vcpu))
			kvm_deliver_exception_payload(vcpu,
						      &vcpu->arch.exception);
		return;
	}

> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index b0112c515584..00a39c95631d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
> >                 vcpu->arch.exception.error_code = error_code;
> >                 vcpu->arch.exception.has_payload = has_payload;
> >                 vcpu->arch.exception.payload = payload;
> > -               if (!is_guest_mode(vcpu))
> > -                       kvm_deliver_exception_payload(vcpu,
> > -                                                     &vcpu->arch.exception);
> >                 return;
> >         }
> >  
> > Because KVM really shouldn't update CR2 until the excpetion is actually injected
> > (or the state is at risk of being lost because exception_payload_enabled==false).
> > E.g. in the (extremely) unlikely (and probably impossible to configure reliably)
> > scenario that userspace deliberately drops a pending exception, arch state shouldn't
> > be updated.
> 
> I think if we drop it there might be a problem. With
> exception_payload_enabled==false, pending exceptions becomes injected
> after save/restore. If the payload is not delivered here, then after
> restore we have an injected event with no payload.
> 
> I guess the kvm_deliver_exception_payload() call in
> kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it
> only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS.
> Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT.

Drat, QEMU does KVM_GET_VCPU_EVENTS before KVM_GET_SREGS{,2}, so I was hopeful
we wouldn't actually break anyone, but Firecracker at least gets sregs before
events.  Of course Firecracker is already broken due to not enabling
KVM_CAP_EXCEPTION_PAYLOAD...

Ugh, and it's not just KVM_GET_SREGS, it's also KVM_GET_DEBUGREGS thanks to DR6.

If we're being _super_ pedantic, then delivering the payload anywhere but injection
or KVM_GET_VCPU_EVENTS is "wrong" (well, "more wrong", since any behavior without
KVM_CAP_EXCEPTION_PAYLOAD is inherently wrong).  E.g. very hypothetically, userspace
that saves/restores sregs but not vCPU events would see an unexpected CR2 change.

*sigh*

Ewwwwwwww.  And we *definitely* don't want to drop the is_guest_mode() check,
because nested_vmx_update_pending_dbg() relies on the payload to still be valid.

Ah, right, and that's also why commit a06230b62b89 ("KVM: x86: Deliver exception
payload on KVM_GET_VCPU_EVENTS") from MTF series deferred the update: KVM needs
to keep the #DB pending so that a VM-Exit that occurs before the #DB is injected
gets recorded in vmcs12.

So, with all of that in mind, I believe the best we can do is fully defer delivery
of the exception until it's actually injected, and then apply the quirk to the
relevant GET APIs.

---
 arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0112c515584..e000521dfc8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
 		vcpu->arch.exception.error_code = error_code;
 		vcpu->arch.exception.has_payload = has_payload;
 		vcpu->arch.exception.payload = payload;
-		if (!is_guest_mode(vcpu))
-			kvm_deliver_exception_payload(vcpu,
-						      &vcpu->arch.exception);
 		return;
 	}
 
@@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
-static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
-					       struct kvm_vcpu_events *events)
+static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu)
 {
-	struct kvm_queued_exception *ex;
-
-	process_nmi(vcpu);
-
-#ifdef CONFIG_KVM_SMM
-	if (kvm_check_request(KVM_REQ_SMI, vcpu))
-		process_smi(vcpu);
-#endif
-
 	/*
 	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
 	 * the only time there can be two queued exceptions is if there's a
@@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	if (vcpu->arch.exception_vmexit.pending &&
 	    !vcpu->arch.exception.pending &&
 	    !vcpu->arch.exception.injected)
-		ex = &vcpu->arch.exception_vmexit;
-	else
-		ex = &vcpu->arch.exception;
+		return &vcpu->arch.exception_vmexit;
+
+	return &vcpu->arch.exception;
+}
+
+static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu)
+{
+	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
 
 	/*
-	 * In guest mode, payload delivery should be deferred if the exception
-	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
-	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
-	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
-	 * propagate the payload and so it cannot be safely deferred.  Deliver
-	 * the payload if the capability hasn't been requested.
+	 * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver
+	 * the pending exception payload when userspace saves *any* vCPU state
+	 * that interacts with exception payloads to avoid breaking userspace.
+	 *
+	 * Architecturally, KVM must not deliver an exception payload until the
+	 * exception is actually injected, e.g. to avoid losing pending #DB
+	 * information (which VMX tracks in the VMCS), and to avoid clobbering
+	 * state if the exception is never injected for whatever reason.  But
+	 * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or
+	 * may not propagate the payload across save+restore, and so KVM can't
+	 * safely defer delivery of the payload.
 	 */
 	if (!vcpu->kvm->arch.exception_payload_enabled &&
 	    ex->pending && ex->has_payload)
 		kvm_deliver_exception_payload(vcpu, ex);
+}
+
+static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
+					       struct kvm_vcpu_events *events)
+{
+	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
+
+	process_nmi(vcpu);
+
+#ifdef CONFIG_KVM_SMM
+	if (kvm_check_request(KVM_REQ_SMI, vcpu))
+		process_smi(vcpu);
+#endif
+
+	kvm_handle_exception_payload_quirk(vcpu);
 
 	memset(events, 0, sizeof(*events));
 
@@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
 	    vcpu->arch.guest_state_protected)
 		return -EINVAL;
 
+	kvm_handle_exception_payload_quirk(vcpu);
+
 	memset(dbgregs, 0, sizeof(*dbgregs));
 
 	BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
@@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	if (vcpu->arch.guest_state_protected)
 		goto skip_protected_regs;
 
+	kvm_handle_exception_payload_quirk(vcpu);
+
 	kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
 	kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
 	kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);

base-commit: 55671237401edd1ec59276b852b9361cc170915b
--

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-03 18:03     ` Sean Christopherson
@ 2026-02-03 19:08       ` Yosry Ahmed
  2026-02-03 19:42         ` Sean Christopherson
  2026-02-11 20:40       ` Yosry Ahmed
  1 sibling, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2026-02-03 19:08 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Feb 03, 2026 at 10:03:35AM -0800, Sean Christopherson wrote:
> On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> > > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > 		/*
> > > 		 * If L2 is active, defer delivery of the payload until the
> > > 		 * exception is actually injected to avoid clobbering state if
> > > 		 * L1 wants to intercept the exception (the architectural state
> > > 		 * is NOT updated if the exeption is morphed to a VM-Exit).
> > > 		 */
> > 
> > It's not only about exceptions being morphed to a VM-Exit though, is it?
> > KVM should not update the payload (e.g. CR2) if a #PF is pending but was
> > not injected, because from L1's perspective CR2 was updated but
> > exit_int_info won't reflect a #PF. Right?
> 
> Right, but that's got nothing to do with L2 being active.  Take nested completely
> out of the picture, and the above statement holds true as well.  "If a #PF is
> pending but was not injected, then the guest shouldn't see a change in CR2".

Right, but it is still related to nested in a way. Ignore the exception
morphing to a VM-Exit, the case I am refering to is specifically
exit_int_info on SVM. IIUC, if there's an injected (but not intercepted)
exception when doing a nested VM-Exit, we have to propagate that to L1
(in nested_save_pending_event_to_vmcb12()), such that it can re-inject
that exception.

So what I was referring to is, if we write CR2 for a pending exception
to L2, and then exit to L1, L1 would perceive a chance in CR2 without an
ongoing #PF in exit_int_info. I believe the equivalent VMX function is
vmcs12_save_pending_event().

All that to say, we should not deliver the payload of an exception to L2
before it's actually injected.

> 
> > > But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending
> > > VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a
> > > VM-Exit.
> > > 
> > > Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that
> > > commit.  Because invoking kvm_deliver_exception_payload() when the exception was
> > > morphed to a VM-Exit is wrong.  Oh, wait, this is the !exception_payload_enabled
> > > case.  So never mind, that's simply an unfixable bug, as the second comment alludes
> > > to.
> > 
> > Hmm for the #PF case I think delivering the payload is always wrong if
> > it was morphed to a VM-Exit, regardless of exception_payload_enabled,
> > because the payload should never reach CR2, right?
> 
> Right.
> 
> > Spoiler alert, there's another problem there. Even if the exception did
> > not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the
> > delivered payload is lost :/
> > 
> > > 
> > > 	/*
> > > 	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
> > > 	 * the only time there can be two queued exceptions is if there's a
> > > 	 * non-exiting _injected_ exception, and a pending exiting exception.
> > > 	 * In that case, ignore the VM-Exiting exception as it's an extension
> > > 	 * of the injected exception.
> > > 	 */
> > > 	if (vcpu->arch.exception_vmexit.pending &&
> > > 	    !vcpu->arch.exception.pending &&
> > > 	    !vcpu->arch.exception.injected)
> > > 		ex = &vcpu->arch.exception_vmexit;
> > > 	else
> > > 		ex = &vcpu->arch.exception;
> > > 
> > > 	/*
> > > 	 * In guest mode, payload delivery should be deferred if the exception
> > > 	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> > > 	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> > > 	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> > > 	 * propagate the payload and so it cannot be safely deferred.  Deliver
> > > 	 * the payload if the capability hasn't been requested.
> > > 	 */
> > > 	if (!vcpu->kvm->arch.exception_payload_enabled &&
> > > 	    ex->pending && ex->has_payload)
> > > 		kvm_deliver_exception_payload(vcpu, ex);
> > > 
> > > So yeah, I _think_ we could drop the is_guest_mode() check.  However, even better
> > > would be to drop this call *entirely*, i.e.
> > 
> > Hmm I don't think so, because as I mentioned above, KVM shouldn't update
> > CR2 until the exception is actually injected, right?
> 
> Me confused.  What you're saying is what I'm suggesting: don't update CR2 until
> KVM actually stuffs the #PF into the VMCS/VMCB.  Oh, or are you talking about the
> first sentence?  If so, strike that from the record, I was wrong (see below).

Yup, first sentence. 

> 
> > It would actually be great to drop the is_guest_mode() check here but
> > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
> > and KVM_SET_SREGS goes away, and I *think* we can drop the
> > kvm_deliver_exception_payload() call in
> > kvm_vcpu_ioctl_x86_get_vcpu_events().
> >
> > The only problem would be CR2 getting updated without a fault being
> > reflected in the vmcb12's exit_int_info AFAICT.
> 
> No, that particular case is a non-issue, because the code immediately above has
> already verified that KVM will *not* morph the #PF to a nested VM-Exit.  Note,
> the "queue:" path is just for non-contributory exceptions and doesn't change the
> VM-Exit change anyways.

What I meant was not stuffing the #PF into the VMCB/VMCS because it's
intercepted, but the #PF being stuffed into exit_int_info or
idt_vectoring_info.

If we drop the guest mode check here, we could end up with CR2 updated
and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming
#PF is not intercepted).

> 
> 	/*
> 	 * If the exception is destined for L2, morph it to a VM-Exit if L1
> 	 * wants to intercept the exception.
> 	 */
> 	if (is_guest_mode(vcpu) &&
> 	    kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {  <====
> 		kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
> 					   has_payload, payload);
> 		return;
> 	}
> 
> 	if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
> 	queue:
> 		vcpu->arch.exception.pending = true;
> 		vcpu->arch.exception.injected = false;
> 
> 		vcpu->arch.exception.has_error_code = has_error;
> 		vcpu->arch.exception.vector = nr;
> 		vcpu->arch.exception.error_code = error_code;
> 		vcpu->arch.exception.has_payload = has_payload;
> 		vcpu->arch.exception.payload = payload;
> 		if (!is_guest_mode(vcpu))
> 			kvm_deliver_exception_payload(vcpu,
> 						      &vcpu->arch.exception);
> 		return;
> 	}
> 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index b0112c515584..00a39c95631d 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
> > >                 vcpu->arch.exception.error_code = error_code;
> > >                 vcpu->arch.exception.has_payload = has_payload;
> > >                 vcpu->arch.exception.payload = payload;
> > > -               if (!is_guest_mode(vcpu))
> > > -                       kvm_deliver_exception_payload(vcpu,
> > > -                                                     &vcpu->arch.exception);
> > >                 return;
> > >         }
> > >  
> > > Because KVM really shouldn't update CR2 until the excpetion is actually injected
> > > (or the state is at risk of being lost because exception_payload_enabled==false).
> > > E.g. in the (extremely) unlikely (and probably impossible to configure reliably)
> > > scenario that userspace deliberately drops a pending exception, arch state shouldn't
> > > be updated.
> > 
> > I think if we drop it there might be a problem. With
> > exception_payload_enabled==false, pending exceptions becomes injected
> > after save/restore. If the payload is not delivered here, then after
> > restore we have an injected event with no payload.
> > 
> > I guess the kvm_deliver_exception_payload() call in
> > kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it
> > only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS.
> > Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT.
> 
> Drat, QEMU does KVM_GET_VCPU_EVENTS before KVM_GET_SREGS{,2}, so I was hopeful
> we wouldn't actually break anyone, but Firecracker at least gets sregs before
> events.  Of course Firecracker is already broken due to not enabling
> KVM_CAP_EXCEPTION_PAYLOAD...
> 
> Ugh, and it's not just KVM_GET_SREGS, it's also KVM_GET_DEBUGREGS thanks to DR6.
> 
> If we're being _super_ pedantic, then delivering the payload anywhere but injection
> or KVM_GET_VCPU_EVENTS is "wrong" (well, "more wrong", since any behavior without
> KVM_CAP_EXCEPTION_PAYLOAD is inherently wrong).  E.g. very hypothetically, userspace
> that saves/restores sregs but not vCPU events would see an unexpected CR2 change.
> 
> *sigh*
> 
> Ewwwwwwww.  And we *definitely* don't want to drop the is_guest_mode() check,
> because nested_vmx_update_pending_dbg() relies on the payload to still be valid.
> 
> Ah, right, and that's also why commit a06230b62b89 ("KVM: x86: Deliver exception
> payload on KVM_GET_VCPU_EVENTS") from MTF series deferred the update: KVM needs
> to keep the #DB pending so that a VM-Exit that occurs before the #DB is injected
> gets recorded in vmcs12.
> 
> So, with all of that in mind, I believe the best we can do is fully defer delivery
> of the exception until it's actually injected, and then apply the quirk to the
> relevant GET APIs.

I think this should work. I can test it for the nested case, the way I
could reproduce the problem (with a VMM that does KVM_GET_SREGS before
KVM_GET_VCPU_EVENTS, but does not use KVM_CAP_EXCEPTION_PAYLOAD) is by
intercepting and re-injecting all #PFs from L2, and then repeatedly
doing save+restore while L2 is doing some heavy lifting (building GCC).
This generates a lot of #PF exceptions to be saved+restored, and we
eventually get a segfault because of corrupted CR2 in L2.

Removing the is_guest_mode() check in kvm_multiple_exception() fixes it
by prematurely delivering the payload when it's queued. I think your fix
will also work by prematurely delivering the payload at save time. This
is actually more corect because at restore time the exception will
become injected and treated as such (e.g. shows up in exit_int_info).

Do you intend to send a patch? Or should I send it out (separate from
the current one) with you as the author?

> 
> ---
>  arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++-----------------
>  1 file changed, 39 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b0112c515584..e000521dfc8b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
>  		vcpu->arch.exception.error_code = error_code;
>  		vcpu->arch.exception.has_payload = has_payload;
>  		vcpu->arch.exception.payload = payload;
> -		if (!is_guest_mode(vcpu))
> -			kvm_deliver_exception_payload(vcpu,
> -						      &vcpu->arch.exception);
>  		return;
>  	}
>  
> @@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
>  	return 0;
>  }
>  
> -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> -					       struct kvm_vcpu_events *events)
> +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu)
>  {
> -	struct kvm_queued_exception *ex;
> -
> -	process_nmi(vcpu);
> -
> -#ifdef CONFIG_KVM_SMM
> -	if (kvm_check_request(KVM_REQ_SMI, vcpu))
> -		process_smi(vcpu);
> -#endif
> -
>  	/*
>  	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
>  	 * the only time there can be two queued exceptions is if there's a
> @@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>  	if (vcpu->arch.exception_vmexit.pending &&
>  	    !vcpu->arch.exception.pending &&
>  	    !vcpu->arch.exception.injected)
> -		ex = &vcpu->arch.exception_vmexit;
> -	else
> -		ex = &vcpu->arch.exception;
> +		return &vcpu->arch.exception_vmexit;
> +
> +	return &vcpu->arch.exception;
> +}
> +
> +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
>  
>  	/*
> -	 * In guest mode, payload delivery should be deferred if the exception
> -	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> -	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> -	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> -	 * propagate the payload and so it cannot be safely deferred.  Deliver
> -	 * the payload if the capability hasn't been requested.
> +	 * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver
> +	 * the pending exception payload when userspace saves *any* vCPU state
> +	 * that interacts with exception payloads to avoid breaking userspace.
> +	 *
> +	 * Architecturally, KVM must not deliver an exception payload until the
> +	 * exception is actually injected, e.g. to avoid losing pending #DB
> +	 * information (which VMX tracks in the VMCS), and to avoid clobbering
> +	 * state if the exception is never injected for whatever reason.  But
> +	 * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or
> +	 * may not propagate the payload across save+restore, and so KVM can't
> +	 * safely defer delivery of the payload.
>  	 */
>  	if (!vcpu->kvm->arch.exception_payload_enabled &&
>  	    ex->pending && ex->has_payload)
>  		kvm_deliver_exception_payload(vcpu, ex);
> +}
> +
> +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> +					       struct kvm_vcpu_events *events)
> +{
> +	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
> +
> +	process_nmi(vcpu);
> +
> +#ifdef CONFIG_KVM_SMM
> +	if (kvm_check_request(KVM_REQ_SMI, vcpu))
> +		process_smi(vcpu);
> +#endif
> +
> +	kvm_handle_exception_payload_quirk(vcpu);
>  
>  	memset(events, 0, sizeof(*events));
>  
> @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
>  	    vcpu->arch.guest_state_protected)
>  		return -EINVAL;
>  
> +	kvm_handle_exception_payload_quirk(vcpu);
> +
>  	memset(dbgregs, 0, sizeof(*dbgregs));
>  
>  	BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
> @@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>  	if (vcpu->arch.guest_state_protected)
>  		goto skip_protected_regs;
>  
> +	kvm_handle_exception_payload_quirk(vcpu);
> +
>  	kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
>  	kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
>  	kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);
> 
> base-commit: 55671237401edd1ec59276b852b9361cc170915b
> --

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-03 19:08       ` Yosry Ahmed
@ 2026-02-03 19:42         ` Sean Christopherson
  2026-02-03 19:54           ` Yosry Ahmed
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-02-03 19:42 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> On Tue, Feb 03, 2026 at 10:03:35AM -0800, Sean Christopherson wrote:
> > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> > > > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > > 		/*
> > > > 		 * If L2 is active, defer delivery of the payload until the
> > > > 		 * exception is actually injected to avoid clobbering state if
> > > > 		 * L1 wants to intercept the exception (the architectural state
> > > > 		 * is NOT updated if the exeption is morphed to a VM-Exit).
> > > > 		 */
> > > 
> > > It's not only about exceptions being morphed to a VM-Exit though, is it?
> > > KVM should not update the payload (e.g. CR2) if a #PF is pending but was
> > > not injected, because from L1's perspective CR2 was updated but
> > > exit_int_info won't reflect a #PF. Right?
> > 
> > Right, but that's got nothing to do with L2 being active.  Take nested completely
> > out of the picture, and the above statement holds true as well.  "If a #PF is
> > pending but was not injected, then the guest shouldn't see a change in CR2".
> 
> Right, but it is still related to nested in a way. Ignore the exception
> morphing to a VM-Exit, the case I am refering to is specifically
> exit_int_info on SVM. IIUC, if there's an injected (but not intercepted)
> exception when doing a nested VM-Exit, we have to propagate that to L1
> (in nested_save_pending_event_to_vmcb12()), such that it can re-inject
> that exception.

Ugh, that's a poor choice of name for nested_save_pending_event_to_vmcb12().

As defined by kvm_queued_exception, that's not a *pending* event, it's an
*injected* event.  In that case, the payload *should* have been delivered (to CR2
or DR6) because that exception has already occurred (been "detected" in the SDM's
weird wording).  The VM-Exit is not happening *before* the #PF, it's happening
after the #PF is "detected", while the #PF is being vectored.

From a virtualization perspective, any other implementation is basically unworkable,
as it would require the host to gain control after an exception is successfully
vectored.  I.e. the absense of any mechanisms to support that effectively confirms
that the CPU writes CR2 before attempting to deliver the exception to software.

> So what I was referring to is, if we write CR2 for a pending exception
> to L2, and then exit to L1, L1 would perceive a chance in CR2 without an
> ongoing #PF in exit_int_info. I believe the equivalent VMX function is
> vmcs12_save_pending_event().

Also poorly named :-/

> All that to say, we should not deliver the payload of an exception to L2
> before it's actually injected.

As above, those helpers deal with exceptions that have already been injected by
KVM.

> > > It would actually be great to drop the is_guest_mode() check here but
> > > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
> > > and KVM_SET_SREGS goes away, and I *think* we can drop the
> > > kvm_deliver_exception_payload() call in
> > > kvm_vcpu_ioctl_x86_get_vcpu_events().
> > >
> > > The only problem would be CR2 getting updated without a fault being
> > > reflected in the vmcb12's exit_int_info AFAICT.
> > 
> > No, that particular case is a non-issue, because the code immediately above has
> > already verified that KVM will *not* morph the #PF to a nested VM-Exit.  Note,
> > the "queue:" path is just for non-contributory exceptions and doesn't change the
> > VM-Exit change anyways.
> 
> What I meant was not stuffing the #PF into the VMCB/VMCS because it's
> intercepted, but the #PF being stuffed into exit_int_info or
> idt_vectoring_info.
> 
> If we drop the guest mode check here, we could end up with CR2 updated
> and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming
> #PF is not intercepted).

No, because once {svm,vmx}_inject_exception() have been reach, KVM has fully
committed to delivering the exception to the guest.  If KVM cancels KVM_RUN, e.g.
because of a pending signal from userspace to initiate save/restore, KVM calls
kvm_x86_ops.cancel_injection() so that vendor code can move the to-be-injected
exception from the VMCS/VMCB back to vcpu->arch.exception.  Note that
kvm_requeue_exception() (a) sets injected=true and (b) deliberately doesn't
track any payload, because the payload has already been delivered.

If VM-Enter is executed and a non-nested VM-Exit occurs, then hardware saves the
in-progress exception in VMCB.exit_int_info/VMCS.idt_vectoring_info, and KVM
moves the exception back to vcpu->arch.exception via vmx_complete_interrupts()
and svm_complete_interrupts() (which are also used for cancelling injection,
because the logic is identical, only the VMCS/VMCB source differs).

For nested VM-Exit, KVM needs to emulate that behavior.  The exception has already
been "detected" by KVM, and the payload has already been delivered, but a VM-Exit
was encountered while vectoring the exception to software.

E.g. if a guest #PF occurs while the guest stack is at the bottom of a page, such
that the first N pushes will hit page X, and the last M pushes will hit page X-1,
and the write to page X-1 hits a #NPF / EPT Violation, then L1 will (and should!)
see an updated CR2, with the first N pushes to vector the exception resident in
page X.

> > So, with all of that in mind, I believe the best we can do is fully defer delivery
> > of the exception until it's actually injected, and then apply the quirk to the
> > relevant GET APIs.
> 
> I think this should work. I can test it for the nested case, the way I
> could reproduce the problem (with a VMM that does KVM_GET_SREGS before
> KVM_GET_VCPU_EVENTS, but does not use KVM_CAP_EXCEPTION_PAYLOAD) is by
> intercepting and re-injecting all #PFs from L2, and then repeatedly
> doing save+restore while L2 is doing some heavy lifting (building GCC).
> This generates a lot of #PF exceptions to be saved+restored, and we
> eventually get a segfault because of corrupted CR2 in L2.
> 
> Removing the is_guest_mode() check in kvm_multiple_exception() fixes it
> by prematurely delivering the payload when it's queued. I think your fix
> will also work by prematurely delivering the payload at save time. This
> is actually more corect because at restore time the exception will
> become injected and treated as such (e.g. shows up in exit_int_info).
> 
> Do you intend to send a patch? Or should I send it out (separate from
> the current one) with you as the author?

I'll send a patch for this, there's a lot of historical information I want to
capture.

Can you send a v2 for _this_ patch, without the comment change?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-03 19:42         ` Sean Christopherson
@ 2026-02-03 19:54           ` Yosry Ahmed
  0 siblings, 0 replies; 10+ messages in thread
From: Yosry Ahmed @ 2026-02-03 19:54 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Feb 03, 2026 at 11:42:01AM -0800, Sean Christopherson wrote:
> On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > On Tue, Feb 03, 2026 at 10:03:35AM -0800, Sean Christopherson wrote:
> > > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> > > > > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > > > 		/*
> > > > > 		 * If L2 is active, defer delivery of the payload until the
> > > > > 		 * exception is actually injected to avoid clobbering state if
> > > > > 		 * L1 wants to intercept the exception (the architectural state
> > > > > 		 * is NOT updated if the exeption is morphed to a VM-Exit).
> > > > > 		 */
> > > > 
> > > > It's not only about exceptions being morphed to a VM-Exit though, is it?
> > > > KVM should not update the payload (e.g. CR2) if a #PF is pending but was
> > > > not injected, because from L1's perspective CR2 was updated but
> > > > exit_int_info won't reflect a #PF. Right?
> > > 
> > > Right, but that's got nothing to do with L2 being active.  Take nested completely
> > > out of the picture, and the above statement holds true as well.  "If a #PF is
> > > pending but was not injected, then the guest shouldn't see a change in CR2".
> > 
> > Right, but it is still related to nested in a way. Ignore the exception
> > morphing to a VM-Exit, the case I am refering to is specifically
> > exit_int_info on SVM. IIUC, if there's an injected (but not intercepted)
> > exception when doing a nested VM-Exit, we have to propagate that to L1
> > (in nested_save_pending_event_to_vmcb12()), such that it can re-inject
> > that exception.
> 
> Ugh, that's a poor choice of name for nested_save_pending_event_to_vmcb12().
> 
> As defined by kvm_queued_exception, that's not a *pending* event, it's an
> *injected* event.  In that case, the payload *should* have been delivered (to CR2
> or DR6) because that exception has already occurred (been "detected" in the SDM's
> weird wording).  The VM-Exit is not happening *before* the #PF, it's happening
> after the #PF is "detected", while the #PF is being vectored.
> 
> From a virtualization perspective, any other implementation is basically unworkable,
> as it would require the host to gain control after an exception is successfully
> vectored.  I.e. the absense of any mechanisms to support that effectively confirms
> that the CPU writes CR2 before attempting to deliver the exception to software.
> 
> > So what I was referring to is, if we write CR2 for a pending exception
> > to L2, and then exit to L1, L1 would perceive a chance in CR2 without an
> > ongoing #PF in exit_int_info. I believe the equivalent VMX function is
> > vmcs12_save_pending_event().
> 
> Also poorly named :-/
> 
> > All that to say, we should not deliver the payload of an exception to L2
> > before it's actually injected.
> 
> As above, those helpers deal with exceptions that have already been injected by
> KVM.
> 
> > > > It would actually be great to drop the is_guest_mode() check here but
> > > > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
> > > > and KVM_SET_SREGS goes away, and I *think* we can drop the
> > > > kvm_deliver_exception_payload() call in
> > > > kvm_vcpu_ioctl_x86_get_vcpu_events().
> > > >
> > > > The only problem would be CR2 getting updated without a fault being
> > > > reflected in the vmcb12's exit_int_info AFAICT.
> > > 
> > > No, that particular case is a non-issue, because the code immediately above has
> > > already verified that KVM will *not* morph the #PF to a nested VM-Exit.  Note,
> > > the "queue:" path is just for non-contributory exceptions and doesn't change the
> > > VM-Exit change anyways.
> > 
> > What I meant was not stuffing the #PF into the VMCB/VMCS because it's
> > intercepted, but the #PF being stuffed into exit_int_info or
> > idt_vectoring_info.
> > 
> > If we drop the guest mode check here, we could end up with CR2 updated
> > and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming
> > #PF is not intercepted).
> 
> No, because once {svm,vmx}_inject_exception() have been reach, KVM has fully
> committed to delivering the exception to the guest.  If KVM cancels KVM_RUN, e.g.
> because of a pending signal from userspace to initiate save/restore, KVM calls
> kvm_x86_ops.cancel_injection() so that vendor code can move the to-be-injected
> exception from the VMCS/VMCB back to vcpu->arch.exception.  Note that
> kvm_requeue_exception() (a) sets injected=true and (b) deliberately doesn't
> track any payload, because the payload has already been delivered.
> 
> If VM-Enter is executed and a non-nested VM-Exit occurs, then hardware saves the
> in-progress exception in VMCB.exit_int_info/VMCS.idt_vectoring_info, and KVM
> moves the exception back to vcpu->arch.exception via vmx_complete_interrupts()
> and svm_complete_interrupts() (which are also used for cancelling injection,
> because the logic is identical, only the VMCS/VMCB source differs).
> 
> For nested VM-Exit, KVM needs to emulate that behavior.  The exception has already
> been "detected" by KVM, and the payload has already been delivered, but a VM-Exit
> was encountered while vectoring the exception to software.
> 
> E.g. if a guest #PF occurs while the guest stack is at the bottom of a page, such
> that the first N pushes will hit page X, and the last M pushes will hit page X-1,
> and the write to page X-1 hits a #NPF / EPT Violation, then L1 will (and should!)
> see an updated CR2, with the first N pushes to vector the exception resident in
> page X.

We are not disagreeing, I think I may not have been clear. I agree with
all what you said.

What I was saying is basically delivering the payload for *pending*
exceptions for L2 is wrong (i.e. just removing the is_guest_mode() check
before calling kvm_deliver_exception_payload() in
kvm_multiple_exception()).

Exactly because of everything you said, the exception is not yet
*injected* and is *not* reflected to L1 in
exit_int_info/idt_vectoring_info if it's injection fails.

i.e what I said above:

	 If we drop the guest mode check here, we could end up with CR2
	 updated and a #PF not reflected in
	 exit_int_info/idt_vectoring_info (assuming #PF is not
	 intercepted).

We already established that removing the is_guest_mode() would be wrong,
so all is good here, I am just clarifying that what I meant initially is
that it would be wrong because we would update CR2 without actually
reflecting the #PF in exit_int_info/idt_vectoring_info, not because we
could update CR2 before a #PF intercept (as you mentioned, this part is
handled).

> 
> > > So, with all of that in mind, I believe the best we can do is fully defer delivery
> > > of the exception until it's actually injected, and then apply the quirk to the
> > > relevant GET APIs.
> > 
> > I think this should work. I can test it for the nested case, the way I
> > could reproduce the problem (with a VMM that does KVM_GET_SREGS before
> > KVM_GET_VCPU_EVENTS, but does not use KVM_CAP_EXCEPTION_PAYLOAD) is by
> > intercepting and re-injecting all #PFs from L2, and then repeatedly
> > doing save+restore while L2 is doing some heavy lifting (building GCC).
> > This generates a lot of #PF exceptions to be saved+restored, and we
> > eventually get a segfault because of corrupted CR2 in L2.
> > 
> > Removing the is_guest_mode() check in kvm_multiple_exception() fixes it
> > by prematurely delivering the payload when it's queued. I think your fix
> > will also work by prematurely delivering the payload at save time. This
> > is actually more corect because at restore time the exception will
> > become injected and treated as such (e.g. shows up in exit_int_info).
> > 
> > Do you intend to send a patch? Or should I send it out (separate from
> > the current one) with you as the author?
> 
> I'll send a patch for this, there's a lot of historical information I want to
> capture.

Sounds good, I will test that it fixes the problem of losing the payload
of a nested exception when KVM_GET_SREGS preceeds KVM_GET_VCPU_EVENTS
(without KVM_CAP_EXCEPTION_PAYLOAD). Please CC me when you send it.

> 
> Can you send a v2 for _this_ patch, without the comment change?

Will do so shortly.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-03 18:03     ` Sean Christopherson
  2026-02-03 19:08       ` Yosry Ahmed
@ 2026-02-11 20:40       ` Yosry Ahmed
  2026-02-12 18:14         ` Sean Christopherson
  1 sibling, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2026-02-11 20:40 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

 
> So, with all of that in mind, I believe the best we can do is fully defer delivery
> of the exception until it's actually injected, and then apply the quirk to the
> relevant GET APIs.
> 
> ---
>  arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++-----------------
>  1 file changed, 39 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b0112c515584..e000521dfc8b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
>  		vcpu->arch.exception.error_code = error_code;
>  		vcpu->arch.exception.has_payload = has_payload;
>  		vcpu->arch.exception.payload = payload;
> -		if (!is_guest_mode(vcpu))
> -			kvm_deliver_exception_payload(vcpu,
> -						      &vcpu->arch.exception);
>  		return;
>  	}
>  
> @@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
>  	return 0;
>  }
>  
> -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> -					       struct kvm_vcpu_events *events)
> +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu)
>  {
> -	struct kvm_queued_exception *ex;
> -
> -	process_nmi(vcpu);
> -
> -#ifdef CONFIG_KVM_SMM
> -	if (kvm_check_request(KVM_REQ_SMI, vcpu))
> -		process_smi(vcpu);
> -#endif
> -
>  	/*
>  	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
>  	 * the only time there can be two queued exceptions is if there's a
> @@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>  	if (vcpu->arch.exception_vmexit.pending &&
>  	    !vcpu->arch.exception.pending &&
>  	    !vcpu->arch.exception.injected)
> -		ex = &vcpu->arch.exception_vmexit;
> -	else
> -		ex = &vcpu->arch.exception;
> +		return &vcpu->arch.exception_vmexit;
> +
> +	return &vcpu->arch.exception;
> +}
> +
> +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
>  
>  	/*
> -	 * In guest mode, payload delivery should be deferred if the exception
> -	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> -	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> -	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> -	 * propagate the payload and so it cannot be safely deferred.  Deliver
> -	 * the payload if the capability hasn't been requested.
> +	 * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver
> +	 * the pending exception payload when userspace saves *any* vCPU state
> +	 * that interacts with exception payloads to avoid breaking userspace.
> +	 *
> +	 * Architecturally, KVM must not deliver an exception payload until the
> +	 * exception is actually injected, e.g. to avoid losing pending #DB
> +	 * information (which VMX tracks in the VMCS), and to avoid clobbering
> +	 * state if the exception is never injected for whatever reason.  But
> +	 * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or
> +	 * may not propagate the payload across save+restore, and so KVM can't
> +	 * safely defer delivery of the payload.
>  	 */
>  	if (!vcpu->kvm->arch.exception_payload_enabled &&
>  	    ex->pending && ex->has_payload)
>  		kvm_deliver_exception_payload(vcpu, ex);
> +}
> +
> +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> +					       struct kvm_vcpu_events *events)
> +{
> +	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
> +
> +	process_nmi(vcpu);
> +
> +#ifdef CONFIG_KVM_SMM
> +	if (kvm_check_request(KVM_REQ_SMI, vcpu))
> +		process_smi(vcpu);
> +#endif
> +
> +	kvm_handle_exception_payload_quirk(vcpu);
>  
>  	memset(events, 0, sizeof(*events));
>  
> @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
>  	    vcpu->arch.guest_state_protected)
>  		return -EINVAL;
>  
> +	kvm_handle_exception_payload_quirk(vcpu);
> +
>  	memset(dbgregs, 0, sizeof(*dbgregs));
>  
>  	BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
> @@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>  	if (vcpu->arch.guest_state_protected)
>  		goto skip_protected_regs;
>  
> +	kvm_handle_exception_payload_quirk(vcpu);
> +

Hmm looking at this again, I realized it also affects the code path from
store_regs(), I think we don't want to prematurely deliver exception
payloads in that path. So maybe it's best to move this to
kvm_arch_vcpu_ioctl_get_sregs() and kvm_arch_vcpu_ioctl()?

The other option is to plumb a boolean that is only set to true in the
ioctl code path.

>  	kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
>  	kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
>  	kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);
> 
> base-commit: 55671237401edd1ec59276b852b9361cc170915b
> --

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
  2026-02-11 20:40       ` Yosry Ahmed
@ 2026-02-12 18:14         ` Sean Christopherson
  0 siblings, 0 replies; 10+ messages in thread
From: Sean Christopherson @ 2026-02-12 18:14 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel

On Wed, Feb 11, 2026, Yosry Ahmed wrote:
>  
> > So, with all of that in mind, I believe the best we can do is fully defer delivery
> > of the exception until it's actually injected, and then apply the quirk to the
> > relevant GET APIs.
> > @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
> >  	    vcpu->arch.guest_state_protected)
> >  		return -EINVAL;
> >  
> > +	kvm_handle_exception_payload_quirk(vcpu);
> > +
> >  	memset(dbgregs, 0, sizeof(*dbgregs));
> >  
> >  	BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
> > @@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
> >  	if (vcpu->arch.guest_state_protected)
> >  		goto skip_protected_regs;
> >  
> > +	kvm_handle_exception_payload_quirk(vcpu);
> > +
> 
> Hmm looking at this again, I realized it also affects the code path from
> store_regs(), I think we don't want to prematurely deliver exception
> payloads in that path.

Hrm, I actually think delivering the payload in store_regs() is the least awful
option.  E.g. a VMM that saves sregs on exit to userspace could elide KVM_GET_SREGS
when doing a save/restore.

In practice, it's all moot, because AFAICT nothing uses KVM_SYNC_X86_SREGS.

> So maybe it's best to move this to
> kvm_arch_vcpu_ioctl_get_sregs() and kvm_arch_vcpu_ioctl()?
> 
> The other option is to plumb a boolean that is only set to true in the
> ioctl code path.
> 
> >  	kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
> >  	kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
> >  	kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);
> > 
> > base-commit: 55671237401edd1ec59276b852b9361cc170915b
> > --

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-02-12 18:14 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-03  1:13 [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT Yosry Ahmed
2026-02-03 15:33 ` Yosry Ahmed
2026-02-03 16:12 ` Sean Christopherson
2026-02-03 16:51   ` Yosry Ahmed
2026-02-03 18:03     ` Sean Christopherson
2026-02-03 19:08       ` Yosry Ahmed
2026-02-03 19:42         ` Sean Christopherson
2026-02-03 19:54           ` Yosry Ahmed
2026-02-11 20:40       ` Yosry Ahmed
2026-02-12 18:14         ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox