public inbox for stable@vger.kernel.org
* [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
       [not found] <20260210005449.3125133-1-yosry.ahmed@linux.dev>
@ 2026-02-10  0:54 ` Yosry Ahmed
  2026-02-10  1:49   ` Sean Christopherson
  2026-02-10  0:54 ` [PATCH 2/4] KVM: nSVM: Sync int_state " Yosry Ahmed
  1 sibling, 1 reply; 15+ messages in thread
From: Yosry Ahmed @ 2026-02-10  0:54 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Yosry Ahmed, stable

After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs
fields written by the CPU from vmcb02 to the cached vmcb12. This is
because the cached vmcb12 is used as the authoritative copy of some of
the controls, and is the payload when saving/restoring nested state.

next_rip is also written by the CPU (in some cases) after VMRUN, but is
not sync'd to cached vmcb12. As a result, it is corrupted after
save/restore (replaced by the original value written by L1 on nested
VMRUN). This could cause problems for both KVM (e.g. when injecting a
soft IRQ) or L1 (e.g. when using next_rip to advance RIP after emulating
an instruction).

Fix this by sync'ing next_rip in nested_sync_control_from_vmcb02(). Move
the call to nested_sync_control_from_vmcb02() (and the entire
is_guest_mode() block) after svm_complete_interrupts(), as it may update
next_rip in vmcb02.

Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
CC: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c |  6 ++++--
 arch/x86/kvm/svm/svm.c    | 26 +++++++++++++++-----------
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd..70086ba6497f 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -519,8 +519,10 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
 void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
 {
 	u32 mask;
-	svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
-	svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
+
+	svm->nested.ctl.event_inj	= svm->vmcb->control.event_inj;
+	svm->nested.ctl.event_inj_err	= svm->vmcb->control.event_inj_err;
+	svm->nested.ctl.next_rip	= svm->vmcb->control.next_rip;
 
 	/* Only a few fields of int_ctl are written by the processor.  */
 	mask = V_IRQ_MASK | V_TPR_MASK;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5f0136dbdde6..6d8d4d19455e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4399,17 +4399,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	sync_cr8_to_lapic(vcpu);
 
 	svm->next_rip = 0;
-	if (is_guest_mode(vcpu)) {
-		nested_sync_control_from_vmcb02(svm);
-
-		/* Track VMRUNs that have made past consistency checking */
-		if (svm->nested.nested_run_pending &&
-		    !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
-                        ++vcpu->stat.nested_run;
-
-		svm->nested.nested_run_pending = 0;
-	}
-
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
 
 	/*
@@ -4435,6 +4424,21 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 
 	svm_complete_interrupts(vcpu);
 
+	/*
+	 * svm_complete_interrupts() may update svm->vmcb->control.next_rip,
+	 * which is sync'd by nested_sync_control_from_vmcb02() below.
+	 */
+	if (is_guest_mode(vcpu)) {
+		nested_sync_control_from_vmcb02(svm);
+
+		/* Track VMRUNs that have made past consistency checking */
+		if (svm->nested.nested_run_pending &&
+		    !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
+			++vcpu->stat.nested_run;
+
+		svm->nested.nested_run_pending = 0;
+	}
+
 	return svm_exit_handlers_fastpath(vcpu);
 }
 

base-commit: e944fe2c09f405a2e2d147145c9b470084bc4c9a
-- 
2.53.0.rc2.204.g2597b5adb4-goog



* [PATCH 2/4] KVM: nSVM: Sync int_state to cached vmcb12 after VMRUN of L2
       [not found] <20260210005449.3125133-1-yosry.ahmed@linux.dev>
  2026-02-10  0:54 ` [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2 Yosry Ahmed
@ 2026-02-10  0:54 ` Yosry Ahmed
  1 sibling, 0 replies; 15+ messages in thread
From: Yosry Ahmed @ 2026-02-10  0:54 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Yosry Ahmed, stable

After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs
fields written by the CPU from vmcb02 to the cached vmcb12. This is
because the cached vmcb12 is used as the authoritative copy of some of
the controls, and is the payload when saving/restoring nested state.

int_state is also written by the CPU, specifically bit 0 (i.e.
SVM_INTERRUPT_SHADOW_MASK) for nested VMs, but it is not sync'd to
cached vmcb12. This does not cause a problem if KVM_SET_NESTED_STATE
precedes KVM_SET_VCPU_EVENTS in the restore path, as an interrupt shadow
would be correctly restored to vmcb02 (KVM_SET_VCPU_EVENTS overwrites
what KVM_SET_NESTED_STATE restored in int_state).

However, if KVM_SET_VCPU_EVENTS precedes KVM_SET_NESTED_STATE, an
interrupt shadow would be restored into vmcb01 instead of vmcb02. This
would mostly be benign for L1 (delays an interrupt), but not for L2. For
L2, the vCPU could hang (e.g. if a wakeup interrupt is delivered before
a HLT that should have been in an interrupt shadow).

Sync int_state to the cached vmcb12 in nested_sync_control_from_vmcb02()
to avoid this problem. With that, KVM_SET_NESTED_STATE restores the
correct interrupt shadow state, and if KVM_SET_VCPU_EVENTS follows it
would overwrite it with the same value.

Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
CC: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 70086ba6497f..ff24a5748c7d 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -523,6 +523,7 @@ void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
 	svm->nested.ctl.event_inj	= svm->vmcb->control.event_inj;
 	svm->nested.ctl.event_inj_err	= svm->vmcb->control.event_inj_err;
 	svm->nested.ctl.next_rip	= svm->vmcb->control.next_rip;
+	svm->nested.ctl.int_state	= svm->vmcb->control.int_state;
 
 	/* Only a few fields of int_ctl are written by the processor.  */
 	mask = V_IRQ_MASK | V_TPR_MASK;
-- 
2.53.0.rc2.204.g2597b5adb4-goog



* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-10  0:54 ` [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2 Yosry Ahmed
@ 2026-02-10  1:49   ` Sean Christopherson
  2026-02-10 16:25     ` Yosry Ahmed
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-02-10  1:49 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

On Tue, Feb 10, 2026, Yosry Ahmed wrote:
> After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs
> fields written by the CPU from vmcb02 to the cached vmcb12. This is
> because the cached vmcb12 is used as the authoritative copy of some of
> the controls, and is the payload when saving/restoring nested state.
> 
> next_rip is also written by the CPU (in some cases) after VMRUN, but is
> not sync'd to cached vmcb12. As a result, it is corrupted after
> save/restore (replaced by the original value written by L1 on nested
> VMRUN). This could cause problems for both KVM (e.g. when injecting a
> soft IRQ) or L1 (e.g. when using next_rip to advance RIP after emulating
> an instruction).
> 
> Fix this by sync'ing next_rip in nested_sync_control_from_vmcb02(). Move
> the call to nested_sync_control_from_vmcb02() (and the entire
> is_guest_mode() block) after svm_complete_interrupts(), as it may update
> next_rip in vmcb02.

I'll give you one guess as to what I would say about bundling changes.  AFAICT,
there is _zero_ reason to move the call to nested_sync_control_from_vmcb02() in a
patch tagged for stable@.

> Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
> CC: stable@vger.kernel.org
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c |  6 ++++--
>  arch/x86/kvm/svm/svm.c    | 26 +++++++++++++++-----------
>  2 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index de90b104a0dd..70086ba6497f 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -519,8 +519,10 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
>  void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
>  {
>  	u32 mask;
> -	svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
> -	svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
> +
> +	svm->nested.ctl.event_inj	= svm->vmcb->control.event_inj;
> +	svm->nested.ctl.event_inj_err	= svm->vmcb->control.event_inj_err;
> +	svm->nested.ctl.next_rip	= svm->vmcb->control.next_rip;

This is all a mess (the existing code).  nested_svm_vmexit() does this:

	vmcb12->control.int_state         = vmcb02->control.int_state;
	vmcb12->control.exit_code         = vmcb02->control.exit_code;
	vmcb12->control.exit_info_1       = vmcb02->control.exit_info_1;
	vmcb12->control.exit_info_2       = vmcb02->control.exit_info_2;

	if (!svm_is_vmrun_failure(vmcb12->control.exit_code))
		nested_save_pending_event_to_vmcb12(svm, vmcb12);

	if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
		vmcb12->control.next_rip  = vmcb02->control.next_rip;

	vmcb12->control.int_ctl           = svm->nested.ctl.int_ctl;
	vmcb12->control.event_inj         = svm->nested.ctl.event_inj;
	vmcb12->control.event_inj_err     = svm->nested.ctl.event_inj_err;

but then svm_get_nested_state(), by way of nested_copy_vmcb_cache_to_control(),
pulls everything from the cached fields.  Which probably only works because the
only fields that are pulled from vmcb02 by nested_svm_vmexit() are never modified
by the CPU.

Actually, I take that back, I have no idea how this code works.  How does e.g.
exit_info_1 not get clobbered on save/restore?

In other words, AFAICT, nested.ctl.int_ctl is special in that KVM needs it to be
up-to-date at all times, *and* it needs to be copied back to vmcb12 (or userspace).

Part of me wants to remove these two fields entirely:

	/* cache for control fields of the guest */
	struct vmcb_ctrl_area_cached ctl;

	/*
	 * Note: this struct is not kept up-to-date while L2 runs; it is only
	 * valid within nested_svm_vmrun.
	 */
	struct vmcb_save_area_cached save;

and instead use "full" caches only for the duration of nested_svm_vmrun().  Or
hell, just copy the entire vmcb12 and throw the cached structures in the garbage.
But that'll probably end in a game of whack-a-mole as things get moved back in.

So rather than do something totally drastic, I think we should kill
nested_copy_vmcb_cache_to_control() and replace it with a "save control" flow.
And then have it share as much code as possible with nested_svm_vmexit(),
and fixup nested_svm_vmexit() to not pull from svm->nested.ctl unnecessarily.
Which, again AFAICT, is pretty much limited to int_ctl: either vmcb02 is
authoritative, or KVM shouldn't be updating vmcb12, and so only the "save control"
for KVM_GET_NESTED_STATE needs to copy from the cache to the migrated vmcb12.

That'll probably end up a bit fat for a stable@ patch, so we could do a gross
one-off fix for this issue, and then do cleanups on top.

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5f0136dbdde6..cd5664c65a00 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4435,6 +4435,16 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 
        svm_complete_interrupts(vcpu);
 
+       /*
+        * Update the cache after completing interrupts to get an accurate
+        * NextRIP, e.g. when re-injecting a soft interrupt.
+        *
+        * FIXME: Rework svm_get_nested_state() to not pull data from the
+        *        cache (except for maybe int_ctl).
+        */
+       if (is_guest_mode(vcpu))
+               svm->nested.ctl.next_rip = svm->vmcb->control.next_rip;
+
        return svm_exit_handlers_fastpath(vcpu);
 }
 
>  	/* Only a few fields of int_ctl are written by the processor.  */
>  	mask = V_IRQ_MASK | V_TPR_MASK;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 5f0136dbdde6..6d8d4d19455e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4399,17 +4399,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
>  	sync_cr8_to_lapic(vcpu);
>  
>  	svm->next_rip = 0;
> -	if (is_guest_mode(vcpu)) {
> -		nested_sync_control_from_vmcb02(svm);
> -
> -		/* Track VMRUNs that have made past consistency checking */
> -		if (svm->nested.nested_run_pending &&
> -		    !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
> -                        ++vcpu->stat.nested_run;
> -
> -		svm->nested.nested_run_pending = 0;
> -	}
> -
>  	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
>  
>  	/*
> @@ -4435,6 +4424,21 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
>  
>  	svm_complete_interrupts(vcpu);
>  
> +	/*
> +	 * svm_complete_interrupts() may update svm->vmcb->control.next_rip,
> +	 * which is sync'd by nested_sync_control_from_vmcb02() below.

Please try to avoid referencing functions and fields in comments.  History has
shown that they almost always become stale.

> +	 */
> +	if (is_guest_mode(vcpu)) {
> +		nested_sync_control_from_vmcb02(svm);
> +
> +		/* Track VMRUNs that have made past consistency checking */
> +		if (svm->nested.nested_run_pending &&
> +		    !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
> +			++vcpu->stat.nested_run;
> +
> +		svm->nested.nested_run_pending = 0;
> +	}
> +
>  	return svm_exit_handlers_fastpath(vcpu);
>  }
>  
> 
> base-commit: e944fe2c09f405a2e2d147145c9b470084bc4c9a
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
> 


* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-10  1:49   ` Sean Christopherson
@ 2026-02-10 16:25     ` Yosry Ahmed
  2026-02-10 19:20       ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Yosry Ahmed @ 2026-02-10 16:25 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

On Mon, Feb 09, 2026 at 05:49:06PM -0800, Sean Christopherson wrote:
> On Tue, Feb 10, 2026, Yosry Ahmed wrote:
> > After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs
> > fields written by the CPU from vmcb02 to the cached vmcb12. This is
> > because the cached vmcb12 is used as the authoritative copy of some of
> > the controls, and is the payload when saving/restoring nested state.
> > 
> > next_rip is also written by the CPU (in some cases) after VMRUN, but is
> > not sync'd to cached vmcb12. As a result, it is corrupted after
> > save/restore (replaced by the original value written by L1 on nested
> > VMRUN). This could cause problems for both KVM (e.g. when injecting a
> > soft IRQ) or L1 (e.g. when using next_rip to advance RIP after emulating
> > an instruction).
> > 
> > Fix this by sync'ing next_rip in nested_sync_control_from_vmcb02(). Move
> > the call to nested_sync_control_from_vmcb02() (and the entire
> > is_guest_mode() block) after svm_complete_interrupts(), as it may update
> > next_rip in vmcb02.
> 
> I'll give you one guess as to what I would say about bundling changes.  AFAICT,
> there is _zero_ reason to move the call to nested_sync_control_from_vmcb02() in a
> patch tagged for stable@.

I generally agree with your previous feedback about combining changes,
but I think I disagree for this specific instance. I did actually have
two separate changes: one to move the call to
nested_sync_control_from_vmcb02() (still tagged for stable@), and one to
add next_rip.

However, I found myself explaining a lot of the next_rip context in the
commit log of moving nested_sync_control_from_vmcb02(), to explain why
it specifically needed to go after svm_complete_interrupts(). Also, I
had to add the comment above the call to
nested_sync_control_from_vmcb02() in the patch adding next_rip to it.

Looking at both patches, it made more sense to combine them given their
tight connection and simplicity. The history is clearer when the move,
comment, and next_rip addition are bundled.

Or..

Did you mean to have a patch that just copied next_rip outside of
nested_sync_control_from_vmcb02(), after svm_complete_interrupts(), for
stable@, and then clean it up on top? Eh, not a big fan of that either
because the current patch is simple enough for stable@ imo.

> 
> > Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
> > CC: stable@vger.kernel.org
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/nested.c |  6 ++++--
> >  arch/x86/kvm/svm/svm.c    | 26 +++++++++++++++-----------
> >  2 files changed, 19 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index de90b104a0dd..70086ba6497f 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -519,8 +519,10 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
> >  void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
> >  {
> >  	u32 mask;
> > -	svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
> > -	svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
> > +
> > +	svm->nested.ctl.event_inj	= svm->vmcb->control.event_inj;
> > +	svm->nested.ctl.event_inj_err	= svm->vmcb->control.event_inj_err;
> > +	svm->nested.ctl.next_rip	= svm->vmcb->control.next_rip;
> 
> This is all a mess (the existing code).  nested_svm_vmexit() does this:
> 
> 	vmcb12->control.int_state         = vmcb02->control.int_state;
> 	vmcb12->control.exit_code         = vmcb02->control.exit_code;
> 	vmcb12->control.exit_info_1       = vmcb02->control.exit_info_1;
> 	vmcb12->control.exit_info_2       = vmcb02->control.exit_info_2;
> 
> 	if (!svm_is_vmrun_failure(vmcb12->control.exit_code))
> 		nested_save_pending_event_to_vmcb12(svm, vmcb12);
> 
> 	if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> 		vmcb12->control.next_rip  = vmcb02->control.next_rip;
> 
> 	vmcb12->control.int_ctl           = svm->nested.ctl.int_ctl;
> 	vmcb12->control.event_inj         = svm->nested.ctl.event_inj;
> 	vmcb12->control.event_inj_err     = svm->nested.ctl.event_inj_err;
> 
> but then svm_get_nested_state(), by way of nested_copy_vmcb_cache_to_control(),
> pulls everything from the cached fields.  Which probably only works because the
> only fields that are pulled from vmcb02 nested_svm_vmexit() are never modified
> by the CPU.

Yeah I think that's the key. The main distinction is whether fields
are "in", "out", or "in/out". I wish those were more clearly
separated by the APM. More below.

> 
> Actually, I take that back, I have no idea how this code works.  How does e.g.
> exit_info_1 not get clobbered on save/restore?

I *think* KVM always sets the error_code and exit_info_* fields before
synthesizing a #VMEXIT to L1, usually right before calling
nested_svm_vmexit(), so no chance for save/restore in between.

I think generally, most "out" fields are consumed by KVM before
userspace can save/restore, hence them getting lost on save/restore is
fine?

It's still probably worse than we think; I see
svm->nested.ctl.bus_lock_rip is not saved/restored, because it's not
part of the VMCB. So in
svm_set_nested_state()->nested_vmcb02_prepare_control() we end up
comparing garbage to garbage (because vmcb02->save.rip is also wrong):

	if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == vmcb02->save.rip))
		vmcb02->control.bus_lock_counter = 1;
	else
		vmcb02->control.bus_lock_counter = 0;

> 
> In other words, AFAICT, nested.ctl.int_ctl is special in that KVM needs it to be
> > up-to-date at all times, *and* it needs to be copied back to vmcb12 (or userspace).

Hmm actually looking at nested.ctl.int_ctl, I don't think it's that
special. Most KVM usages are checking "in" bits, i.e. whether some
features (e.g. vGIF) are enabled or not.

The "out" bits seem to only be consumed by svm_clear_vintr(), and I
think this can be worked around.

So maybe we don't really need to keep it up-to-date in the cache at all
times.

> 
> Part of me wants to remove these two fields entirely:
> 
> 	/* cache for control fields of the guest */
> 	struct vmcb_ctrl_area_cached ctl;
> 
> 	/*
> 	 * Note: this struct is not kept up-to-date while L2 runs; it is only
> 	 * valid within nested_svm_vmrun.
> 	 */
> 	struct vmcb_save_area_cached save;
> 
> and instead use "full" caches only for the duration of nested_svm_vmrun().  Or
> hell, just copy the entire vmcb12 and throw the cached structures in the garbage.
> But that'll probably end in a game of whack-a-mole as things get moved back in.

Yeah, KVM needs to keep some of the fields around :/

> 
> So rather than do something totally drastic, I think we should kill
> nested_copy_vmcb_cache_to_control() and replace it with a "save control" flow.
> > And then have it share as much code as possible with nested_svm_vmexit(),
> and fixup nested_svm_vmexit() to not pull from svm->nested.ctl unnecessarily.
> > Which, again AFAICT, is pretty much limited to int_ctl: either vmcb02 is
> authoritative, or KVM shouldn't be updating vmcb12, and so only the "save control"
> for KVM_GET_NESTED_STATE needs to copy from the cache to the migrated vmcb12.

I think this works if we draw a clear distinction between "in", "out", and
"in/out" fields, which is not great because some fields (like int_ctl)
have different directions for different bits :/

But if we do draw that distinction, and have helpers that copy fields
based on direction, things become more intuitive:

During nested VMRUN, we use the "in" and "in/out" fields from cached
vmcb12 to construct vmcb02 through nested_vmcb02_prepare_control().

During save, we save "in" fields from the cached vmcb12, "out" and
"in/out" fields from vmcb02.

During restore, we use the "in" and "in/out" fields from the restored
payload to construct vmcb02 through nested_vmcb02_prepare_control(), AND
update the "out" fields as well from the payload.

During synthesized #VMEXIT, we save the "out" and "in/out" fields from
vmcb02 (shared part with save/restore).

The save/restore changes would need a flag to avoid restoring garbage
from an older KVM. It's also probably not as straightforward as I am
making it out to be. For example, "in/out" fields may not be reflected
as-is from vmcb12 to vmcb02, so if we save+restore with
nested_run_pending, we end up creating vmcb02 on the destination from
what we put in vmcb02 in the source, not vmcb12, which may or may not be
the right thing to do.

This is probably a heavier lift than we think it is, or maybe it's
simpler once I start coding it :)

> 
> That'll probably end up a bit fat for a stable@ patch, so we could do a gross
> one-off fix for this issue, and then do cleanups on top.

Honestly, I'd rather keep the existing patch for stable@. It's not that
complicated, and downstream trees that take it don't have to live with
the FIXME code.

The heavier lift to clean this up can be done separately, or I can send
a new version with the first 2 patches in the beginning for stable@ and
the cleanups on top, depends on how we decide to implement this.

> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 5f0136dbdde6..cd5664c65a00 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4435,6 +4435,16 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
>  
>         svm_complete_interrupts(vcpu);
>  
> +       /*
> +        * Update the cache after completing interrupts to get an accurate
> +        * NextRIP, e.g. when re-injecting a soft interrupt.
> +        *
> +        * FIXME: Rework svm_get_nested_state() to not pull data from the
> +        *        cache (except for maybe int_ctl).
> +        */
> +       if (is_guest_mode(vcpu))
> +               svm->nested.ctl.next_rip = svm->vmcb->control.next_rip;
> +
>         return svm_exit_handlers_fastpath(vcpu);
>  }
>  
> >  	/* Only a few fields of int_ctl are written by the processor.  */
> >  	mask = V_IRQ_MASK | V_TPR_MASK;
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 5f0136dbdde6..6d8d4d19455e 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4399,17 +4399,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> >  	sync_cr8_to_lapic(vcpu);
> >  
> >  	svm->next_rip = 0;
> > -	if (is_guest_mode(vcpu)) {
> > -		nested_sync_control_from_vmcb02(svm);
> > -
> > -		/* Track VMRUNs that have made past consistency checking */
> > -		if (svm->nested.nested_run_pending &&
> > -		    !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
> > -                        ++vcpu->stat.nested_run;
> > -
> > -		svm->nested.nested_run_pending = 0;
> > -	}
> > -
> >  	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
> >  
> >  	/*
> > @@ -4435,6 +4424,21 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> >  
> >  	svm_complete_interrupts(vcpu);
> >  
> > +	/*
> > +	 * svm_complete_interrupts() may update svm->vmcb->control.next_rip,
> > +	 * which is sync'd by nested_sync_control_from_vmcb02() below.
> 
> Please try to avoid referencing functions and fields in comments.  History has
> shown that they almost always become stale.

Generally agree, but in this case I am referencing the calls right above
and right below, and it's probably clearer to mention the ordering
constraint directly with their names.

That being said, if you feel strongly I can probably do something like your
suggestion above:

	/*
	 * Only sync fields from vmcb02 to cache after completing
	 * interrupts, as NextRIP may be updated (e.g. when re-injecting a
	 * soft interrupt).
	 */

> 
> > +	 */
> > +	if (is_guest_mode(vcpu)) {
> > +		nested_sync_control_from_vmcb02(svm);
> > +
> > +		/* Track VMRUNs that have made past consistency checking */
> > +		if (svm->nested.nested_run_pending &&
> > +		    !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
> > +			++vcpu->stat.nested_run;
> > +
> > +		svm->nested.nested_run_pending = 0;
> > +	}
> > +
> >  	return svm_exit_handlers_fastpath(vcpu);
> >  }
> >  
> > 
> > base-commit: e944fe2c09f405a2e2d147145c9b470084bc4c9a
> > -- 
> > 2.53.0.rc2.204.g2597b5adb4-goog
> > 


* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-10 16:25     ` Yosry Ahmed
@ 2026-02-10 19:20       ` Sean Christopherson
  2026-02-10 22:19         ` Yosry Ahmed
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-02-10 19:20 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

On Tue, Feb 10, 2026, Yosry Ahmed wrote:
> On Mon, Feb 09, 2026 at 05:49:06PM -0800, Sean Christopherson wrote:
> > On Tue, Feb 10, 2026, Yosry Ahmed wrote:
> > > After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs
> > > fields written by the CPU from vmcb02 to the cached vmcb12. This is
> > > because the cached vmcb12 is used as the authoritative copy of some of
> > > the controls, and is the payload when saving/restoring nested state.
> > > 
> > > next_rip is also written by the CPU (in some cases) after VMRUN, but is
> > > not sync'd to cached vmcb12. As a result, it is corrupted after
> > > save/restore (replaced by the original value written by L1 on nested
> > > VMRUN). This could cause problems for both KVM (e.g. when injecting a
> > > soft IRQ) or L1 (e.g. when using next_rip to advance RIP after emulating
> > > an instruction).
> > > 
> > > Fix this by sync'ing next_rip in nested_sync_control_from_vmcb02(). Move
> > > the call to nested_sync_control_from_vmcb02() (and the entire
> > > is_guest_mode() block) after svm_complete_interrupts(), as it may update
> > > next_rip in vmcb02.
> > 
> > I'll give you one guess as to what I would say about bundling changes.  AFAICT,
> > there is _zero_ reason to move the call to nested_sync_control_from_vmcb02() in a
> > patch tagged for stable@.
> 
> I generally agree with your previous feedback about combining changes,
> but I think I disagree for this specific instance. I did actually have
> two separate changes: one to move the call to
> nested_sync_control_from_vmcb02() (still tagged for stable@), and one to
> add next_rip.
> 
> However, I found myself explaining a lot of the next_rip context in the
> commit log of moving nested_sync_control_from_vmcb02(), to explain why
> it specifically needed to go after svm_complete_interrupts().

And?  That's kinda the whole point of changelogs.  I'm also not seeing an onerous
amount of documentation, e.g.

  KVM: SVM: Sync control from vmcb02 on #VMEXIT after completing interrupts

  Refresh KVM's cache of VMCB control fields, which is a weird combination
  of original vmcb12 data and current vmcb02 data, after completing
  interrupts and exceptions so that a future fix can refresh next_rip
  without dropping the next_rip updates made for completing soft interrupts.
  
> Also, I had to add the comment above the call to
> nested_sync_control_from_vmcb02() in the patch adding next_rip to it.
> 
> Looking at both patches, it made more sense to combine them given their
> tight connection and simplicity. The history is clearer when the move,
> comment, and next_rip addition are bundled.
> 
> Or..
> 
> Did you mean to have a patch that just copied next_rip outside of
> nested_sync_control_from_vmcb02(), after svm_complete_interrupts(), for
> stable@, and then clean it up on top? Eh, not a big fan of that either
> because the current patch is simple enough for stable@ imo.

My objection to bundling is that it subtly requires guaranteeing that none of the
fields updated by nested_sync_control_from_vmcb02() are consumed between its
current location and the new location.  That alone warrants a changelog.  I.e.
it's as much that there's _zero_ analysis in the current changelog as to the
safety, as it is that the movement is bundled together.

> > > Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
> > > CC: stable@vger.kernel.org
> > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > ---
> > >  arch/x86/kvm/svm/nested.c |  6 ++++--
> > >  arch/x86/kvm/svm/svm.c    | 26 +++++++++++++++-----------
> > >  2 files changed, 19 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index de90b104a0dd..70086ba6497f 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -519,8 +519,10 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
> > >  void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
> > >  {
> > >  	u32 mask;
> > > -	svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
> > > -	svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
> > > +
> > > +	svm->nested.ctl.event_inj	= svm->vmcb->control.event_inj;
> > > +	svm->nested.ctl.event_inj_err	= svm->vmcb->control.event_inj_err;
> > > +	svm->nested.ctl.next_rip	= svm->vmcb->control.next_rip;
> > 
> > This is all a mess (the existing code).  nested_svm_vmexit() does this:
> > 
> > 	vmcb12->control.int_state         = vmcb02->control.int_state;
> > 	vmcb12->control.exit_code         = vmcb02->control.exit_code;
> > 	vmcb12->control.exit_info_1       = vmcb02->control.exit_info_1;
> > 	vmcb12->control.exit_info_2       = vmcb02->control.exit_info_2;
> > 
> > 	if (!svm_is_vmrun_failure(vmcb12->control.exit_code))
> > 		nested_save_pending_event_to_vmcb12(svm, vmcb12);
> > 
> > 	if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> > 		vmcb12->control.next_rip  = vmcb02->control.next_rip;
> > 
> > 	vmcb12->control.int_ctl           = svm->nested.ctl.int_ctl;
> > 	vmcb12->control.event_inj         = svm->nested.ctl.event_inj;
> > 	vmcb12->control.event_inj_err     = svm->nested.ctl.event_inj_err;
> > 
> > but then svm_get_nested_state(), by way of nested_copy_vmcb_cache_to_control(),
> > pulls everything from the cached fields.  Which probably only works because the
> > only fields that are pulled from vmcb02 by nested_svm_vmexit() are never modified
> > by the CPU.
> 
> Yeah I think that's the key. The main distinction is whether fields
> are "in", "out", or "in/out" fields. I wish those were more clearly
> separated by the APM. More below.
> 
> > 
> > Actually, I take that back, I have no idea how this code works.  How does e.g.
> > exit_info_1 not get clobbered on save/restore?
> 
> I *think* KVM always sets the error_code and exit_info_* fields before
> synthesizing a #VMEXIT to L1, usually right before calling
> nested_svm_vmexit(), so no chance for save/restore in between.

Ugh, right, KVM generally doesn't recognize signals until after invoking the exit
handler.  Actually, even that isn't the key, it's that this flaw only affects
"in/out" fields, as you note above.  Heh, and even that probably isn't entirely
precise, as it's really "in/out fields that KVM consumes while running L2 are
buggy".  E.g. in this case, nested_vmcb02_prepare_control() pulls next_rip for
vmcb02 from the cache.

	if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
	else if (boot_cpu_has(X86_FEATURE_NRIPS))
		vmcb02->control.next_rip    = vmcb12_rip;

> I think generally, most "out" fields are consumed by KVM before userspace can
> save/restore, hence them getting lost on save/restore is fine?

Yep.

> It's still probably worse than we think, I see
> svm->nested.ctl.bus_lock_rip is not saved/restored, because it's not
> part of the VMCB. So in
> svm_set_nested_state()->nested_vmcb02_prepare_control() we end up
> comparing garbage to garbage (because vmcb02->save.rip is also wrong):
> 
> 	if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == vmcb02->save.rip))
> 		vmcb02->control.bus_lock_counter = 1;
> 	else
> 		vmcb02->control.bus_lock_counter = 0;

Let's ignore this one.  It's overall a non-issue, and we're already planning on
moving it out of the control cache.

> > In other words, AFAICT, nested.ctl.int_ctl is special in that KVM needs it to be
> > up-to-date at all times, *and* it needs to copied back to vmcb12 (or userspace).
> 
> Hmm actually looking at nested.ctl.int_ctl, I don't think it's that special.
> Most KVM usages are checking "in" bits, i.e. whether some features (e.g.
> vGIF) are enabled or not.
>
> The "out" bits seem to only be consumed by svm_clear_vintr(), and I
> think this can be worked around.

OMG that code makes my head hurt.  Isn't that code just this?

	/*
	 * Drop int_ctl fields related to VINTR injection.  If L2 is active,
	 * restore the virtual IRQ flag and its vector from vmcb12 now that KVM
	 * is done usurping virtual IRQs for its own purposes.
	 */
	svm->vmcb01.ptr->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;

	if (is_guest_mode(&svm->vcpu)) {
		svm->vmcb->control.int_ctl = (svm->vmcb->control.int_ctl & ~V_IRQ_MASK) |
					     (svm->nested.ctl.int_ctl & V_IRQ_MASK);
		svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
	} else {
		WARN_ON_ONCE(svm->vmcb != svm->vmcb01.ptr);
	}

	svm_clr_intercept(svm, INTERCEPT_VINTR);
	vmcb_mark_dirty(svm->vmcb, VMCB_INTR);

> So maybe we don't really need to keep it up-to-date in the cache at all
> times.

Yeah, IMO that approach is unnecessarily convoluted.  The actual logic isn't all
that complex, all of the complexity comes from juggling state between the cache
and vmcb02 just so that the cache can be authoritative.  Given that we failed
miserably in actually making the cache authoritative, e.g. see the nested #VMEXIT
flow, I think we should kill the entire concept and instead maintain an *exact*
snapshot of vmcb12's controls, and then make vmcb02 authoritative.  We'll still
need the logic in nested_sync_control_from_vmcb02() to update int_ctl on save
or #VMEXIT if KVM is still intercepting VINTR for its own purposes, but at least
the code will be contained.

Then to make it all but impossible to re-introduce this mess, do something like
this so that someone would have to go way out of their way to try and modify the
cache.

diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ebd7b36b1ceb..2de6305be9ce 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -199,14 +199,13 @@ struct svm_nested_state {
         * we cannot inject a nested vmexit yet.  */
        bool nested_run_pending;
 
-       /* cache for control fields of the guest */
-       struct vmcb_ctrl_area_cached ctl;
-
        /*
-        * Note: this struct is not kept up-to-date while L2 runs; it is only
-        * valid within nested_svm_vmrun.
+        * An opaque, read-only cache of vmcb12 controls, used to query L1's
+        * controls while running L2, e.g. to route intercepts appropriately.
+        * All reads are routed through accessors to make it all but impossible
+        * for KVM to clobber its snapshot of vmcb12.
         */
-       struct vmcb_save_area_cached save;
+       u8 __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];
 
        bool initialized;

> > Part of me wants to remove these two fields entirely:
> > 
> > 	/* cache for control fields of the guest */
> > 	struct vmcb_ctrl_area_cached ctl;
> > 
> > 	/*
> > 	 * Note: this struct is not kept up-to-date while L2 runs; it is only
> > 	 * valid within nested_svm_vmrun.
> > 	 */
> > 	struct vmcb_save_area_cached save;
> > 
> > and instead use "full" caches only for the duration of nested_svm_vmrun().  Or
> > hell, just copy the entire vmcb12 and throw the cached structures in the garbage.
> > But that'll probably end in a game of whack-a-mole as things get moved back in.
> 
> Yeah, KVM needs to keep some of the fields around :/

For me, that's totally fine.  As above, the problem I see is that there is no
single source of truth, i.e. that the authoritative state is spread across vmcb02
and the cache.

> > So rather than do something totally drastic, I think we should kill
> > nested_copy_vmcb_cache_to_control() and replace it with a "save control" flow.
> > And then have it share code as much code as possible with nested_svm_vmexit(),
> > and fixup nested_svm_vmexit() to not pull from svm->nested.ctl unnecessarily.
> > Which, again AFAICT, is pretty much limited to int_ctl: either vmcb02 is
> > authoritative, or KVM shouldn't be updating vmcb12, and so only the "save control"
> > for KVM_GET_NESTED_STATE needs to copy from the cache to the migrated vmcb12.
> 
> I think this works if we draw a clear distinction between "in", "out", and
> "in/out" fields, which is not great because some fields (like int_ctl)
> have different directions for different bits :/
> 
> But if we do draw that distinction, and have helpers that copy fields
> based on direction, things become more intuitive:
> 
> During nested VMRUN, we use the "in" and "in/out" fields from cached
> vmcb12 to construct vmcb02 through nested_vmcb02_prepare_control().
> 
> During save, we save "in" fields from the cached vmcb12, "out" and
> "in/out" fields from vmcb02.
> 
> During restore, we use the "in" and "in/out" fields from the restored
> payload to construct vmcb02 through nested_vmcb02_prepare_control(), AND
> update the "out" fields as well from the payload.

Why the last part?  If L2 is active, then the pure "out" fields are guaranteed
to be written on nested #VMEXIT.  Anything else simply can't work.

> During synthesized #VMEXIT, we save the "out" and "in/out" fields from
> vmcb02 (shared part with save/restore).

Yeah, that all works.  We could also treat save() as an extension of #VMEXIT, but
that could make KVM_GET_NESTED_STATE non-idempotent (which might already be the
case for VMX?).  I.e. we could sync vmcb02 to vmcb12 (cache), and then copy that
to userspace. 

> The save/restore changes would need a flag to avoid restoring garbage
> from an older KVM.

I don't follow.  I was thinking we'd only change how KVM maintains authoritative
state while running L2, i.e. not make any changes (other than fixes) to the
serialized state for save/restore.

> It's also probably not as straightforward as I am making it out to be. For
> example, "in/out" fields may not be reflected as-is from vmcb12 to vmcb02, so
> if we save+restore with nested_run_pending, we end up creating vmcb02 on the
> destination from what we put in vmcb02 in the source, not vmcb12, which may
> or may not be the right thing to do.

If that's a problem, we've already messed up.  Because we _must_ get that right
for nested #VMEXIT, i.e. KVM _must_ be able to extract the pieces of vmcb02 that
belong to L1 vs. L0.  At a glance, it seems very doable to write the code so that
it's shareable between #VMEXIT and save().

> This is probably a heavier lift than we think it is, or maybe it's
> simpler once I start coding it :)

Maybe?  It's certainly not trivial, but I don't think it's terribly complex either,
at least not what I have in mind.

> > That'll probably end up a bit fat for a stable@ patch, so we could do a gross
> > one-off fix for this issue, and then do cleanups on top.
> 
> Honestly, I'd rather keep the existing patch for stable@. It's not that
> complicated, and downstream trees that take it don't have to live with
> the FIXME code.

I'd rather omit the FIXME than move the nested_sync_control_from_vmcb02() call.
Because it's entirely possible that the code movement will apply cleanly to an
older kernel, but semantically be broken due to something in-between consuming
int_ctl.  The odds of that happening are low, but I don't want to have to audit
every LTS kernel (or set up others to fail).

> The heavier lift to clean this up can be done separately,

Yes, for sure.

> or I can send a new version with the first 2 patches in the beginning for
> stable@ and the cleanups on top, depends on how we decide to implement this.

As above, my preference is to throw in a super minimal fix for next_rip, and then
commit ourselves to removing nested_sync_control_from_vmcb02() sooner or later.

> > > @@ -4435,6 +4424,21 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> > >  
> > >  	svm_complete_interrupts(vcpu);
> > >  
> > > +	/*
> > > +	 * svm_complete_interrupts() may update svm->vmcb->control.next_rip,
> > > +	 * which is sync'd by nested_sync_control_from_vmcb02() below.
> > 
> > Please try to avoid referencing functions and fields in comments.  History has
> > shown that they almost always become stale.
> 
> Generally agree, but in this case I am referencing the calls right above
> and right below, and it's probably clearer to mention the ordering
> constraint directly with their names.
> 
> That being said, if you feel strongly

I do.  I appreciate that it requires more effort to write comments that don't
reference functions/variables/fields, and that it can be downright annoying, but
we've had far too many orphaned comments over the years.

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-10 19:20       ` Sean Christopherson
@ 2026-02-10 22:19         ` Yosry Ahmed
  2026-02-11  0:09           ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Yosry Ahmed @ 2026-02-10 22:19 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

On Tue, Feb 10, 2026 at 11:20:19AM -0800, Sean Christopherson wrote:
> On Tue, Feb 10, 2026, Yosry Ahmed wrote:
> > On Mon, Feb 09, 2026 at 05:49:06PM -0800, Sean Christopherson wrote:
> > > On Tue, Feb 10, 2026, Yosry Ahmed wrote:
> > > > After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs
> > > > fields written by the CPU from vmcb02 to the cached vmcb12. This is
> > > > because the cached vmcb12 is used as the authoritative copy of some of
> > > > the controls, and is the payload when saving/restoring nested state.
> > > > 
> > > > next_rip is also written by the CPU (in some cases) after VMRUN, but is
> > > > not sync'd to cached vmcb12. As a result, it is corrupted after
> > > > save/restore (replaced by the original value written by L1 on nested
> > > > VMRUN). This could cause problems for both KVM (e.g. when injecting a
> > > > soft IRQ) or L1 (e.g. when using next_rip to advance RIP after emulating
> > > > an instruction).
> > > > 
> > > > Fix this by sync'ing next_rip in nested_sync_control_from_vmcb02(). Move
> > > > the call to nested_sync_control_from_vmcb02() (and the entire
> > > > is_guest_mode() block) after svm_complete_interrupts(), as it may update
> > > > next_rip in vmcb02.
> > > 
> > > I'll give you one guess as to what I would say about bundling changes.  AFAICT,
> > > there is _zero_ reason to move the call to nested_sync_control_from_vmcb02() in a
> > > patch tagged for stable@.
> > 
> > I generally agree with your previous feedback about combining changes,
> > but I think I disagree for this specific instance. I did actually have
> > two separate changes: one to move the call to
> > nested_sync_control_from_vmcb02() (still tagged for stable@), and one to
> > add next_rip.
> > 
> > However, I found myself explaining a lot of the next_rip context in the
> > commit log of moving nested_sync_control_from_vmcb02(), to explain why
> > it specifically needed to go after svm_complete_interrupts().
> 
> And?  That's kinda the whole point of changelogs.  I'm also not seeing an onerous
> amount of documentation, e.g.
> 
>   KVM: SVM: Sync control from vmcb02 on #VMEXIT after completing interrupts
> 
>   Refresh KVM's cache of VMCB control fields, which is a weird combination
>   of original vmcb12 data and current vmcb02 data, after completing
>   interrupts and exceptions so that a future fix can refresh next_rip
>   without dropping the next_rip updates made for completing soft interrupts.
>   
> > Also, I had to add the comment above the call to
> > nested_sync_control_from_vmcb02() in the patch adding next_rip to it.
> > 
> > Looking at both patches, it made more sense to combine them given their
> > tight connection and simplicity. The history is clearer when the move,
> > comment, and next_rip addition are bundled.
> > 
> > Or..
> > 
> > Did you mean to have a patch that just copied next_rip outside of
> > nested_sync_control_from_vmcb02(), after svm_complete_interrupts(), for
> > stable@, and then clean it up on top? Eh, not a big fan of that either
> > because the current patch is simple enough for stable@ imo.
> 
> My objection to bundling is that it subtly requires guaranteeing that none of the
> fields updated by nested_sync_control_from_vmcb02() are consumed between its
> current location and the new location.  That alone warrants a changelog.  I.e.
> it's as much that there's _zero_ analysis in the current changelog as to the
> safety, as it is that the movement is bundled together.

Now I agree with this: I should have included more details about why
it's safe to move the call later. But that feedback is orthogonal to
splitting the commit imo.

Anyway, we'll probably drop the code movement.

> 
> > > > Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
> > > > CC: stable@vger.kernel.org
> > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > ---
> > > >  arch/x86/kvm/svm/nested.c |  6 ++++--
> > > >  arch/x86/kvm/svm/svm.c    | 26 +++++++++++++++-----------
> > > >  2 files changed, 19 insertions(+), 13 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > > index de90b104a0dd..70086ba6497f 100644
> > > > --- a/arch/x86/kvm/svm/nested.c
> > > > +++ b/arch/x86/kvm/svm/nested.c
> > > > @@ -519,8 +519,10 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
> > > >  void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
> > > >  {
> > > >  	u32 mask;
> > > > -	svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
> > > > -	svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
> > > > +
> > > > +	svm->nested.ctl.event_inj	= svm->vmcb->control.event_inj;
> > > > +	svm->nested.ctl.event_inj_err	= svm->vmcb->control.event_inj_err;
> > > > +	svm->nested.ctl.next_rip	= svm->vmcb->control.next_rip;
> > > 
> > > This is all a mess (the existing code).  nested_svm_vmexit() does this:
> > > 
> > > 	vmcb12->control.int_state         = vmcb02->control.int_state;
> > > 	vmcb12->control.exit_code         = vmcb02->control.exit_code;
> > > 	vmcb12->control.exit_info_1       = vmcb02->control.exit_info_1;
> > > 	vmcb12->control.exit_info_2       = vmcb02->control.exit_info_2;
> > > 
> > > 	if (!svm_is_vmrun_failure(vmcb12->control.exit_code))
> > > 		nested_save_pending_event_to_vmcb12(svm, vmcb12);
> > > 
> > > 	if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> > > 		vmcb12->control.next_rip  = vmcb02->control.next_rip;
> > > 
> > > 	vmcb12->control.int_ctl           = svm->nested.ctl.int_ctl;
> > > 	vmcb12->control.event_inj         = svm->nested.ctl.event_inj;
> > > 	vmcb12->control.event_inj_err     = svm->nested.ctl.event_inj_err;
> > > 
> > > but then svm_get_nested_state(), by way of nested_copy_vmcb_cache_to_control(),
> > > pulls everything from the cached fields.  Which probably only works because the
> > > only fields that are pulled from vmcb02 by nested_svm_vmexit() are never modified
> > > by the CPU.
> > 
> > Yeah I think that's the key. The main distinction is whether fields
> > are "in", "out", or "in/out" fields. I wish those were more clearly
> > separated by the APM. More below.
> > 
> > > 
> > > Actually, I take that back, I have no idea how this code works.  How does e.g.
> > > exit_info_1 not get clobbered on save/restore?
> > 
> > I *think* KVM always sets the error_code and exit_info_* fields before
> > synthesizing a #VMEXIT to L1, usually right before calling
> > > nested_svm_vmexit(), so no chance for save/restore in between.
> 
> Ugh, right, KVM generally doesn't recognize signals until after invoking the exit
> handler.  Actually, even that isn't the key, it's that this flaw only affects
> "in/out" fields, as you note above.  Heh, and even that probably isn't entirely
> precise, as it's really "in/out fields that KVM consumes while running L2 are
> buggy".  E.g. in this case, nested_vmcb02_prepare_control() pulls next_rip for
> vmcb02 from the cache.
> 
> 	if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> 	else if (boot_cpu_has(X86_FEATURE_NRIPS))
> 		vmcb02->control.next_rip    = vmcb12_rip;

Hmm, I think a pure "out" field would be affected if it's consumed by KVM
later on, after save+restore is possible, right?

Do you mean that it just happens that the currently affected fields are
"in/out" fields? Or is there a reason why pure "out" fields cannot be
affected? AFAICT, these fields are lost after save+restore.

> 
> > I think generally, most "out" fields are consumed by KVM before userspace can
> > save/restore, hence them getting lost on save/restore is fine?
> 
> Yep.
> 
> > It's still probably worse than we think, I see
> > svm->nested.ctl.bus_lock_rip is not saved/restored, because it's not
> > part of the VMCB. So in
> > svm_set_nested_state()->nested_vmcb02_prepare_control() we end up
> > comparing garbage to garbage (because vmcb02->save.rip is also wrong):
> > 
> > 	if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == vmcb02->save.rip))
> > 		vmcb02->control.bus_lock_counter = 1;
> > 	else
> > 		vmcb02->control.bus_lock_counter = 0;
> 
> Let's ignore this one.  It's overall a non-issue, and we're already planning on
> moving it out of the control cache.
> 
> > > In other words, AFAICT, nested.ctl.int_ctl is special in that KVM needs it to be
> > > up-to-date at all times, *and* it needs to copied back to vmcb12 (or userspace).
> > 
> > Hmm actually looking at nested.ctl.int_ctl, I don't think it's that special.
> > Most KVM usages are checking "in" bits, i.e. whether some features (e.g.
> > vGIF) are enabled or not.
> >
> > The "out" bits seem to only be consumed by svm_clear_vintr(), and I
> > think this can be worked around.
> 
> OMG that code makes my head hurt.  Isn't that code just this?
> 
> 	/*
> 	 * Drop int_ctl fields related to VINTR injection.  If L2 is active,
> 	 * restore the virtual IRQ flag and its vector from vmcb12 now that KVM
> 	 * is done usurping virtual IRQs for its own purposes.
> 	 */
> 	svm->vmcb01.ptr->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;
> 
> 	if (is_guest_mode(&svm->vcpu)) {
> 		svm->vmcb->control.int_ctl = (svm->vmcb->control.int_ctl & ~V_IRQ_MASK) |
> 					     (svm->nested.ctl.int_ctl & V_IRQ_MASK);

Also V_INTR_PRIO_MASK, I think?

But otherwise yeah I think that's what the function is doing
more-or-less.

> 		svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
> 	} else {
> 		WARN_ON_ONCE(svm->vmcb != svm->vmcb01.ptr);
> 	}
> 
> 	svm_clr_intercept(svm, INTERCEPT_VINTR);
> 	vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
> 
> > So maybe we don't really need to keep it up-to-date in the cache at all
> > times.
> 
> Yeah, IMO that approach is unnecessarily convoluted.  The actual logic isn't all
> that complex, all of the complexity comes from juggling state between the cache
> and vmcb02 just so that the cache can be authoritative.  Given that we failed
> miserably in actually making the cache authoritative, e.g. see the nested #VMEXIT
> flow, I think we should kill the entire concept and instead maintain an *exact*

When you say *exact* snapshot, do you mean move all the sanitizing logic
recently introduced in __nested_copy_vmcb_control_to_cache() (by Kevin
and myself) to sanitize vmcb02 instead?

That would be annoying. For example, for the intercepts, sanitizing
vmcb02 (but not vmcb12) means that we need to also add checks in the
exit path (i.e. in nested_svm_intercept() or even
vmcb12_is_intercept()), as vmcb12 could have illegal intercepts.

> snapshot of vmcb12's controls, and then make vmcb02 authoritative.  We'll still
> need the logic in nested_sync_control_from_vmcb02() to update int_ctl on save
> or #VMEXIT if KVM is still intercepting VINTR for its own purposes, but at least
> the code will be contained.

Yeah. We should leave it out of the vmcb12 cache though.

> 
> Then to make it all but impossible to re-introduce this mess, do something like
> this so that someone would have to go way out of their way to try and modify the
> cache.
> 
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index ebd7b36b1ceb..2de6305be9ce 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -199,14 +199,13 @@ struct svm_nested_state {
>          * we cannot inject a nested vmexit yet.  */
>         bool nested_run_pending;
>  
> -       /* cache for control fields of the guest */
> -       struct vmcb_ctrl_area_cached ctl;
> -
>         /*
> -        * Note: this struct is not kept up-to-date while L2 runs; it is only
> -        * valid within nested_svm_vmrun.
> +        * An opaque, read-only cache of vmcb12 controls, used to query L1's
> +        * controls while running L2, e.g. to route intercepts appropriately.
> +        * All reads are routed through accessors to make it all but impossible
> +        * for KVM to clobber its snapshot of vmcb12.
>          */
> -       struct vmcb_save_area_cached save;

Is dropping the cached save area intentional?

We can drop it and make it a local variable in nested_svm_vmrun(), and
plumb it all the way down. But it could be too big for the stack.
Allocating it every time isn't nice either.

Do you mean to also make it opaque?

> +       u8 __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];

We have a lot of accesses to svm->nested.ctl, so we'll need a lot of
clutter to cast the field in all of these places.

Maybe we add a read-only accessor that returns a pointer to a constant
struct?
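
Something along these lines, perhaps (this is only a sketch of the idea;
the struct members, the writer helper, and the accessor name are made-up
stand-ins for illustration, not the real KVM definitions):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, minimal stand-ins for the KVM structures, just enough
 * to illustrate the opaque-buffer + read-only-accessor idea discussed
 * above.  The real kernel version would of course use the actual
 * vmcb_ctrl_area_cached layout. */
struct vmcb_ctrl_area_cached {
	unsigned int int_ctl;
	unsigned long long next_rip;
};

struct svm_nested_state {
	/* Opaque storage for the vmcb12 control snapshot.  _Alignas
	 * keeps the byte buffer suitably aligned for the struct that
	 * is overlaid on it. */
	_Alignas(struct vmcb_ctrl_area_cached)
	unsigned char __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];
};

/* Read-only accessor: the cast lives in exactly one place, and callers
 * only ever get a const pointer, so clobbering the snapshot requires
 * going way out of one's way. */
static inline const struct vmcb_ctrl_area_cached *
nested_vmcb12_ctrl(const struct svm_nested_state *nested)
{
	return (const struct vmcb_ctrl_area_cached *)nested->__vmcb12_ctrl;
}

/* The single writer, used only when caching vmcb12 on nested VMRUN. */
static inline void
nested_cache_vmcb12_ctrl(struct svm_nested_state *nested,
			 const struct vmcb_ctrl_area_cached *ctl)
{
	memcpy(nested->__vmcb12_ctrl, ctl, sizeof(*ctl));
}
```

Existing svm->nested.ctl readers would then become
nested_vmcb12_ctrl(...)->field with no casts at the call sites.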

>  
>         bool initialized;
> 
> > > Part of me wants to remove these two fields entirely:
> > > 
> > > 	/* cache for control fields of the guest */
> > > 	struct vmcb_ctrl_area_cached ctl;
> > > 
> > > 	/*
> > > 	 * Note: this struct is not kept up-to-date while L2 runs; it is only
> > > 	 * valid within nested_svm_vmrun.
> > > 	 */
> > > 	struct vmcb_save_area_cached save;
> > > 
> > > and instead use "full" caches only for the duration of nested_svm_vmrun().  Or
> > > hell, just copy the entire vmcb12 and throw the cached structures in the garbage.
> > > But that'll probably end in a game of whack-a-mole as things get moved back in.
> > 
> > Yeah, KVM needs to keep some of the fields around :/
> 
> For me, that's totally fine.  As above, the problem I see is that there is no
> single source of truth, i.e. that the authoritative state is spread across vmcb02
> and the cache.
> 
> > > So rather than do something totally drastic, I think we should kill
> > > nested_copy_vmcb_cache_to_control() and replace it with a "save control" flow.
> > > And then have it share code as much code as possible with nested_svm_vmexit(),
> > > and fixup nested_svm_vmexit() to not pull from svm->nested.ctl unnecessarily.
> > > Which, again AFAICT, is pretty much limited to int_ctl: either vmcb02 is
> > > authoritative, or KVM shouldn't be updating vmcb12, and so only the "save control"
> > > for KVM_GET_NESTED_STATE needs to copy from the cache to the migrated vmcb12.
> > 
> > I think this works if we draw a clear distinction between "in", "out", and
> > "in/out" fields, which is not great because some fields (like int_ctl)
> > have different directions for different bits :/
> > 
> > But if we do draw that distinction, and have helpers that copy fields
> > based on direction, things become more intuitive:
> > 
> > During nested VMRUN, we use the "in" and "in/out" fields from cached
> > vmcb12 to construct vmcb02 through nested_vmcb02_prepare_control().
> > 
> > During save, we save "in" fields from the cached vmcb12, "out" and
> > "in/out" fields from vmcb02.
> > 
> > During restore, we use the "in" and "in/out" fields from the restored
> > payload to construct vmcb02 through nested_vmcb02_prepare_control(), AND
> > update the "out" fields as well from the payload.
> 
> Why the last part?  If L2 is active, then the pure "out" fields are guaranteed
> to be written on nested #VMEXIT.  Anything else simply can't work.

I think it just so happens that all pure "out" fields are consumed by
KVM before save+restore is possible, but it is possible for an "out"
field to be used by KVM at a later point, or copied from vmcb02 to
vmcb12 during nested #VMEXIT (e.g. if KVM exits to L1 directly after
save+restore, before running L2).

next_rip and int_state fall in this bucket; they just happen to also be
"in" fields.

For example, if support for decode assists is added, there will be cases
where KVM just copies insn_bytes from vmcb02 to vmcb12 on nested
#VMEXIT. If insn_bytes is lost on save+restore, and KVM immediately
exits to L1 after restore, insn_bytes is lost.

So we need to also save+restore pure "out" fields, which we do not do
today.
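
FWIW, the "helpers that copy fields based on direction" idea could look
roughly like this (the tiny struct, the field names, and the directions
assigned to them are hypothetical stand-ins for illustration, not the
real VMCB layout):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Classify each control field as written by L1 (IN), by the CPU (OUT),
 * or both, then drive the VMRUN / #VMEXIT / save-restore copy loops
 * from a single table so no flow can silently disagree with another. */
enum field_dir { DIR_IN = 1, DIR_OUT = 2 };

struct ctrl {
	unsigned long long next_rip;	/* in/out */
	unsigned long long exit_info_1;	/* out */
	unsigned int intercepts;	/* in */
};

struct field_desc {
	size_t off;
	size_t size;
	unsigned int dir;
};

#define FIELD(name, d) { offsetof(struct ctrl, name), \
			 sizeof(((struct ctrl *)0)->name), (d) }

static const struct field_desc ctrl_fields[] = {
	FIELD(next_rip,		DIR_IN | DIR_OUT),
	FIELD(exit_info_1,	DIR_OUT),
	FIELD(intercepts,	DIR_IN),
};

/* Copy every field whose direction matches 'mask' from src to dst. */
static void copy_ctrl_fields(struct ctrl *dst, const struct ctrl *src,
			     unsigned int mask)
{
	size_t i;

	for (i = 0; i < sizeof(ctrl_fields) / sizeof(ctrl_fields[0]); i++) {
		if (ctrl_fields[i].dir & mask)
			memcpy((char *)dst + ctrl_fields[i].off,
			       (const char *)src + ctrl_fields[i].off,
			       ctrl_fields[i].size);
	}
}
```

VMRUN would then be copy_ctrl_fields(vmcb02, vmcb12, DIR_IN), and
#VMEXIT/save would be copy_ctrl_fields(vmcb12, vmcb02, DIR_OUT), with
per-bit fields like int_ctl still needing special-case merging.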

> 
> > During synthesized #VMEXIT, we save the "out" and "in/out" fields from
> > vmcb02 (shared part with save/restore).
> 
> Yeah, that all works.  We could also treat save() as an extension of #VMEXIT, but
> that could make KVM_GET_NESTED_STATE non-idempotent (which might already be the
> case for VMX?).  I.e. we could sync vmcb02 to vmcb12 (cache), and then copy that
> to userspace.

I think this will require adding more fields to the cache, but wait, we
already have a lot of "out" fields there but I don't think they are
being used at all..

Anyway, this may make things simpler. Instead of pulling different
fields from either cached vmcb12 and vmcb02, we always combine them
first. I will keep that in mind.

> 
> > The save/restore changes would need a flag to avoid restoring garbage
> > from an older KVM.
> 
> I don't follow.  I was thinking we'd only change how KVM maintains authoritative
> state while running L2, i.e. not make any changes (other than fixes) to the
> serialized state for save/restore.

I thought we're not currently saving "out" fields at all, but
apparently we are, we just do not use them in svm_set_nested_state(). So
we probably do not need a flag. Even if some fields are not currently
copied, I assume KVM restoring garbage from an older KVM is no worse
than having uninitialized garbage :)

I think this will be annoying when new fields are added, like
insn_bytes. Perhaps at some point we move to just serializing the entire
combined vmcb02/vmcb12 control area and add a flag for that.

> 
> > It's also probably not as straightforward as I am making it out to be. For
> > example, "in/out" fields may not be reflected as-is from vmcb12 to vmcb02, so
> > if we save+restore with nested_run_pending, we end up creating vmcb02 on the
> > destination from what we put in vmcb02 in the source, not vmcb12, which may
> > or may not be the right thing to do.
> 
> If that's a problem, we've already messed up.  Because we _must_ get that right
> for nested #VMEXIT, i.e. KVM _must_ be able to extract the pieces of vmcb02 that
> belong to L1 vs. L0.  At a glance, it seems very doable to write the code so that
> it's shareable between #VMEXIT and save().
> 
> > This is probably a heavier lift than we think it is, or maybe it's
> > simpler once I start coding it :)
> 
> Maybe?  It's certainly not trivial, but I don't think it's terribly complex either,
> at least not what I have in mind.
> 
> > > That'll probably end up a bit fat for a stable@ patch, so we could do a gross
> > > one-off fix for this issue, and then do cleanups on top.
> > 
> > Honestly, I'd rather keep the existing patch for stable@. It's not that
> > complicated, and downstream trees that take it don't have to live with
> > the FIXME code.
> 
> I'd rather omit the FIXME than move the nested_sync_control_from_vmcb02() call.
> Because it's entirely possible that the code movement will apply cleanly to an
> older kernel, but semantically be broken due to something in-between consuming
> int_ctl.  The odds of that happening are low, but I don't want to have to audit
> every LTS kernel (or set up others to fail).

That's fair, but I will keep the FIXME. My problem was not the comment,
it was sync'ing that one field outside of
nested_sync_control_from_vmcb02().

> 
> > The heavier lift to clean this up can be done separately,
> 
> Yes, for sure.
> 
> > or I can send a new version with the first 2 patches in the beginning for
> > stable@ and the cleanups on top, depends on how we decide to implement this.
> 
> As above, my preference is to throw in a super minimal fix for next_rip, and then
> commit ourselves to removing nested_sync_control_from_vmcb02() sooner or later.

I will send a v2 with the FIXME fix for next_rip, I assume the int_state
fix in nested_sync_control_from_vmcb02() is fine.

Hopefully I will find time in the coming weeks to work on the broader
cleanup.

> 
> > > > @@ -4435,6 +4424,21 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> > > >  
> > > >  	svm_complete_interrupts(vcpu);
> > > >  
> > > > +	/*
> > > > +	 * svm_complete_interrupts() may update svm->vmcb->control.next_rip,
> > > > +	 * which is sync'd by nested_sync_control_from_vmcb02() below.
> > > 
> > > Please try to avoid referencing functions and fields in comments.  History has
> > > shown that they almost always become stale.
> > 
> > Generally agree, but in this case I am referencing the calls right above
> > and right below, and it's probably clearer to mention the ordering
> > constraint directly with their names.
> > 
> > That being said, if you feel strongly
> 
> I do.  I appreciate that it requires more effort to write comments that don't
> reference functions/variables/fields, and that it can be downright annoying, but
> we've had far too many orphaned comments over the years.

Ack, will use your FIXME wording as-is.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-10 22:19         ` Yosry Ahmed
@ 2026-02-11  0:09           ` Sean Christopherson
  2026-02-11  0:27             ` Yosry Ahmed
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-02-11  0:09 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

On Tue, Feb 10, 2026, Yosry Ahmed wrote:
> On Tue, Feb 10, 2026 at 11:20:19AM -0800, Sean Christopherson wrote:
> > > > Actually, I take that back, I have no idea how this code works.  How does e.g.
> > > > exit_info_1 not get clobbered on save/restore?
> > > 
> > > I *think* KVM always sets the error_code and exit_info_* fields before
> > > synthesizing a #VMEXIT to L1, usually right before calling
> > > nested_svm_vmexit(), so no chance for save/restore in between.
> > 
> > Ugh, right, KVM generally doesn't recognize signals until after invoking the exit
> > handler.  Actually, even that isn't the key, it's that this flaw only affects
> > "in/out" fields, as you note above.  Heh, and even that probably isn't entirely
> > precise, as it's really "in/out fields that KVM consumes while running L2 are
> > buggy".  E.g. in this case, nested_vmcb02_prepare_control() pulls next_rip for
> > vmcb02 from the cache.
> > 
> > 	if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> > 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> > 	else if (boot_cpu_has(X86_FEATURE_NRIPS))
> > 		vmcb02->control.next_rip    = vmcb12_rip;
> 
> Hmm I think a pure "out" field would be affected if it's consumed by KVM
> later on, after save+restore is possible, right?

True.

> Do you mean that it just happens that the currently affected fields are
> "in/out" fields? Or is there a reason why pure "out" fields cannot be
> affected? AFAICT, these fields are lost after save+restore.

I was essentially assuming KVM wouldn't ever consume pure out fields from the
cache.


> > > > In other words, AFAICT, nested.ctl.int_ctl is special in that KVM needs it to be
> > > > up-to-date at all times, *and* it needs to be copied back to vmcb12 (or userspace).
> > > 
> > > Hmm actually looking at nested.ctl.int_ctl, I don't think it's that special.
> > > Most KVM usages are checking "in" bits, i.e. whether some features (e.g.
> > > vGIF) are enabled or not.
> > >
> > > The "out" bits seem to only be consumed by svm_clear_vintr(), and I
> > > think this can be worked around.
> > 
> > OMG that code makes my head hurt.  Isn't that code just this?
> > 
> > 	/*
> > 	 * Drop int_ctl fields related to VINTR injection.  If L2 is active,
> > 	 * restore the virtual IRQ flag and its vector from vmcb12 now that KVM
> > 	 * is done usurping virtual IRQs for its own purposes.
> > 	 */
> > 	svm->vmcb01.ptr->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;
> > 
> > 	if (is_guest_mode(&svm->vcpu)) {
> > 		svm->vmcb->control.int_ctl = (svm->vmcb->control.int_ctl & ~V_IRQ_MASK) |
> > 					     (svm->nested.ctl.int_ctl & V_IRQ_MASK);
> 
> Also V_INTR_PRIO_MASK, I think?

Ugh, yes.  And I think V_IGN_TPR as well?  I can't tell if that's a bug or not.
It looks like a bug.  AFAICT, svm_set_vintr() uses whatever V_IGN_TPR_MASK value
happens to be in vmcb02.  I don't see how that can be desirable.

> But otherwise yeah I think that's what the function is doing
> more-or-less.
> 
> > 		svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
> > 	} else {
> > 		WARN_ON_ONCE(svm->vmcb != svm->vmcb01.ptr);
> > 	}
> > 
> > 	svm_clr_intercept(svm, INTERCEPT_VINTR);
> > 	vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
> > 
> > > So maybe we don't really need to keep it up-to-date in the cache at all
> > > times.
> > 
> > Yeah, IMO that approach is unnecessarily convoluted.  The actual logic isn't all
> > that complex, all of the complexity comes from juggling state between the cache
> > and vmcb02 just so that the cache can be authoritative.  Given that we failed
> > miserably in actually making the cache authoritative, e.g. see the nested #VMEXIT
> > flow, I think we should kill the entire concept and instead maintain an *exact*
> 
> When you say *exact* snapshot, do you mean move all the sanitizing logic
> recently introduced in __nested_copy_vmcb_control_to_cache() (by Kevin
> and myself) to sanitize vmcb02 instead?

Oh, no, not _that_ exact.  More "unchanged after emulated VMRUN"

> That would be annoying. For example, for the intercepts, sanitizing
> vmcb02 (but not vmcb12) means that we need to also add checks in the
> exit path (i.e. in nested_svm_intercept() or even
> vmcb12_is_intercept()), as vmcb12 could have illegal intercepts.
> 
> > snapshot of vmcb12's controls, and then make vmcb02 authoritative.  We'll still
> > need the logic in nested_sync_control_from_vmcb02() to update int_ctl on save
> > or #VMEXIT if KVM is still intercepting VINTR for its own purposes, but at least
> > the code will be contained.
> 
> Yeah. We should leave it out of the vmcb12 cache though.

+1

> > Then to make it all but impossible to re-introduce this mess, do something like
> > this so that someone would have to go way out of their way to try and modify the
> > cache.
> > 
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index ebd7b36b1ceb..2de6305be9ce 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -199,14 +199,13 @@ struct svm_nested_state {
> >          * we cannot inject a nested vmexit yet.  */
> >         bool nested_run_pending;
> >  
> > -       /* cache for control fields of the guest */
> > -       struct vmcb_ctrl_area_cached ctl;
> > -
> >         /*
> > -        * Note: this struct is not kept up-to-date while L2 runs; it is only
> > -        * valid within nested_svm_vmrun.
> > +        * An opaque, read-only cache of vmcb12 controls, used to query L1's
> > +        * controls while running L2, e.g. to route intercepts appropriately.
> > +        * All reads are routed through accessors to make it all but impossible
> > +        * for KVM to clobber its snapshot of vmcb12.
> >          */
> > -       struct vmcb_save_area_cached save;
> 
> Is dropping the cached save area intentional?

Yes, I think we should try to drop it.  Assuming the comment is correct and it
really only 

> We can drop it and make it a local variable in nested_svm_vmrun(), and
> plumb it all the way down. But it could be too big for the stack.

It's 48 bytes, there's no way that's too big.

> Allocating it every time isn't nice either.

> Do you mean to also make it opaque?

I'd prefer to drop it.

> > +       u8 __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];
> 
> We have a lot of accesses to svm->nested.ctl, so we'll need a lot of
> clutter to cast the field in all of these places.
> 
> Maybe we add a read-only accessor that returns a pointer to a constant
> struct?

That's what I said :-D

	* All reads are routed through accessors to make it all but impossible
	* for KVM to clobber its snapshot of vmcb12.

There might be a lot of helpers, but I bet it's less than nVMX has for vmcs12.

> >         bool initialized;
> > 
> > > > Part of me wants to remove these two fields entirely:
> > > > 
> > > > 	/* cache for control fields of the guest */
> > > > 	struct vmcb_ctrl_area_cached ctl;
> > > > 
> > > > 	/*
> > > > 	 * Note: this struct is not kept up-to-date while L2 runs; it is only
> > > > 	 * valid within nested_svm_vmrun.
> > > > 	 */
> > > > 	struct vmcb_save_area_cached save;
> > > > 
> > > > and instead use "full" caches only for the duration of nested_svm_vmrun().  Or
> > > > hell, just copy the entire vmcb12 and throw the cached structures in the garbage.
> > > > But that'll probably end in a game of whack-a-mole as things get moved back in.
> > > 
> > > Yeah, KVM needs to keep some of the fields around :/
> > 
> > For me, that's totally fine.  As above, the problem I see is that there is no
> > single source of truth, i.e. that the authoritative state is spread across vmcb02
> > and the cache.
> > 
> > > > So rather than do something totally drastic, I think we should kill
> > > > nested_copy_vmcb_cache_to_control() and replace it with a "save control" flow.
> > > > And then have it share as much code as possible with nested_svm_vmexit(),
> > > > and fixup nested_svm_vmexit() to not pull from svm->nested.ctl unnecessarily.
> > > > Which, again AFAICT, is pretty much limited to int_ctl: either vmcb02 is
> > > > authoritative, or KVM shouldn't be updating vmcb12, and so only the "save control"
> > > > for KVM_GET_NESTED_STATE needs to copy from the cache to the migrated vmcb12.
> > > 
> > > I think this works if we draw a clear distinction between "in", "out", and
> > > "in/out" fields, which is not great because some fields (like int_ctl)
> > > have different directions for different bits :/
> > > 
> > > But if we do draw that distinction, and have helpers that copy fields
> > > based on direction, things become more intuitive:
> > > 
> > > During nested VMRUN, we use the "in" and "in/out" fields from cached
> > > vmcb12 to construct vmcb02 through nested_vmcb02_prepare_control().
> > > 
> > > During save, we save "in" fields from the cached vmcb12, "out" and
> > > "in/out" fields from vmcb02.
> > > 
> > > During restore, we use the "in" and "in/out" fields from the restored
> > > payload to construct vmcb02 through nested_vmcb02_prepare_control(), AND
> > > update the "out" fields as well from the payload.
> > 
> > Why the last part?  If L2 is active, then the pure "out" fields are guaranteed
> > to be written on nested #VMEXIT.  Anything else simply can't work.
> 
> I think it just so happens that all pure "out" fields are consumed by
> KVM before save+restore is possible, but it is possible for an "out"
> field to be used by KVM at a later point, or copied from vmcb02 to
> vmcb12 during nested #VMEXIT (e.g. if KVM exits to L1 directly after
> save+restore, before running L2).

Ya.

> next_rip and int_state fall in this bucket; they just happen to also be
> "in" fields.
> 
> For example, if support for decode assists is added, there will be cases
> where KVM just copies insn_bytes from vmcb02 to vmcb12 on nested
> #VMEXIT. If insn_bytes is lost on save+restore, and KVM immediately
> exits to L1 after restore, insn_bytes is lost.
> 
> So we need to also save+restore pure "out" fields, which we do not do
> today.

Hmm, strictly speaking, no.  We'd be fixing a bug that doesn't exist, yet.  But
the word yet...

> > > During synthesized #VMEXIT, we save the "out" and "in/out" fields from
> > > vmcb02 (shared part with save/restore).
> > 
> > Yeah, that all works.  We could also treat save() as an extension of #VMEXIT, but
> > that could make KVM_GET_NESTED_STATE non-idempotent (which might already be the
> > case for VMX?).  I.e. we could sync vmcb02 to vmcb12 (cache), and then copy that
> > to userspace.
> 
> I think this will require adding more fields to the cache, but wait, we
> already have a lot of "out" fields there but I don't think they are
> being used at all..
> 
> Anyway, this may make things simpler. Instead of pulling different
> fields from either cached vmcb12 and vmcb02, we always combine them
> first. I will keep that in mind.
> 
> > 
> > > The save/restore changes would need a flag to avoid restoring garbage
> > > from an older KVM.
> > 
> > I don't follow.  I was thinking we'd only change how KVM maintains authoritative
> > state while running L2, i.e. not make any changes (other than fixes) to the
> > serialized state for save/restore.
> 
> I thought we're not currently saving "out" fields at all, but
> apparently we are, we just do not use them in svm_set_nested_state(). So
> we probably do not need a flag. Even if some fields are not currently
> copied, I assume KVM restoring garbage from an older KVM is no worse
> than having uninitialized garbage :)

Heh, yep.

> I think this will be annoying when new fields are added, like
> insn_bytes. Perhaps at some point we move to just serializing the entire
> combined vmcb02/vmcb12 control area and add a flag for that.

If we do it now, can we avoid the flag?


* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-11  0:09           ` Sean Christopherson
@ 2026-02-11  0:27             ` Yosry Ahmed
  2026-02-11  0:39               ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Yosry Ahmed @ 2026-02-11  0:27 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

 
> > > > > In other words, AFAICT, nested.ctl.int_ctl is special in that KVM needs it to be
> > > > > up-to-date at all times, *and* it needs to be copied back to vmcb12 (or userspace).
> > > > 
> > > > Hmm actually looking at nested.ctl.int_ctl, I don't think it's that special.
> > > > Most KVM usages are checking "in" bits, i.e. whether some features (e.g.
> > > > vGIF) are enabled or not.
> > > >
> > > > The "out" bits seem to only be consumed by svm_clear_vintr(), and I
> > > > think this can be worked around.
> > > 
> > > OMG that code makes my head hurt.  Isn't that code just this?
> > > 
> > > 	/*
> > > 	 * Drop int_ctl fields related to VINTR injection.  If L2 is active,
> > > 	 * restore the virtual IRQ flag and its vector from vmcb12 now that KVM
> > > 	 * is done usurping virtual IRQs for its own purposes.
> > > 	 */
> > > 	svm->vmcb01.ptr->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;
> > > 
> > > 	if (is_guest_mode(&svm->vcpu)) {
> > > 		svm->vmcb->control.int_ctl = (svm->vmcb->control.int_ctl & ~V_IRQ_MASK) |
> > > 					     (svm->nested.ctl.int_ctl & V_IRQ_MASK);
> > 
> > Also V_INTR_PRIO_MASK, I think?
> 
> Ugh, yes.  And I think V_IGN_TPR as well?  I can't tell if that's a bug or not.
> It looks like a bug.  AFAICT, svm_set_vintr() uses whatever V_IGN_TPR_MASK value
> happens to be in vmcb02.  I don't see how that can be desirable.

Yeah I assumed that's not a bug (for some reason), in which case
restoring V_IGN_TPR_MASK is not needed.

> 
> > But otherwise yeah I think that's what the function is doing
> > more-or-less.
> > 
> > > 		svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
> > > 	} else {
> > > 		WARN_ON_ONCE(svm->vmcb != svm->vmcb01.ptr);
> > > 	}
> > > 
> > > 	svm_clr_intercept(svm, INTERCEPT_VINTR);
> > > 	vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
> > > 
> > > > So maybe we don't really need to keep it up-to-date in the cache at all
> > > > times.
> > > 
> > > Yeah, IMO that approach is unnecessarily convoluted.  The actual logic isn't all
> > > that complex, all of the complexity comes from juggling state between the cache
> > > and vmcb02 just so that the cache can be authoritative.  Given that we failed
> > > miserably in actually making the cache authoritative, e.g. see the nested #VMEXIT
> > > flow, I think we should kill the entire concept and instead maintain an *exact*
> > 
> > When you say *exact* snapshot, do you mean move all the sanitizing logic
> > recently introduced in __nested_copy_vmcb_control_to_cache() (by Kevin
> > and myself) to sanitize vmcb02 instead?
> 
> Oh, no, not _that_ exact.  More "unchanged after emulated VMRUN"

Whew.

> > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > index ebd7b36b1ceb..2de6305be9ce 100644
> > > --- a/arch/x86/kvm/svm/svm.h
> > > +++ b/arch/x86/kvm/svm/svm.h
> > > @@ -199,14 +199,13 @@ struct svm_nested_state {
> > >          * we cannot inject a nested vmexit yet.  */
> > >         bool nested_run_pending;
> > >  
> > > -       /* cache for control fields of the guest */
> > > -       struct vmcb_ctrl_area_cached ctl;
> > > -
> > >         /*
> > > -        * Note: this struct is not kept up-to-date while L2 runs; it is only
> > > -        * valid within nested_svm_vmrun.
> > > +        * An opaque, read-only cache of vmcb12 controls, used to query L1's
> > > +        * controls while running L2, e.g. to route intercepts appropriately.
> > > +        * All reads are routed through accessors to make it all but impossible
> > > +        * for KVM to clobber its snapshot of vmcb12.
> > >          */
> > > -       struct vmcb_save_area_cached save;
> > 
> > Is dropping the cached save area intentional?
> 
> Yes, I think we should try to drop it.  Assuming the comment is correct and it
> really only 
> 
> > We can drop it and make it a local variable in nested_svm_vmrun(), and
> > plumb it all the way down. But it could be too big for the stack.
> 
> It's 48 bytes, there's no way that's too big.

That's before my hardening series shoved everything in there. It's now
256 bytes, which is not huge, but makes me nervous. Especially that it
may grow more in the future.

> > Allocating it every time isn't nice either.
> 
> > Do you mean to also make it opaque?
> 
> I'd prefer to drop it.

Me too, but I am nervous about putting it on the stack.

> 
> > > +       u8 __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];
> > 
> > We have a lot of accesses to svm->nested.ctl, so we'll need a lot of
> > clutter to cast the field in all of these places.
> > 
> > Maybe we add a read-only accessor that returns a pointer to a constant
> > struct?
> 
> That's what I said :-D
> 
> 	* All reads are routed through accessors to make it all but impossible
> 	* for KVM to clobber its snapshot of vmcb12.
> 
> There might be a lot of helpers, but I bet it's less than nVMX has for vmcs12.

Oh I meant instead of having a lot of helpers, have a single helper that
returns it as a pointer to const struct vmcb_ctrl_area_cached? Then all
current users just switch to the helper instead of directly using
svm->nested.ctl.

We can even name it sth more intuitive like svm_cached_vmcb12_control().

> 
> > >         bool initialized;
> > > 
> > > > > Part of me wants to remove these two fields entirely:
> > > > > 
> > > > > 	/* cache for control fields of the guest */
> > > > > 	struct vmcb_ctrl_area_cached ctl;
> > > > > 
> > > > > 	/*
> > > > > 	 * Note: this struct is not kept up-to-date while L2 runs; it is only
> > > > > 	 * valid within nested_svm_vmrun.
> > > > > 	 */
> > > > > 	struct vmcb_save_area_cached save;
> > > > > 
> > > > > and instead use "full" caches only for the duration of nested_svm_vmrun().  Or
> > > > > hell, just copy the entire vmcb12 and throw the cached structures in the garbage.
> > > > > But that'll probably end in a game of whack-a-mole as things get moved back in.
> > > > 
> > > > Yeah, KVM needs to keep some of the fields around :/
> > > 
> > > For me, that's totally fine.  As above, the problem I see is that there is no
> > > single source of truth, i.e. that the authoritative state is spread across vmcb02
> > > and the cache.
> > > 
> > > > > So rather than do something totally drastic, I think we should kill
> > > > > nested_copy_vmcb_cache_to_control() and replace it with a "save control" flow.
> > > > > And then have it share as much code as possible with nested_svm_vmexit(),
> > > > > and fixup nested_svm_vmexit() to not pull from svm->nested.ctl unnecessarily.
> > > > > Which, again AFAICT, is pretty much limited to int_ctl: either vmcb02 is
> > > > > authoritative, or KVM shouldn't be updating vmcb12, and so only the "save control"
> > > > > for KVM_GET_NESTED_STATE needs to copy from the cache to the migrated vmcb12.
> > > > 
> > > > I think this works if we draw a clear distinction between "in", "out", and
> > > > "in/out" fields, which is not great because some fields (like int_ctl)
> > > > have different directions for different bits :/
> > > > 
> > > > But if we do draw that distinction, and have helpers that copy fields
> > > > based on direction, things become more intuitive:
> > > > 
> > > > During nested VMRUN, we use the "in" and "in/out" fields from cached
> > > > vmcb12 to construct vmcb02 through nested_vmcb02_prepare_control().
> > > > 
> > > > During save, we save "in" fields from the cached vmcb12, "out" and
> > > > "in/out" fields from vmcb02.
> > > > 
> > > > During restore, we use the "in" and "in/out" fields from the restored
> > > > payload to construct vmcb02 through nested_vmcb02_prepare_control(), AND
> > > > update the "out" fields as well from the payload.
> > > 
> > > Why the last part?  If L2 is active, then the pure "out" fields are guaranteed
> > > to be written on nested #VMEXIT.  Anything else simply can't work.
> > 
> > I think it just so happens that all pure "out" fields are consumed by
> > KVM before save+restore is possible, but it is possible for an "out"
> > field to be used by KVM at a later point, or copied from vmcb02 to
> > vmcb12 during nested #VMEXIT (e.g. if KVM exits to L1 directly after
> > save+restore, before running L2).
> 
> Ya.
> 
> > next_rip and int_state fall in this bucket; they just happen to also be
> > "in" fields.
> > 
> > For example, if support for decode assists is added, there will be cases
> > where KVM just copies insn_bytes from vmcb02 to vmcb12 on nested
> > #VMEXIT. If insn_bytes is lost on save+restore, and KVM immediately
> > exits to L1 after restore, insn_bytes is lost.
> > 
> > So we need to also save+restore pure "out" fields, which we do not do
> > today.
> 
> Hmm, strictly speaking, no.  We'd be fixing a bug that doesn't exist, yet.  But
> the word yet...
> 
> > > > During synthesized #VMEXIT, we save the "out" and "in/out" fields from
> > > > vmcb02 (shared part with save/restore).
> > > 
> > > Yeah, that all works.  We could also treat save() as an extension of #VMEXIT, but
> > > that could make KVM_GET_NESTED_STATE non-idempotent (which might already be the
> > > case for VMX?).  I.e. we could sync vmcb02 to vmcb12 (cache), and then copy that
> > > to userspace.
> > 
> > I think this will require adding more fields to the cache, but wait, we
> > already have a lot of "out" fields there but I don't think they are
> > being used at all..
> > 
> > Anyway, this may make things simpler. Instead of pulling different
> > fields from either cached vmcb12 and vmcb02, we always combine them
> > first. I will keep that in mind.
> > 
> > > 
> > > > The save/restore changes would need a flag to avoid restoring garbage
> > > > from an older KVM.
> > > 
> > > I don't follow.  I was thinking we'd only change how KVM maintains authoritative
> > > state while running L2, i.e. not make any changes (other than fixes) to the
> > > serialized state for save/restore.
> > 
> > I thought we're not currently saving "out" fields at all, but
> > apparently we are, we just do not use them in svm_set_nested_state(). So
> > we probably do not need a flag. Even if some fields are not currently
> > copied, I assume KVM restoring garbage from an older KVM is no worse
> > than having uninitialized garbage :)
> 
> Heh, yep.
> 
> > I think this will be annoying when new fields are added, like
> > insn_bytes. Perhaps at some point we move to just serializing the entire
> > combined vmcb02/vmcb12 control area and add a flag for that.
> 
> If we do it now, can we avoid the flag?

I don't think so. Fields like insn_bytes are not currently serialized at
all. The moment we need them, we'll probably need to add a flag, at
which point serializing everything under the flag would probably be the
sane thing to do.

That being said, I don't really know how a KVM that uses insn_bytes
should handle restoring from an older KVM that doesn't serialize it :/

Problem for the future, I guess :)


* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-11  0:27             ` Yosry Ahmed
@ 2026-02-11  0:39               ` Sean Christopherson
  2026-02-11  1:02                 ` Yosry Ahmed
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-02-11  0:39 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

On Wed, Feb 11, 2026, Yosry Ahmed wrote:
> > > We can drop it and make it a local variable in nested_svm_vmrun(), and
> > > plumb it all the way down. But it could be too big for the stack.
> > 
> > It's 48 bytes, there's no way that's too big.
> 
> That's before my hardening series shoved everything in there. It's now
> 256 bytes, which is not huge, but makes me nervous. Especially that it
> may grow more in the future.
> 
> > > Allocating it every time isn't nice either.
> > 
> > > Do you mean to also make it opaque?
> > 
> > I'd prefer to drop it.
> 
> Me too, but I am nervous about putting it on the stack.

256 bytes should be tolerable.  500+ is where things tend to get dicey.

> > > > +       u8 __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];
> > > 
> > > We have a lot of accesses to svm->nested.ctl, so we'll need a lot of
> > > clutter to cast the field in all of these places.
> > > 
> > > Maybe we add a read-only accessor that returns a pointer to a constant
> > > struct?
> > 
> > That's what I said :-D
> > 
> > 	* All reads are routed through accessors to make it all but impossible
> > 	* for KVM to clobber its snapshot of vmcb12.
> > 
> > There might be a lot of helpers, but I bet it's less than nVMX has for vmcs12.
> 
> Oh I meant instead of having a lot of helpers, have a single helper that
> returns it as a pointer to const struct vmcb_ctrl_area_cached? Then all
> current users just switch to the helper instead of directly using
> svm->nested.ctl.
> 
> We can even name it sth more intuitive like svm_cached_vmcb12_control().

That makes it too easy to do something like:


	u32 *int_ctl = svm_cached_vmcb12_control(xxx);

	*int_ctl |= xxx;

Which is what I want to defend against.

> > > I think this will be annoying when new fields are added, like
> > > insn_bytes. Perhaps at some point we move to just serializing the entire
> > > combined vmcb02/vmcb12 control area and add a flag for that.
> > 
> > If we do it now, can we avoid the flag?
> 
> I don't think so. Fields like insn_bytes are not currently serialized at
> all. The moment we need them, we'll probably need to add a flag, at
> which point serializing everything under the flag would probably be the
> sane thing to do.
> 
> That being said, I don't really know how a KVM that uses insn_bytes
> should handle restoring from an older KVM that doesn't serialize it :/
> 
> Problem for the future, I guess :)

Oh, good point.  In that case, I think it makes sense to add the flag asap, so
that _if_ it turns out that KVM needs to consume a field that isn't currently
saved/restored, we'll at least have a better story for KVM's that save/restore
everything.


* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-11  0:39               ` Sean Christopherson
@ 2026-02-11  1:02                 ` Yosry Ahmed
  2026-02-21  0:03                   ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Yosry Ahmed @ 2026-02-11  1:02 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

On Tue, Feb 10, 2026 at 04:39:28PM -0800, Sean Christopherson wrote:
> On Wed, Feb 11, 2026, Yosry Ahmed wrote:
> > > > We can drop it and make it a local variable in nested_svm_vmrun(), and
> > > > plumb it all the way down. But it could be too big for the stack.
> > > 
> > > It's 48 bytes, there's no way that's too big.
> > 
> > That's before my hardening series shoved everything in there. It's now
> > 256 bytes, which is not huge, but makes me nervous. Especially that it
> > may grow more in the future.
> > 
> > > > Allocating it every time isn't nice either.
> > > 
> > > > Do you mean to also make it opaque?
> > > 
> > > I'd prefer to drop it.
> > 
> > Me too, but I am nervous about putting it on the stack.
> 
> 256 bytes should be tolerable.  500+ is where things tend to get dicey.

In that case I think removing it completely should be fine.

> 
> > > > > +       u8 __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];
> > > > 
> > > > We have a lot of accesses to svm->nested.ctl, so we'll need a lot of
> > > > clutter to cast the field in all of these places.
> > > > 
> > > > Maybe we add a read-only accessor that returns a pointer to a constant
> > > > struct?
> > > 
> > > That's what I said :-D
> > > 
> > > 	* All reads are routed through accessors to make it all but impossible
> > > 	* for KVM to clobber its snapshot of vmcb12.
> > > 
> > > There might be a lot of helpers, but I bet it's less than nVMX has for vmcs12.
> > 
> > Oh I meant instead of having a lot of helpers, have a single helper that
> > returns it as a pointer to const struct vmcb_ctrl_area_cached? Then all
> > current users just switch to the helper instead of directly using
> > svm->nested.ctl.
> > 
> > We can even name it sth more intuitive like svm_cached_vmcb12_control().
> 
> That makes it too easy to do something like:
> 
> 
> 	u32 *int_ctl = svm_cached_vmcb12_control(xxx);
> 
> 	*int_ctl |= xxx;
> 
> Which is what I want to defend against.

Do compilers allow implicit dropping of const qualifiers?

Building with this diff fails for me:

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd..0a73dd8f9163 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1343,10 +1343,17 @@ static void nested_svm_triple_fault(struct kvm_vcpu *vcpu)
        nested_svm_simple_vmexit(to_svm(vcpu), SVM_EXIT_SHUTDOWN);
 }

+static const struct vmcb_ctrl_area_cached *svm_cached_vmcb12_control(struct vcpu_svm *svm) {
+       return &svm->nested.ctl;
+}
+
 int svm_allocate_nested(struct vcpu_svm *svm)
 {
+       struct vmcb_ctrl_area_cached *ctl = svm_cached_vmcb12_control(svm);
        struct page *vmcb02_page;

+       pr_info("%p\n", ctl);
+
        if (svm->nested.initialized)
                return 0;


I see:

arch/x86/kvm/svm/nested.c:1352:32: error: initializing 'struct vmcb_ctrl_area_cached *' with an expression of type 'const struct vmcb_ctrl_area_cached *' discards qualifiers [-Werror,-Wincompatible-pointer-types-discards-qualifiers]
 1352 |         struct vmcb_ctrl_area_cached *ctl = svm_cached_vmcb12_control(svm);
      |                                       ^     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

I don't explicitly see 'incompatible-pointer-types-discards-qualifiers'
anywhere, but I do see 'incompatible-pointer-types' in
scripts/Makefile.warn:

KBUILD_CFLAGS += $(call cc-option,-Werror=incompatible-pointer-types)

Is this sufficient?

> 
> > > > I think this will be annoying when new fields are added, like
> > > > insn_bytes. Perhaps at some point we move to just serializing the entire
> > > > combined vmcb02/vmcb12 control area and add a flag for that.
> > > 
> > > If we do it now, can we avoid the flag?
> > 
> > I don't think so. Fields like insn_bytes are not currently serialized at
> > all. The moment we need them, we'll probably need to add a flag, at
> > which point serializing everything under the flag would probably be the
> > sane thing to do.
> > 
> > That being said, I don't really know how a KVM that uses insn_bytes
> > should handle restoring from an older KVM that doesn't serialize it :/
> > 
> > Problem for the future, I guess :)
> 
> Oh, good point.  In that case, I think it makes sense to add the flag asap, so
> that _if_ it turns out that KVM needs to consume a field that isn't currently
> saved/restored, we'll at least have a better story for KVM's that save/restore
> everything.

Not sure I follow. Do you mean start serializing everything and setting
the flag ASAP (which IIUC would be after the rework we discussed), or
what do you mean by "add the flag"?

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-11  1:02                 ` Yosry Ahmed
@ 2026-02-21  0:03                   ` Sean Christopherson
  2026-02-21  9:11                     ` Yosry Ahmed
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-02-21  0:03 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Paolo Bonzini, kvm, linux-kernel, stable

On Wed, Feb 11, 2026, Yosry Ahmed wrote:
> > > > > > +       u8 __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];
> > > > > 
> > > > > We have a lot of accesses to svm->nested.ctl, so we'll need a lot of
> > > > > clutter to cast the field in all of these places.
> > > > > 
> > > > > Maybe we add a read-only accessor that returns a pointer to a constant
> > > > > struct?
> > > > 
> > > > That's what I said :-D
> > > > 
> > > > 	* All reads are routed through accessors to make it all but impossible
> > > > 	* for KVM to clobber its snapshot of vmcb12.
> > > > 
> > > > There might be a lot of helpers, but I bet it's less than nVMX has for vmcs12.
> > > 
> > > Oh I meant instead of having a lot of helpers, have a single helper that
> > > returns it as a pointer to const struct vmcb_ctrl_area_cached? Then all
> > > current users just switch to the helper instead of directly using
> > > svm->nested.ctl.
> > > 
> > > We can even name it sth more intuitive like svm_cached_vmcb12_control().
> > 
> > That makes it too easy to do something like:
> > 
> > 
> > 	u32 *int_ctl = svm_cached_vmcb12_control(xxx);
> > 
> > 	*int_ctl |= xxx;
> > 
> > Which is what I want to defend against.
> 
> Do compilers allow implicit dropping of const qualifiers?

Nope, not with the kernel's build flags.

> Building with this diff fails for me:
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index de90b104a0dd..0a73dd8f9163 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1343,10 +1343,17 @@ static void nested_svm_triple_fault(struct kvm_vcpu *vcpu)
>         nested_svm_simple_vmexit(to_svm(vcpu), SVM_EXIT_SHUTDOWN);
>  }
> 
> +static const struct vmcb_ctrl_area_cached *svm_cached_vmcb12_control(struct vcpu_svm *svm) {
> +       return &svm->nested.ctl;
> +}

...

> Is this sufficient?

It's certainly better, but unless a sea of helpers is orders of magnitude worse,
I would prefer to make it even harder to put a hole in our foot.

E.g. unless we're hyper diligent about constifying everything, it's not _that_
hard to imagine a chain of events where we end up with a "live" pointer to the
cache.

  1. A helper like __nested_vmcb_check_controls() isn't const, so we cast to strip
     the const.

  2. Someone "improves" the code by grabbing the non-const variable to pass it
     into other helpers.

  3. The non-const variable is used to update the cache for whatever reason, and
     it works 99.9% of the time, until it doesn't.

Now, I don't think that's at all likely to happen, but as the years pile on and
developers come and go, the probability of introducing a goof goes up, bit by bit.

> > > > > I think this will be annoying when new fields are added, like
> > > > > insn_bytes. Perhaps at some point we move to just serializing the entire
> > > > > combined vmcb02/vmcb12 control area and add a flag for that.
> > > > 
> > > > If we do it now, can we avoid the flag?
> > > 
> > > I don't think so. Fields like insn_bytes are not currently serialized at
> > > all. The moment we need them, we'll probably need to add a flag, at
> > > which point serializing everything under the flag would probably be the
> > > sane thing to do.
> > > 
> > > That being said, I don't really know how a KVM that uses insn_bytes
> > > should handle restoring from an older KVM that doesn't serialize it :/
> > > 
> > > Problem for the future, I guess :)
> > 
> > Oh, good point.  In that case, I think it makes sense to add the flag asap, so
> > that _if_ it turns out that KVM needs to consume a field that isn't currently
> > saved/restored, we'll at least have a better story for KVM's that save/restore
> > everything.
> 
> Not sure I follow. Do you mean start serializing everything and setting
> the flag ASAP (which IIUC would be after the rework we discussed), 

Yep.

> or what do you mean by "add the flag"?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-21  0:03                   ` Sean Christopherson
@ 2026-02-21  9:11                     ` Yosry Ahmed
  2026-02-23 16:59                       ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Yosry Ahmed @ 2026-02-21  9:11 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yosry Ahmed, Paolo Bonzini, kvm, linux-kernel, stable

[..]
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index de90b104a0dd..0a73dd8f9163 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1343,10 +1343,17 @@ static void nested_svm_triple_fault(struct kvm_vcpu *vcpu)
> >         nested_svm_simple_vmexit(to_svm(vcpu), SVM_EXIT_SHUTDOWN);
> >  }
> > 
> > +static const struct vmcb_ctrl_area_cached *svm_cached_vmcb12_control(struct vcpu_svm *svm) {
> > +       return &svm->nested.ctl;
> > +}
> 
> ...
> 
> > Is this sufficient?
> 
> It's certainly better, but unless a sea of helpers is orders of magnitude worse,
> I would prefer to make it even harder to put a hole in our foot.
> 
> E.g. unless we're hyper diligent about constifying everything, it's not _that_
> hard to imagine a chain of events where we end up with a "live" pointer to the
> cache.
> 
>   1. A helper like __nested_vmcb_check_controls() isn't const, so we cast to strip
>      the const.

I would argue that casting to strip the const is a red flag and this
scenario should have stopped here :P

That being said, as you said, it depends on how many helpers we'll
actually need.

> 
>   2. Someone "improves" the code by grabbing the non-const variable to pass it
>      into other helpers.
> 
>   3. The non-const variable is used to update the cache for whatever reason, and
>      it works 99.9% of the time, until it doesn't.
> 
> Now, I don't think that's at all likely to happen, but as the years pile on and
> developers come and go, the probability of introducing a goof goes up, bit by bit.
> 
> > > > > > I think this will be annoying when new fields are added, like
> > > > > > insn_bytes. Perhaps at some point we move to just serializing the entire
> > > > > > combined vmcb02/vmcb12 control area and add a flag for that.
> > > > > 
> > > > > If we do it now, can we avoid the flag?
> > > > 
> > > > I don't think so. Fields like insn_bytes are not currently serialized at
> > > > all. The moment we need them, we'll probably need to add a flag, at
> > > > which point serializing everything under the flag would probably be the
> > > > sane thing to do.
> > > > 
> > > > That being said, I don't really know how a KVM that uses insn_bytes
> > > > should handle restoring from an older KVM that doesn't serialize it :/
> > > > 
> > > > Problem for the future, I guess :)
> > > 
> > > Oh, good point.  In that case, I think it makes sense to add the flag asap, so
> > > that _if_ it turns out that KVM needs to consume a field that isn't currently
> > > saved/restored, we'll at least have a better story for KVM's that save/restore
> > > everything.
> > 
> > Not sure I follow. Do you mean start serializing everything and setting
> > the flag ASAP (which IIUC would be after the rework we discussed), 
> 
> Yep.

I don't think it matters that much when we start doing this. In all
cases:

1. KVM will need to be backward-compatible.

2. Any new features that depend on save+restore of those fields will be
> in a new KVM that does the 'full' save+restore (assuming we don't let
people add per-field flags).

The only scenario that I can think of is if a feature can be enabled at
runtime, and we want to be able to enable it for a running VM after
migrating from an old KVM to a new KVM. Not sure how likely this is.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-21  9:11                     ` Yosry Ahmed
@ 2026-02-23 16:59                       ` Sean Christopherson
  2026-02-23 17:21                         ` Yosry Ahmed
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-02-23 16:59 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Yosry Ahmed, Paolo Bonzini, kvm, linux-kernel, stable

On Sat, Feb 21, 2026, Yosry Ahmed wrote:
> [..]
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index de90b104a0dd..0a73dd8f9163 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -1343,10 +1343,17 @@ static void nested_svm_triple_fault(struct kvm_vcpu *vcpu)
> > >         nested_svm_simple_vmexit(to_svm(vcpu), SVM_EXIT_SHUTDOWN);
> > >  }
> > > 
> > > +static const struct vmcb_ctrl_area_cached *svm_cached_vmcb12_control(struct vcpu_svm *svm) {
> > > +       return &svm->nested.ctl;
> > > +}
> > 
> > ...
> > 
> > > Is this sufficient?
> > 
> > It's certainly better, but unless a sea of helpers is orders of magnitude worse,
> > I would prefer to make it even harder to put a hole in our foot.
> > 
> > E.g. unless we're hyper diligent about constifying everything, it's not _that_
> > hard to imagine a chain of events where we end up with a "live" pointer to the
> > cache.
> > 
> >   1. A helper like __nested_vmcb_check_controls() isn't const, so we cast to strip
> >      the const.
> 
> I would argue that casting to strip the const is a red flag and this
> scenario should have stopped here :P

Key word "should" :-)

> > > > Oh, good point.  In that case, I think it makes sense to add the flag asap, so
> > > > that _if_ it turns out that KVM needs to consume a field that isn't currently
> > > > saved/restored, we'll at least have a better story for KVM's that save/restore
> > > > everything.
> > > 
> > > Not sure I follow. Do you mean start serializing everything and setting
> > > the flag ASAP (which IIUC would be after the rework we discussed), 
> > 
> > Yep.
> 
> I don't think it matters that much when we start doing this. In all
> cases:
> 
> 1. KVM will need to be backward-compatible.
> 
> 2. Any new features that depend on save+restore of those fields will be
> > in a new KVM that does the 'full' save+restore (assuming we don't let
> people add per-field flags).
> 
> The only scenario that I can think of is if a feature can be enabled at
> runtime, and we want to be able to enable it for a running VM after
> migrating from an old KVM to a new KVM. Not sure how likely this is.

The scenario I'm thinking of is where we belatedly realize we should have been
saving+restoring a field for a feature that is already supported, e.g. gpat.  If
KVM saves+restores everything, then we don't have to come up with a hacky solution
for older KVM, because it already provides the desired behavior for the "save",
only the "restore" for the older KVM is broken.

Does that make sense?  It makes sense in my head, but I'm not sure I communicated
the idea very well...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-23 16:59                       ` Sean Christopherson
@ 2026-02-23 17:21                         ` Yosry Ahmed
  2026-02-23 20:23                           ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Yosry Ahmed @ 2026-02-23 17:21 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yosry Ahmed, Paolo Bonzini, kvm, linux-kernel, stable

> > > > > Oh, good point.  In that case, I think it makes sense to add the flag asap, so
> > > > > that _if_ it turns out that KVM needs to consume a field that isn't currently
> > > > > saved/restored, we'll at least have a better story for KVM's that save/restore
> > > > > everything.
> > > >
> > > > Not sure I follow. Do you mean start serializing everything and setting
> > > > the flag ASAP (which IIUC would be after the rework we discussed),
> > >
> > > Yep.
> >
> > I don't think it matters that much when we start doing this. In all
> > cases:
> >
> > 1. KVM will need to be backward-compatible.
> >
> > 2. Any new features that depend on save+restore of those fields will be
> > a in a new KVM that does the 'full' save+restore (assuming we don't let
> > people add per-field flags).
> >
> > The only scenario that I can think of is if a feature can be enabled at
> > runtime, and we want to be able to enable it for a running VM after
> > migrating from an old KVM to a new KVM. Not sure how likely this is.
>
> The scenario I'm thinking of is where we belatedly realize we should have been
> saving+restoring a field for a feature that is already supported, e.g. gpat.  If
> KVM saves+restores everything, then we don't have to come up with a hacky solution
> for older KVM, because it already provides the desired behavior for the "save",
> only the "restore" for the older KVM is broken.
>
> Does that make sense?  It makes sense in my head, but I'm not sure I communicated
> the idea very well...

Kinda? What I am getting at is that we'll always have an old KVM that
doesn't save everything that we'll need to handle. I think the
scenario you have in mind is where we introduce a feature *after* we
start saving everything, and at a later point realize we didn't add
proper "restore" support, but the "save" support must have always been
there.

gPAT is not a good example because it's in the "save" area :P

But yeah, I see your point. It's not very straightforward now because
what we save comes from the cache, and we only cache what we need for
the current set of features. So this will need to be done on top of
the rework we've been discussing, where vmcb02 starts being the source
of truth instead of the cache, then we just (mostly) save vmcb02's
control area as-is.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2
  2026-02-23 17:21                         ` Yosry Ahmed
@ 2026-02-23 20:23                           ` Sean Christopherson
  0 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2026-02-23 20:23 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Yosry Ahmed, Paolo Bonzini, kvm, linux-kernel, stable

On Mon, Feb 23, 2026, Yosry Ahmed wrote:
> > > > > > Oh, good point.  In that case, I think it makes sense to add the flag asap, so
> > > > > > that _if_ it turns out that KVM needs to consume a field that isn't currently
> > > > > > saved/restored, we'll at least have a better story for KVM's that save/restore
> > > > > > everything.
> > > > >
> > > > > Not sure I follow. Do you mean start serializing everything and setting
> > > > > the flag ASAP (which IIUC would be after the rework we discussed),
> > > >
> > > > Yep.
> > >
> > > I don't think it matters that much when we start doing this. In all
> > > cases:
> > >
> > > 1. KVM will need to be backward-compatible.
> > >
> > > 2. Any new features that depend on save+restore of those fields will be
> > > in a new KVM that does the 'full' save+restore (assuming we don't let
> > > people add per-field flags).
> > >
> > > The only scenario that I can think of is if a feature can be enabled at
> > > runtime, and we want to be able to enable it for a running VM after
> > > migrating from an old KVM to a new KVM. Not sure how likely this is.
> >
> > The scenario I'm thinking of is where we belatedly realize we should have been
> > saving+restoring a field for a feature that is already supported, e.g. gpat.  If
> > KVM saves+restores everything, then we don't have to come up with a hacky solution
> > for older KVM, because it already provides the desired behavior for the "save",
> > only the "restore" for the older KVM is broken.
> >
> > Does that make sense?  It makes sense in my head, but I'm not sure I communicated
> > the idea very well...
> 
> Kinda? What I am getting at is that we'll always have an old KVM that
> doesn't save everything that we'll need to handle. I think the
> scenario you have in mind is where we introduce a feature *after* we
> start saving everything, and at a later point realize we didn't add
> proper "restore" support, but the "save" support must have always been
> there.

Yes.

> gPAT is not a good example because it's in the "save" area :P

Oh, I don't think I realized this is only talking about the control area.  Or I
probably forgot.

> But yeah, I see your point. It's not very straightforward now because
> what we save comes from the cache, and we only cache what we need for
> the current set of features. So this will need to be done on top of
> the rework we've been discussing, where vmcb02 starts being the source
> of truth instead of the cache, then we just (mostly) save vmcb02's
> control area as-is.

Yeah, it's not urgent by any means.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2026-02-23 20:24 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20260210005449.3125133-1-yosry.ahmed@linux.dev>
2026-02-10  0:54 ` [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2 Yosry Ahmed
2026-02-10  1:49   ` Sean Christopherson
2026-02-10 16:25     ` Yosry Ahmed
2026-02-10 19:20       ` Sean Christopherson
2026-02-10 22:19         ` Yosry Ahmed
2026-02-11  0:09           ` Sean Christopherson
2026-02-11  0:27             ` Yosry Ahmed
2026-02-11  0:39               ` Sean Christopherson
2026-02-11  1:02                 ` Yosry Ahmed
2026-02-21  0:03                   ` Sean Christopherson
2026-02-21  9:11                     ` Yosry Ahmed
2026-02-23 16:59                       ` Sean Christopherson
2026-02-23 17:21                         ` Yosry Ahmed
2026-02-23 20:23                           ` Sean Christopherson
2026-02-10  0:54 ` [PATCH 2/4] KVM: nSVM: Sync int_state " Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox