[PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced
@ 2025-10-27 23:06 Omar Sandoval
  2025-10-28  0:24 ` Sean Christopherson
  0 siblings, 1 reply; 4+ messages in thread
From: Omar Sandoval @ 2025-10-27 23:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, kvm; +Cc: Gregory Price, kernel-team

From: Omar Sandoval <osandov@fb.com>

We've been seeing guest VM kernel panics with "Oops: int3" coming from
static branch checks. I found that the reported RIP is one instruction
after where it's supposed to be, and I determined that when this
happens, the RIP was injected by __svm_skip_emulated_instruction() on
the host when a nested page fault was hit while vectoring an int3.

Static branches use the smp_text_poke mechanism, which works by
temporarily inserting an int3, then overwriting it. In the meantime,
smp_text_poke_int3_handler() relies on getting an accurate RIP.

This temporary int3 from smp_text_poke is triggering the exact scenario
described in the fixes commit: "if the guest executes INTn (no KVM
injection), an exit occurs while vectoring the INTn, and the guest
modifies the code stream while the exit is being handled, KVM will
compute the incorrect next_rip due to "skipping" the wrong instruction."

I'm able to reproduce this almost instantly by patching the guest kernel
to hammer on a static branch [1] while a drgn script [2] on the host
constantly swaps out the memory containing the guest's Task State
Segment.

The fixes commit also suggests a workaround: "A future enhancement to
make this less awful would be for KVM to detect that the decoded
instruction is not the correct INTn and drop the to-be-injected soft
event." This implements that workaround.

[1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
[2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b

Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Based on Linus's current tree.

 arch/x86/kvm/svm/svm.c | 40 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 153c12dbf3eb..4d72c308b50b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -271,8 +271,29 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
 
 }
 
+static bool emulated_instruction_matches_vector(struct kvm_vcpu *vcpu,
+						unsigned int vector)
+{
+	switch (vector) {
+	case BP_VECTOR:
+		return vcpu->arch.emulate_ctxt->b == 0xcc ||
+		       (vcpu->arch.emulate_ctxt->b == 0xcd &&
+			vcpu->arch.emulate_ctxt->src.val == BP_VECTOR);
+	case OF_VECTOR:
+		return vcpu->arch.emulate_ctxt->b == 0xce;
+	default:
+		return false;
+	}
+}
+
+/*
+ * If vector != 0, then this skips the instruction only if the instruction could
+ * generate an interrupt with that vector. If not, then it fails, indicating
+ * that the instruction should be retried.
+ */
 static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
-					   bool commit_side_effects)
+					   bool commit_side_effects,
+					   unsigned int vector)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned long old_rflags;
@@ -293,8 +314,18 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
 		if (unlikely(!commit_side_effects))
 			old_rflags = svm->vmcb->save.rflags;
 
-		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
+		if (vector == 0) {
+			if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
+				return 0;
+		} else if (x86_decode_emulated_instruction(vcpu, EMULTYPE_SKIP,
+							   NULL,
+							   0) != EMULATION_OK ||
+			   !emulated_instruction_matches_vector(vcpu, vector) ||
+			   !kvm_emulate_instruction(vcpu,
+						    EMULTYPE_SKIP |
+						    EMULTYPE_NO_DECODE)) {
 			return 0;
+		}
 
 		if (unlikely(!commit_side_effects))
 			svm->vmcb->save.rflags = old_rflags;
@@ -311,7 +342,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
 
 static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
-	return __svm_skip_emulated_instruction(vcpu, true);
+	return __svm_skip_emulated_instruction(vcpu, true, 0);
 }
 
 static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
@@ -331,7 +362,8 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
 	 * in use, the skip must not commit any side effects such as clearing
 	 * the interrupt shadow or RFLAGS.RF.
 	 */
-	if (!__svm_skip_emulated_instruction(vcpu, !nrips))
+	if (!__svm_skip_emulated_instruction(vcpu, !nrips,
+					     vcpu->arch.exception.vector))
 		return -EIO;
 
 	rip = kvm_rip_read(vcpu);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced
  2025-10-27 23:06 [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced Omar Sandoval
@ 2025-10-28  0:24 ` Sean Christopherson
  2025-10-28  6:44   ` Omar Sandoval
  0 siblings, 1 reply; 4+ messages in thread
From: Sean Christopherson @ 2025-10-28  0:24 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Paolo Bonzini, kvm, Gregory Price, kernel-team

"INT3/INTO", because the code handles both.

On Mon, Oct 27, 2025, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We've been seeing guest VM kernel panics with "Oops: int3" coming from

No pronouns please, "we" relies too much on context and that context can be lost
over time, or might be interpreted differently by readers, etc.

> static branch checks. I found that the reported RIP is one instruction
> after where it's supposed to be, and I determined that when this
> happens, the RIP was injected by __svm_skip_emulated_instruction() on
> the host when a nested page fault was hit while vectoring an int3.

I definitely appreciate the gory details, but please capture just the technical
details of the bug and fix.  A play-by-play of the debugging process is useful
(and interesting!) for bug reports and cover letters, but it's mostly noise for
future readers that usually care about what was changed and/or what the code is
doing.

> Static branches use the smp_text_poke mechanism, which works by
> temporarily inserting an int3, then overwriting it. In the meantime,
> smp_text_poke_int3_handler() relies on getting an accurate RIP.
> 
> This temporary int3 from smp_text_poke is triggering the exact scenario
> described in the fixes commit: "if the guest executes INTn (no KVM
> injection), an exit occurs while vectoring the INTn, and the guest
> modifies the code stream while the exit is being handled, KVM will
> compute the incorrect next_rip due to "skipping" the wrong instruction."
> 
> I'm able to reproduce this almost instantly by patching the guest kernel
> to hammer on a static branch [1] while a drgn script [2] on the host
> constantly swaps out the memory containing the guest's Task State
> Segment.

I would actually just say "TSS".  Normally I'm all for spelling out acronyms on
their first use, but in this case I think TSS is more obviously a "thing" than
Task State Segment, e.g. I had a momentary brain fart and was wondering what you
meant by swapping out a segment :-).

> The fixes commit also suggests a workaround: "A future enhancement to
> make this less awful would be for KVM to detect that the decoded
> instruction is not the correct INTn and drop the to-be-injected soft
> event." This implements that workaround.

Please focus on the actual change.  Again, I love the detail, but that was a _lot_
to get through just to see "Sorry, but the princess is in another castle." :-D

E.g. I think this captures everything in about the same word count, but with a
stronger focus on what the patch is actually doing.

  When re-injecting an soft interrupt from an INT3, INT0, or (select) INTn
  instruction, discard the exception and retry the instruction if the code
  stream is changed (e.g. by a different vCPU) between when the CPU executes
  the instruction and when KVM decodes the instruction to get the next RIP.

  As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject
  INT3/INTO instead of retrying the instruction"), failure to verify the
  correct INTn instruction was decoded can effectively clobber guest state
  due to decoding the wrong instruction and thus specifying the wrong next
  RIP.

  The bug most often manifests as "Oops: int3" panics on static branch
  checks when running Linux guests.  Linux's static branch patching uses
  he kernel's "text poke" code patching   mechanism.  To modify code while
  other CPUs may be executing said code, Linux (temporarily) replaces the
  first byte of the original instruction with an int3 (opcode 0xcc), then
  patches in the new code stream except for the first byte, and finally
  replaces the int3 the first byte of the new code stream.  If a CPU hits
  the int3, i.e. executes the code while it's being modified, the guest
  kernel will gracefully handle the resulting #BP, e.g. by looping until
  the int3 is replaced, by emulating the new instruction, etc.

  E.g. the bug reproduces almost instantly by hacking the guest kernel to
  provide a convenient static branch[1], while running a drgn script[2] on
  the host to constantly swap out the memory containing the guest's TSS.

> [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
> [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
> 
> Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")

Cc: stable@vger.kernel.org

> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> Based on Linus's current tree.
> 
>  arch/x86/kvm/svm/svm.c | 40 ++++++++++++++++++++++++++++++++++++----
>  1 file changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 153c12dbf3eb..4d72c308b50b 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -271,8 +271,29 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
>  
>  }
>  
> +static bool emulated_instruction_matches_vector(struct kvm_vcpu *vcpu,
> +						unsigned int vector)
> +{
> +	switch (vector) {
> +	case BP_VECTOR:
> +		return vcpu->arch.emulate_ctxt->b == 0xcc ||
> +		       (vcpu->arch.emulate_ctxt->b == 0xcd &&
> +			vcpu->arch.emulate_ctxt->src.val == BP_VECTOR);
> +	case OF_VECTOR:
> +		return vcpu->arch.emulate_ctxt->b == 0xce;

Hmm, unless I'm missing something, this should handle 0xcd for both.

> +	default:
> +		return false;
> +	}
> +}
> +
> +/*
> + * If vector != 0, then this skips the instruction only if the instruction could
> + * generate an interrupt with that vector. If not, then it fails, indicating
> + * that the instruction should be retried.
> + */
>  static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> -					   bool commit_side_effects)
> +					   bool commit_side_effects,
> +					   unsigned int vector)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  	unsigned long old_rflags;
> @@ -293,8 +314,18 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
>  		if (unlikely(!commit_side_effects))
>  			old_rflags = svm->vmcb->save.rflags;
>  
> -		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> +		if (vector == 0) {
> +			if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> +				return 0;
> +		} else if (x86_decode_emulated_instruction(vcpu, EMULTYPE_SKIP,
> +							   NULL,
> +							   0) != EMULATION_OK ||

I think I would rather handle this in kvm_emulate_instruction(), not in the SVM
code.  x86_decode_emulated_instruction() really shouldn't be exposed outside of
x86.c, the use in gp_interception() is all kinds of gross. :-/

The best idea I can come up with is to add an EMULTYPE_SKIP_SOFT_INT, and use that
to communicate to the kvm_emulate_instruction() that it should verify the
to-be-skipped instruction.  I don't love the implicit passing of the vector via
vcpu->arch.exception.vector, but all of this code is quite heinous, so I won't
loose sleep over it.

Compile-tested only, but something like this?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2bfae1cfa514..eca4704e3934 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2154,6 +2154,7 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 #define EMULTYPE_PF                (1 << 6)
 #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
 #define EMULTYPE_WRITE_PF_TO_SP            (1 << 8)
+#define EMULTYPE_SKIP_SOFT_INT     ((1 << 9) | EMULTYPE_SKIP)
 
 static inline bool kvm_can_emulate_event_vectoring(int emul_type)
 {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f14709a511aa..82e97f2f8635 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -272,6 +272,7 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
 }
 
 static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
+                                          int emul_type,
                                           bool commit_side_effects)
 {
        struct vcpu_svm *svm = to_svm(vcpu);
@@ -293,7 +294,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
                if (unlikely(!commit_side_effects))
                        old_rflags = svm->vmcb->save.rflags;
 
-               if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
+               if (!kvm_emulate_instruction(vcpu, emul_type))
                        return 0;
 
                if (unlikely(!commit_side_effects))
@@ -311,7 +312,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
 
 static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
-       return __svm_skip_emulated_instruction(vcpu, true);
+       return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true);
 }
 
 static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
@@ -331,7 +332,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
         * in use, the skip must not commit any side effects such as clearing
         * the interrupt shadow or RFLAGS.RF.
         */
-       if (!__svm_skip_emulated_instruction(vcpu, !nrips))
+       if (!__svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP_SOFT_INT, !nrips))
                return -EIO;
 
        rip = kvm_rip_read(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 593fccc9cf1c..500f9b7f564e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9351,6 +9351,23 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
        return false;
 }
 
+static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt, u8 vector)
+{
+       if (WARN_ON_ONCE(vector != BP_VECTOR && vector != OF_VECTOR))
+               return false;
+
+       switch (ctxt->b) {
+       case 0xcc:
+               return vector == BP_VECTOR;
+       case 0xcd:
+               return vector == ctxt->src.val;
+       case 0xce:
+               return vector == OF_VECTOR;
+       default:
+               return false;
+       }
+}
+
 /*
  * Decode an instruction for emulation.  The caller is responsible for handling
  * code breakpoints.  Note, manually detecting code breakpoints is unnecessary
@@ -9461,6 +9478,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
         * injecting single-step #DBs.
         */
        if (emulation_type & EMULTYPE_SKIP) {
+               if ((emulation_type & EMULTYPE_SKIP_SOFT_INT) &&
+                   !is_soft_int_instruction(ctxt, vcpu->arch.exception.vector))
+                       return 0;
+
                if (ctxt->mode != X86EMUL_MODE_PROT64)
                        ctxt->eip = (u32)ctxt->_eip;
                else

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced
  2025-10-28  0:24 ` Sean Christopherson
@ 2025-10-28  6:44   ` Omar Sandoval
  2025-10-29  0:18     ` Sean Christopherson
  0 siblings, 1 reply; 4+ messages in thread
From: Omar Sandoval @ 2025-10-28  6:44 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, Gregory Price, kernel-team

On Mon, Oct 27, 2025 at 05:24:47PM -0700, Sean Christopherson wrote:
> "INT3/INTO", because the code handles both.
> 
> On Mon, Oct 27, 2025, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > We've been seeing guest VM kernel panics with "Oops: int3" coming from
> 
> No pronouns please, "we" relies too much on context and that context can be lost
> over time, or might be interpreted differently by readers, etc.
> 
> > static branch checks. I found that the reported RIP is one instruction
> > after where it's supposed to be, and I determined that when this
> > happens, the RIP was injected by __svm_skip_emulated_instruction() on
> > the host when a nested page fault was hit while vectoring an int3.
> 
> I definitely appreciate the gory details, but please capture just the technical
> details of the bug and fix.  A play-by-play of the debugging process is useful
> (and interesting!) for bug reports and cover letters, but it's mostly noise for
> future readers that usually care about what was changed and/or what the code is
> doing.
> 
> > Static branches use the smp_text_poke mechanism, which works by
> > temporarily inserting an int3, then overwriting it. In the meantime,
> > smp_text_poke_int3_handler() relies on getting an accurate RIP.
> > 
> > This temporary int3 from smp_text_poke is triggering the exact scenario
> > described in the fixes commit: "if the guest executes INTn (no KVM
> > injection), an exit occurs while vectoring the INTn, and the guest
> > modifies the code stream while the exit is being handled, KVM will
> > compute the incorrect next_rip due to "skipping" the wrong instruction."
> > 
> > I'm able to reproduce this almost instantly by patching the guest kernel
> > to hammer on a static branch [1] while a drgn script [2] on the host
> > constantly swaps out the memory containing the guest's Task State
> > Segment.
> 
> I would actually just say "TSS".  Normally I'm all for spelling out acronyms on
> their first use, but in this case I think TSS is more obviously a "thing" than
> Task State Segment, e.g. I had a momentary brain fart and was wondering what you
> meant by swapping out a segment :-).
> 
> > The fixes commit also suggests a workaround: "A future enhancement to
> > make this less awful would be for KVM to detect that the decoded
> > instruction is not the correct INTn and drop the to-be-injected soft
> > event." This implements that workaround.
> 
> Please focus on the actual change.  Again, I love the detail, but that was a _lot_
> to get through just to see "Sorry, but the princess is in another castle." :-D

Fair enough.

> E.g. I think this captures everything in about the same word count, but with a
> stronger focus on what the patch is actually doing.
> 
>   When re-injecting an soft interrupt from an INT3, INT0, or (select) INTn
>   instruction, discard the exception and retry the instruction if the code
>   stream is changed (e.g. by a different vCPU) between when the CPU executes
>   the instruction and when KVM decodes the instruction to get the next RIP.
> 
>   As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject
>   INT3/INTO instead of retrying the instruction"), failure to verify the
>   correct INTn instruction was decoded can effectively clobber guest state
>   due to decoding the wrong instruction and thus specifying the wrong next
>   RIP.
> 
>   The bug most often manifests as "Oops: int3" panics on static branch
>   checks when running Linux guests.  Linux's static branch patching uses
>   he kernel's "text poke" code patching   mechanism.  To modify code while
>   other CPUs may be executing said code, Linux (temporarily) replaces the
>   first byte of the original instruction with an int3 (opcode 0xcc), then
>   patches in the new code stream except for the first byte, and finally
>   replaces the int3 the first byte of the new code stream.  If a CPU hits
>   the int3, i.e. executes the code while it's being modified, the guest
>   kernel will gracefully handle the resulting #BP, e.g. by looping until
>   the int3 is replaced, by emulating the new instruction, etc.
> 
>   E.g. the bug reproduces almost instantly by hacking the guest kernel to
>   provide a convenient static branch[1], while running a drgn script[2] on
>   the host to constantly swap out the memory containing the guest's TSS.
> 
> > [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
> > [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
> > 
> > Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
> 
> Cc: stable@vger.kernel.org
> 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > Based on Linus's current tree.
> > 
> >  arch/x86/kvm/svm/svm.c | 40 ++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 36 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 153c12dbf3eb..4d72c308b50b 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -271,8 +271,29 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> >  
> >  }
> >  
> > +static bool emulated_instruction_matches_vector(struct kvm_vcpu *vcpu,
> > +						unsigned int vector)
> > +{
> > +	switch (vector) {
> > +	case BP_VECTOR:
> > +		return vcpu->arch.emulate_ctxt->b == 0xcc ||
> > +		       (vcpu->arch.emulate_ctxt->b == 0xcd &&
> > +			vcpu->arch.emulate_ctxt->src.val == BP_VECTOR);
> > +	case OF_VECTOR:
> > +		return vcpu->arch.emulate_ctxt->b == 0xce;
> 
> Hmm, unless I'm missing something, this should handle 0xcd for both.

Yup, I forgot about int 4.

> > +	default:
> > +		return false;
> > +	}
> > +}
> > +
> > +/*
> > + * If vector != 0, then this skips the instruction only if the instruction could
> > + * generate an interrupt with that vector. If not, then it fails, indicating
> > + * that the instruction should be retried.
> > + */
> >  static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> > -					   bool commit_side_effects)
> > +					   bool commit_side_effects,
> > +					   unsigned int vector)
> >  {
> >  	struct vcpu_svm *svm = to_svm(vcpu);
> >  	unsigned long old_rflags;
> > @@ -293,8 +314,18 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> >  		if (unlikely(!commit_side_effects))
> >  			old_rflags = svm->vmcb->save.rflags;
> >  
> > -		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> > +		if (vector == 0) {
> > +			if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> > +				return 0;
> > +		} else if (x86_decode_emulated_instruction(vcpu, EMULTYPE_SKIP,
> > +							   NULL,
> > +							   0) != EMULATION_OK ||
> 
> I think I would rather handle this in kvm_emulate_instruction(), not in the SVM
> code.  x86_decode_emulated_instruction() really shouldn't be exposed outside of
> x86.c, the use in gp_interception() is all kinds of gross. :-/
> 
> The best idea I can come up with is to add an EMULTYPE_SKIP_SOFT_INT, and use that
> to communicate to the kvm_emulate_instruction() that it should verify the
> to-be-skipped instruction.  I don't love the implicit passing of the vector via
> vcpu->arch.exception.vector, but all of this code is quite heinous, so I won't
> loose sleep over it.
> 
> Compile-tested only, but something like this?
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2bfae1cfa514..eca4704e3934 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2154,6 +2154,7 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>  #define EMULTYPE_PF                (1 << 6)
>  #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
>  #define EMULTYPE_WRITE_PF_TO_SP            (1 << 8)
> +#define EMULTYPE_SKIP_SOFT_INT     ((1 << 9) | EMULTYPE_SKIP)
>  
>  static inline bool kvm_can_emulate_event_vectoring(int emul_type)
>  {
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index f14709a511aa..82e97f2f8635 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -272,6 +272,7 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
>  }
>  
>  static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> +                                          int emul_type,
>                                            bool commit_side_effects)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
> @@ -293,7 +294,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
>                 if (unlikely(!commit_side_effects))
>                         old_rflags = svm->vmcb->save.rflags;
>  
> -               if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> +               if (!kvm_emulate_instruction(vcpu, emul_type))
>                         return 0;
>  
>                 if (unlikely(!commit_side_effects))
> @@ -311,7 +312,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
>  
>  static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>  {
> -       return __svm_skip_emulated_instruction(vcpu, true);
> +       return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true);
>  }
>  
>  static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
> @@ -331,7 +332,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
>          * in use, the skip must not commit any side effects such as clearing
>          * the interrupt shadow or RFLAGS.RF.
>          */
> -       if (!__svm_skip_emulated_instruction(vcpu, !nrips))
> +       if (!__svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP_SOFT_INT, !nrips))
>                 return -EIO;
>  
>         rip = kvm_rip_read(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 593fccc9cf1c..500f9b7f564e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9351,6 +9351,23 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
>         return false;
>  }
>  
> +static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt, u8 vector)
> +{
> +       if (WARN_ON_ONCE(vector != BP_VECTOR && vector != OF_VECTOR))
> +               return false;

This warning triggers when called from the svm_inject_irq() path since
that case uses vcpu->arch.interrupt.nr. I can't tell whether it'd be
okay to set vcpu->arch.exception.vector = vcpu->arch.interrupt.nr in
that case, or if we need yet another emultype. (I believe my patch had
the same problem, but silently.)

> +       switch (ctxt->b) {
> +       case 0xcc:
> +               return vector == BP_VECTOR;
> +       case 0xcd:
> +               return vector == ctxt->src.val;
> +       case 0xce:
> +               return vector == OF_VECTOR;
> +       default:
> +               return false;
> +       }
> +}
> +
>  /*
>   * Decode an instruction for emulation.  The caller is responsible for handling
>   * code breakpoints.  Note, manually detecting code breakpoints is unnecessary
> @@ -9461,6 +9478,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>          * injecting single-step #DBs.
>          */
>         if (emulation_type & EMULTYPE_SKIP) {
> +               if ((emulation_type & EMULTYPE_SKIP_SOFT_INT) &&

This needs to be (emulation_type & EMULTYPE_SKIP_SOFT_INT) == EMULTYPE_SKIP_SOFT_INT
(either that or making EMULTYPE_SKIP_SOFT_INT not include
EMULTYPE_SKIP).

Did you intend to send this out as a patch, or should I send a v2 based
on this?

Thanks,
Omar

> +                   !is_soft_int_instruction(ctxt, vcpu->arch.exception.vector))
> +                       return 0;
> +
>                 if (ctxt->mode != X86EMUL_MODE_PROT64)
>                         ctxt->eip = (u32)ctxt->_eip;
>                 else

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced
  2025-10-28  6:44   ` Omar Sandoval
@ 2025-10-29  0:18     ` Sean Christopherson
  0 siblings, 0 replies; 4+ messages in thread
From: Sean Christopherson @ 2025-10-29  0:18 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Paolo Bonzini, kvm, Gregory Price, kernel-team

On Mon, Oct 27, 2025, Omar Sandoval wrote:
> On Mon, Oct 27, 2025 at 05:24:47PM -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 593fccc9cf1c..500f9b7f564e 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9351,6 +9351,23 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
> >         return false;
> >  }
> >  
> > +static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt, u8 vector)
> > +{
> > +       if (WARN_ON_ONCE(vector != BP_VECTOR && vector != OF_VECTOR))
> > +               return false;
> 
> This warning triggers when called from the svm_inject_irq() path since
> that case uses vcpu->arch.interrupt.nr. I can't tell whether it'd be
> okay to set vcpu->arch.exception.vector = vcpu->arch.interrupt.nr in
> that case

It's not, they are two separate things/concepts.  Even if we can somehow squeak
by (an exception can't/shouldn't be pending), that crosses over the "acceptable
hack" threshold for me.

> or if we need yet another emultype.

We have enough EMULTYPE bits available, we can just shove the vector into bits
23:16.  That definitely makes me question whether or not handling this in the
emulator is actually better than dealing with it entirely in svm.c, but after
looking at your original patch again, my vote is still to handle it in the
emulator, as there are subtle flaws in the original patch.

-               if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
+               if (vector == 0) {

This actually needs to be "vector < 0", with @vector being an "int", as I'm
pretty sure that generating a #DE with INTn is legal.

+                       if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
+                               return 0;
+               } else if (x86_decode_emulated_instruction(vcpu, EMULTYPE_SKIP,
+                                                          NULL,
+                                                          0) != EMULATION_OK ||
+                          !emulated_instruction_matches_vector(vcpu, vector) ||
+                          !kvm_emulate_instruction(vcpu,
+                                                   EMULTYPE_SKIP |
+                                                   EMULTYPE_NO_DECODE)) {
                        return 0;

And this early return is flawed because it doesn't unwind save.rflags, which
somewhat confusingly would need to be done irrespective of the value of
@commit_side_effects.  That's easy enough to fix, though the code would be ugly.
All in all, I think dealing with this in SVM would be harder to follow, even
though it would be nice to contain the SVM insanity to SVM :-/

+               }

                if (unlikely(!commit_side_effects))
                        svm->vmcb->save.rflags = old_rflags;

> (I believe my patch had the same problem, but silently.)

Gah, I was looking for that path and couldn't find it.  Hrm, the entire WARN is
somewhat bogus as the INTn case means vector can be anything architecturally
legal.  Side note, that also the "vector == ctxt->src.val" check for 0xcd is
meaningful.

> > +       switch (ctxt->b) {
> > +       case 0xcc:
> > +               return vector == BP_VECTOR;
> > +       case 0xcd:
> > +               return vector == ctxt->src.val;
> > +       case 0xce:
> > +               return vector == OF_VECTOR;
> > +       default:
> > +               return false;
> > +       }
> > +}
> > +
> >  /*
> >   * Decode an instruction for emulation.  The caller is responsible for handling
> >   * code breakpoints.  Note, manually detecting code breakpoints is unnecessary
> > @@ -9461,6 +9478,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >          * injecting single-step #DBs.
> >          */
> >         if (emulation_type & EMULTYPE_SKIP) {
> > +               if ((emulation_type & EMULTYPE_SKIP_SOFT_INT) &&
> 
> This needs to be (emulation_type & EMULTYPE_SKIP_SOFT_INT) == EMULTYPE_SKIP_SOFT_INT
> (either that or making EMULTYPE_SKIP_SOFT_INT not include
> EMULTYPE_SKIP).

Hrm, for consistency with the other EMULTYPE flags, probably best to make it
separate.

> Did you intend to send this out as a patch, or should I send a v2 based
> on this?

Your choice.  I'm happy to send a formal patch, but I'm just as happy to let you
do the testing and take all the credit :-).  If you want, you can also throw a
Co-developed-by: my way, but that's entirely optional.

Roughly this?  In the end, it doesn't look too awful.

---
Signed-off-by: Sean Christopherson <seanjc@google.com>

---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/kvm/svm/svm.c          | 24 +++++++++++++-----------
 arch/x86/kvm/x86.c              | 21 +++++++++++++++++++++
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48598d017d6f..e6c242d89869 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2153,6 +2153,10 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 #define EMULTYPE_PF		    (1 << 6)
 #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
 #define EMULTYPE_WRITE_PF_TO_SP	    (1 << 8)
+#define EMULTYPE_SKIP_SOFT_INT	    (1 << 9)
+
+#define EMULTYPE_SET_SOFT_INT_VECTOR(v)	(((v) & 0xff) << 16)
+#define EMULTYPE_GET_SOFT_INT_VECTOR(e)	(((e) >> 16) & 0xff)
 
 static inline bool kvm_can_emulate_event_vectoring(int emul_type)
 {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f14709a511aa..625f51890e29 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -272,6 +272,7 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
 }
 
 static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
+					   int emul_type,
 					   bool commit_side_effects)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -293,7 +294,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
 		if (unlikely(!commit_side_effects))
 			old_rflags = svm->vmcb->save.rflags;
 
-		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
+		if (!kvm_emulate_instruction(vcpu, emul_type))
 			return 0;
 
 		if (unlikely(!commit_side_effects))
@@ -311,11 +312,13 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
 
 static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
-	return __svm_skip_emulated_instruction(vcpu, true);
+	return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true);
 }
 
-static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
+static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu, u8 vector)
 {
+	const int emul_type = EMULTYPE_SKIP | EMULTYPE_SKIP_SOFT_INT |
+			      EMULTYPE_GET_SOFT_INT_VECTOR(vector);
 	unsigned long rip, old_rip = kvm_rip_read(vcpu);
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -331,7 +334,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
 	 * in use, the skip must not commit any side effects such as clearing
 	 * the interrupt shadow or RFLAGS.RF.
 	 */
-	if (!__svm_skip_emulated_instruction(vcpu, !nrips))
+	if (!__svm_skip_emulated_instruction(vcpu, emul_type, !nrips))
 		return -EIO;
 
 	rip = kvm_rip_read(vcpu);
@@ -367,7 +370,7 @@ static void svm_inject_exception(struct kvm_vcpu *vcpu)
 	kvm_deliver_exception_payload(vcpu, ex);
 
 	if (kvm_exception_is_soft(ex->vector) &&
-	    svm_update_soft_interrupt_rip(vcpu))
+	    svm_update_soft_interrupt_rip(vcpu, ex->vector))
 		return;
 
 	svm->vmcb->control.event_inj = ex->vector
@@ -3642,11 +3645,12 @@ static bool svm_set_vnmi_pending(struct kvm_vcpu *vcpu)
 
 static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
 {
+	struct kvm_queued_interrupt *intr = &vcpu->arch.interrupt;
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u32 type;
 
-	if (vcpu->arch.interrupt.soft) {
-		if (svm_update_soft_interrupt_rip(vcpu))
+	if (intr->soft) {
+		if (svm_update_soft_interrupt_rip(vcpu, intr->nr))
 			return;
 
 		type = SVM_EVTINJ_TYPE_SOFT;
@@ -3654,12 +3658,10 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
 		type = SVM_EVTINJ_TYPE_INTR;
 	}
 
-	trace_kvm_inj_virq(vcpu->arch.interrupt.nr,
-			   vcpu->arch.interrupt.soft, reinjected);
+	trace_kvm_inj_virq(intr->nr, intr->soft, reinjected);
 	++vcpu->stat.irq_injections;
 
-	svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
-				       SVM_EVTINJ_VALID | type;
+	svm->vmcb->control.event_inj = intr->nr | SVM_EVTINJ_VALID | type;
 }
 
 void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b4b5d2d09634..9dc66cca154d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9338,6 +9338,23 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
 	return false;
 }
 
+static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt,
+				    int emulation_type)
+{
+	u8 vector = EMULTYPE_GET_SOFT_INT_VECTOR(emulation_type);
+
+	switch (ctxt->b) {
+	case 0xcc:
+		return vector == BP_VECTOR;
+	case 0xcd:
+		return vector == ctxt->src.val;
+	case 0xce:
+		return vector == OF_VECTOR;
+	default:
+		return false;
+	}
+}
+
 /*
  * Decode an instruction for emulation.  The caller is responsible for handling
  * code breakpoints.  Note, manually detecting code breakpoints is unnecessary
@@ -9448,6 +9465,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 * injecting single-step #DBs.
 	 */
 	if (emulation_type & EMULTYPE_SKIP) {
+		if (emulation_type & EMULTYPE_SKIP_SOFT_INT &&
+		    !is_soft_int_instruction(ctxt, emulation_type))
+			return 0;
+
 		if (ctxt->mode != X86EMUL_MODE_PROT64)
 			ctxt->eip = (u32)ctxt->_eip;
 		else

base-commit: 4cc167c50eb19d44ac7e204938724e685e3d8057
--

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-10-29  0:18 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-27 23:06 [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced Omar Sandoval
2025-10-28  0:24 ` Sean Christopherson
2025-10-28  6:44   ` Omar Sandoval
2025-10-29  0:18     ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox