public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Omar Sandoval <osandov@osandov.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, Gregory Price <gourry@gourry.net>,
	kernel-team@fb.com
Subject: Re: [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced
Date: Mon, 27 Oct 2025 23:44:00 -0700	[thread overview]
Message-ID: <aQBmME40vhf-lh7R@telecaster> (raw)
In-Reply-To: <aQANT9rvO9FMmmkG@google.com>

On Mon, Oct 27, 2025 at 05:24:47PM -0700, Sean Christopherson wrote:
> "INT3/INTO", because the code handles both.
> 
> On Mon, Oct 27, 2025, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > We've been seeing guest VM kernel panics with "Oops: int3" coming from
> 
> No pronouns please, "we" relies too much on context and that context can be lost
> over time, or might be interpreted differently by readers, etc.
> 
> > static branch checks. I found that the reported RIP is one instruction
> > after where it's supposed to be, and I determined that when this
> > happens, the RIP was injected by __svm_skip_emulated_instruction() on
> > the host when a nested page fault was hit while vectoring an int3.
> 
> I definitely appreciate the gory details, but please capture just the technical
> details of the bug and fix.  A play-by-play of the debugging process is useful
> (and interesting!) for bug reports and cover letters, but it's mostly noise for
> future readers that usually care about what was changed and/or what the code is
> doing.
> 
> > Static branches use the smp_text_poke mechanism, which works by
> > temporarily inserting an int3, then overwriting it. In the meantime,
> > smp_text_poke_int3_handler() relies on getting an accurate RIP.
> > 
> > This temporary int3 from smp_text_poke is triggering the exact scenario
> > described in the fixes commit: "if the guest executes INTn (no KVM
> > injection), an exit occurs while vectoring the INTn, and the guest
> > modifies the code stream while the exit is being handled, KVM will
> > compute the incorrect next_rip due to "skipping" the wrong instruction."
> > 
> > I'm able to reproduce this almost instantly by patching the guest kernel
> > to hammer on a static branch [1] while a drgn script [2] on the host
> > constantly swaps out the memory containing the guest's Task State
> > Segment.
> 
> I would actually just say "TSS".  Normally I'm all for spelling out acronyms on
> their first use, but in this case I think TSS is more obviously a "thing" than
> Task State Segment, e.g. I had a momentary brain fart and was wondering what you
> meant by swapping out a segment :-).
> 
> > The fixes commit also suggests a workaround: "A future enhancement to
> > make this less awful would be for KVM to detect that the decoded
> > instruction is not the correct INTn and drop the to-be-injected soft
> > event." This implements that workaround.
> 
> Please focus on the actual change.  Again, I love the detail, but that was a _lot_
> to get through just to see "Sorry, but the princess is in another castle." :-D

Fair enough.

> E.g. I think this captures everything in about the same word count, but with a
> stronger focus on what the patch is actually doing.
> 
>   When re-injecting an soft interrupt from an INT3, INT0, or (select) INTn
>   instruction, discard the exception and retry the instruction if the code
>   stream is changed (e.g. by a different vCPU) between when the CPU executes
>   the instruction and when KVM decodes the instruction to get the next RIP.
> 
>   As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject
>   INT3/INTO instead of retrying the instruction"), failure to verify the
>   correct INTn instruction was decoded can effectively clobber guest state
>   due to decoding the wrong instruction and thus specifying the wrong next
>   RIP.
> 
>   The bug most often manifests as "Oops: int3" panics on static branch
>   checks when running Linux guests.  Linux's static branch patching uses
>   he kernel's "text poke" code patching   mechanism.  To modify code while
>   other CPUs may be executing said code, Linux (temporarily) replaces the
>   first byte of the original instruction with an int3 (opcode 0xcc), then
>   patches in the new code stream except for the first byte, and finally
>   replaces the int3 the first byte of the new code stream.  If a CPU hits
>   the int3, i.e. executes the code while it's being modified, the guest
>   kernel will gracefully handle the resulting #BP, e.g. by looping until
>   the int3 is replaced, by emulating the new instruction, etc.
> 
>   E.g. the bug reproduces almost instantly by hacking the guest kernel to
>   provide a convenient static branch[1], while running a drgn script[2] on
>   the host to constantly swap out the memory containing the guest's TSS.
> 
> > [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
> > [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
> > 
> > Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
> 
> Cc: stable@vger.kernel.org
> 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > Based on Linus's current tree.
> > 
> >  arch/x86/kvm/svm/svm.c | 40 ++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 36 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 153c12dbf3eb..4d72c308b50b 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -271,8 +271,29 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> >  
> >  }
> >  
> > +static bool emulated_instruction_matches_vector(struct kvm_vcpu *vcpu,
> > +						unsigned int vector)
> > +{
> > +	switch (vector) {
> > +	case BP_VECTOR:
> > +		return vcpu->arch.emulate_ctxt->b == 0xcc ||
> > +		       (vcpu->arch.emulate_ctxt->b == 0xcd &&
> > +			vcpu->arch.emulate_ctxt->src.val == BP_VECTOR);
> > +	case OF_VECTOR:
> > +		return vcpu->arch.emulate_ctxt->b == 0xce;
> 
> Hmm, unless I'm missing something, this should handle 0xcd for both.

Yup, I forgot about int 4.

> > +	default:
> > +		return false;
> > +	}
> > +}
> > +
> > +/*
> > + * If vector != 0, then this skips the instruction only if the instruction could
> > + * generate an interrupt with that vector. If not, then it fails, indicating
> > + * that the instruction should be retried.
> > + */
> >  static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> > -					   bool commit_side_effects)
> > +					   bool commit_side_effects,
> > +					   unsigned int vector)
> >  {
> >  	struct vcpu_svm *svm = to_svm(vcpu);
> >  	unsigned long old_rflags;
> > @@ -293,8 +314,18 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> >  		if (unlikely(!commit_side_effects))
> >  			old_rflags = svm->vmcb->save.rflags;
> >  
> > -		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> > +		if (vector == 0) {
> > +			if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> > +				return 0;
> > +		} else if (x86_decode_emulated_instruction(vcpu, EMULTYPE_SKIP,
> > +							   NULL,
> > +							   0) != EMULATION_OK ||
> 
> I think I would rather handle this in kvm_emulate_instruction(), not in the SVM
> code.  x86_decode_emulated_instruction() really shouldn't be exposed outside of
> x86.c, the use in gp_interception() is all kinds of gross. :-/
> 
> The best idea I can come up with is to add an EMULTYPE_SKIP_SOFT_INT, and use that
> to communicate to the kvm_emulate_instruction() that it should verify the
> to-be-skipped instruction.  I don't love the implicit passing of the vector via
> vcpu->arch.exception.vector, but all of this code is quite heinous, so I won't
> loose sleep over it.
> 
> Compile-tested only, but something like this?
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2bfae1cfa514..eca4704e3934 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2154,6 +2154,7 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>  #define EMULTYPE_PF                (1 << 6)
>  #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
>  #define EMULTYPE_WRITE_PF_TO_SP            (1 << 8)
> +#define EMULTYPE_SKIP_SOFT_INT     ((1 << 9) | EMULTYPE_SKIP)
>  
>  static inline bool kvm_can_emulate_event_vectoring(int emul_type)
>  {
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index f14709a511aa..82e97f2f8635 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -272,6 +272,7 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
>  }
>  
>  static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> +                                          int emul_type,
>                                            bool commit_side_effects)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
> @@ -293,7 +294,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
>                 if (unlikely(!commit_side_effects))
>                         old_rflags = svm->vmcb->save.rflags;
>  
> -               if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> +               if (!kvm_emulate_instruction(vcpu, emul_type))
>                         return 0;
>  
>                 if (unlikely(!commit_side_effects))
> @@ -311,7 +312,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
>  
>  static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>  {
> -       return __svm_skip_emulated_instruction(vcpu, true);
> +       return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true);
>  }
>  
>  static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
> @@ -331,7 +332,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
>          * in use, the skip must not commit any side effects such as clearing
>          * the interrupt shadow or RFLAGS.RF.
>          */
> -       if (!__svm_skip_emulated_instruction(vcpu, !nrips))
> +       if (!__svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP_SOFT_INT, !nrips))
>                 return -EIO;
>  
>         rip = kvm_rip_read(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 593fccc9cf1c..500f9b7f564e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9351,6 +9351,23 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
>         return false;
>  }
>  
> +static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt, u8 vector)
> +{
> +       if (WARN_ON_ONCE(vector != BP_VECTOR && vector != OF_VECTOR))
> +               return false;

This warning triggers when called from the svm_inject_irq() path since
that case uses vcpu->arch.interrupt.nr. I can't tell whether it'd be
okay to set vcpu->arch.exception.vector = vcpu->arch.interrupt.nr in
that case, or if we need yet another emultype. (I believe my patch had
the same problem, but silently.)

> +       switch (ctxt->b) {
> +       case 0xcc:
> +               return vector == BP_VECTOR;
> +       case 0xcd:
> +               return vector == ctxt->src.val;
> +       case 0xce:
> +               return vector == OF_VECTOR;
> +       default:
> +               return false;
> +       }
> +}
> +
>  /*
>   * Decode an instruction for emulation.  The caller is responsible for handling
>   * code breakpoints.  Note, manually detecting code breakpoints is unnecessary
> @@ -9461,6 +9478,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>          * injecting single-step #DBs.
>          */
>         if (emulation_type & EMULTYPE_SKIP) {
> +               if ((emulation_type & EMULTYPE_SKIP_SOFT_INT) &&

This needs to be (emulation_type & EMULTYPE_SKIP_SOFT_INT) == EMULTYPE_SKIP_SOFT_INT
(either that or making EMULTYPE_SKIP_SOFT_INT not include
EMULTYPE_SKIP).

Did you intend to send this out as a patch, or should I send a v2 based
on this?

Thanks,
Omar

> +                   !is_soft_int_instruction(ctxt, vcpu->arch.exception.vector))
> +                       return 0;
> +
>                 if (ctxt->mode != X86EMUL_MODE_PROT64)
>                         ctxt->eip = (u32)ctxt->_eip;
>                 else

  reply	other threads:[~2025-10-28  6:44 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-27 23:06 [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced Omar Sandoval
2025-10-28  0:24 ` Sean Christopherson
2025-10-28  6:44   ` Omar Sandoval [this message]
2025-10-29  0:18     ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQBmME40vhf-lh7R@telecaster \
    --to=osandov@osandov.com \
    --cc=gourry@gourry.net \
    --cc=kernel-team@fb.com \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox