All of lore.kernel.org
 help / color / mirror / Atom feed
From: Omar Sandoval <osandov@osandov.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, Gregory Price <gourry@gourry.net>,
	kernel-team@fb.com
Subject: Re: [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced
Date: Mon, 27 Oct 2025 23:44:00 -0700	[thread overview]
Message-ID: <aQBmME40vhf-lh7R@telecaster> (raw)
In-Reply-To: <aQANT9rvO9FMmmkG@google.com>

On Mon, Oct 27, 2025 at 05:24:47PM -0700, Sean Christopherson wrote:
> "INT3/INTO", because the code handles both.
> 
> On Mon, Oct 27, 2025, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > We've been seeing guest VM kernel panics with "Oops: int3" coming from
> 
> No pronouns please, "we" relies too much on context and that context can be lost
> over time, or might be interpreted differently by readers, etc.
> 
> > static branch checks. I found that the reported RIP is one instruction
> > after where it's supposed to be, and I determined that when this
> > happens, the RIP was injected by __svm_skip_emulated_instruction() on
> > the host when a nested page fault was hit while vectoring an int3.
> 
> I definitely appreciate the gory details, but please capture just the technical
> details of the bug and fix.  A play-by-play of the debugging process is useful
> (and interesting!) for bug reports and cover letters, but it's mostly noise for
> future readers that usually care about what was changed and/or what the code is
> doing.
> 
> > Static branches use the smp_text_poke mechanism, which works by
> > temporarily inserting an int3, then overwriting it. In the meantime,
> > smp_text_poke_int3_handler() relies on getting an accurate RIP.
> > 
> > This temporary int3 from smp_text_poke is triggering the exact scenario
> > described in the fixes commit: "if the guest executes INTn (no KVM
> > injection), an exit occurs while vectoring the INTn, and the guest
> > modifies the code stream while the exit is being handled, KVM will
> > compute the incorrect next_rip due to "skipping" the wrong instruction."
> > 
> > I'm able to reproduce this almost instantly by patching the guest kernel
> > to hammer on a static branch [1] while a drgn script [2] on the host
> > constantly swaps out the memory containing the guest's Task State
> > Segment.
> 
> I would actually just say "TSS".  Normally I'm all for spelling out acronyms on
> their first use, but in this case I think TSS is more obviously a "thing" than
> Task State Segment, e.g. I had a momentary brain fart and was wondering what you
> meant by swapping out a segment :-).
> 
> > The fixes commit also suggests a workaround: "A future enhancement to
> > make this less awful would be for KVM to detect that the decoded
> > instruction is not the correct INTn and drop the to-be-injected soft
> > event." This implements that workaround.
> 
> Please focus on the actual change.  Again, I love the detail, but that was a _lot_
> to get through just to see "Sorry, but the princess is in another castle." :-D

Fair enough.

> E.g. I think this captures everything in about the same word count, but with a
> stronger focus on what the patch is actually doing.
> 
>   When re-injecting an soft interrupt from an INT3, INT0, or (select) INTn
>   instruction, discard the exception and retry the instruction if the code
>   stream is changed (e.g. by a different vCPU) between when the CPU executes
>   the instruction and when KVM decodes the instruction to get the next RIP.
> 
>   As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject
>   INT3/INTO instead of retrying the instruction"), failure to verify the
>   correct INTn instruction was decoded can effectively clobber guest state
>   due to decoding the wrong instruction and thus specifying the wrong next
>   RIP.
> 
>   The bug most often manifests as "Oops: int3" panics on static branch
>   checks when running Linux guests.  Linux's static branch patching uses
>   he kernel's "text poke" code patching   mechanism.  To modify code while
>   other CPUs may be executing said code, Linux (temporarily) replaces the
>   first byte of the original instruction with an int3 (opcode 0xcc), then
>   patches in the new code stream except for the first byte, and finally
>   replaces the int3 the first byte of the new code stream.  If a CPU hits
>   the int3, i.e. executes the code while it's being modified, the guest
>   kernel will gracefully handle the resulting #BP, e.g. by looping until
>   the int3 is replaced, by emulating the new instruction, etc.
> 
>   E.g. the bug reproduces almost instantly by hacking the guest kernel to
>   provide a convenient static branch[1], while running a drgn script[2] on
>   the host to constantly swap out the memory containing the guest's TSS.
> 
> > [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
> > [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
> > 
> > Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
> 
> Cc: stable@vger.kernel.org
> 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > Based on Linus's current tree.
> > 
> >  arch/x86/kvm/svm/svm.c | 40 ++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 36 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 153c12dbf3eb..4d72c308b50b 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -271,8 +271,29 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> >  
> >  }
> >  
> > +static bool emulated_instruction_matches_vector(struct kvm_vcpu *vcpu,
> > +						unsigned int vector)
> > +{
> > +	switch (vector) {
> > +	case BP_VECTOR:
> > +		return vcpu->arch.emulate_ctxt->b == 0xcc ||
> > +		       (vcpu->arch.emulate_ctxt->b == 0xcd &&
> > +			vcpu->arch.emulate_ctxt->src.val == BP_VECTOR);
> > +	case OF_VECTOR:
> > +		return vcpu->arch.emulate_ctxt->b == 0xce;
> 
> Hmm, unless I'm missing something, this should handle 0xcd for both.

Yup, I forgot about int 4.

> > +	default:
> > +		return false;
> > +	}
> > +}
> > +
> > +/*
> > + * If vector != 0, then this skips the instruction only if the instruction could
> > + * generate an interrupt with that vector. If not, then it fails, indicating
> > + * that the instruction should be retried.
> > + */
> >  static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> > -					   bool commit_side_effects)
> > +					   bool commit_side_effects,
> > +					   unsigned int vector)
> >  {
> >  	struct vcpu_svm *svm = to_svm(vcpu);
> >  	unsigned long old_rflags;
> > @@ -293,8 +314,18 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> >  		if (unlikely(!commit_side_effects))
> >  			old_rflags = svm->vmcb->save.rflags;
> >  
> > -		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> > +		if (vector == 0) {
> > +			if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> > +				return 0;
> > +		} else if (x86_decode_emulated_instruction(vcpu, EMULTYPE_SKIP,
> > +							   NULL,
> > +							   0) != EMULATION_OK ||
> 
> I think I would rather handle this in kvm_emulate_instruction(), not in the SVM
> code.  x86_decode_emulated_instruction() really shouldn't be exposed outside of
> x86.c, the use in gp_interception() is all kinds of gross. :-/
> 
> The best idea I can come up with is to add an EMULTYPE_SKIP_SOFT_INT, and use that
> to communicate to the kvm_emulate_instruction() that it should verify the
> to-be-skipped instruction.  I don't love the implicit passing of the vector via
> vcpu->arch.exception.vector, but all of this code is quite heinous, so I won't
> loose sleep over it.
> 
> Compile-tested only, but something like this?
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2bfae1cfa514..eca4704e3934 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2154,6 +2154,7 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>  #define EMULTYPE_PF                (1 << 6)
>  #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
>  #define EMULTYPE_WRITE_PF_TO_SP            (1 << 8)
> +#define EMULTYPE_SKIP_SOFT_INT     ((1 << 9) | EMULTYPE_SKIP)
>  
>  static inline bool kvm_can_emulate_event_vectoring(int emul_type)
>  {
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index f14709a511aa..82e97f2f8635 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -272,6 +272,7 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
>  }
>  
>  static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> +                                          int emul_type,
>                                            bool commit_side_effects)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
> @@ -293,7 +294,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
>                 if (unlikely(!commit_side_effects))
>                         old_rflags = svm->vmcb->save.rflags;
>  
> -               if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> +               if (!kvm_emulate_instruction(vcpu, emul_type))
>                         return 0;
>  
>                 if (unlikely(!commit_side_effects))
> @@ -311,7 +312,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
>  
>  static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>  {
> -       return __svm_skip_emulated_instruction(vcpu, true);
> +       return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true);
>  }
>  
>  static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
> @@ -331,7 +332,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
>          * in use, the skip must not commit any side effects such as clearing
>          * the interrupt shadow or RFLAGS.RF.
>          */
> -       if (!__svm_skip_emulated_instruction(vcpu, !nrips))
> +       if (!__svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP_SOFT_INT, !nrips))
>                 return -EIO;
>  
>         rip = kvm_rip_read(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 593fccc9cf1c..500f9b7f564e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9351,6 +9351,23 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
>         return false;
>  }
>  
> +static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt, u8 vector)
> +{
> +       if (WARN_ON_ONCE(vector != BP_VECTOR && vector != OF_VECTOR))
> +               return false;

This warning triggers when called from the svm_inject_irq() path since
that case uses vcpu->arch.interrupt.nr. I can't tell whether it'd be
okay to set vcpu->arch.exception.vector = vcpu->arch.interrupt.nr in
that case, or if we need yet another emultype. (I believe my patch had
the same problem, but silently.)

> +       switch (ctxt->b) {
> +       case 0xcc:
> +               return vector == BP_VECTOR;
> +       case 0xcd:
> +               return vector == ctxt->src.val;
> +       case 0xce:
> +               return vector == OF_VECTOR;
> +       default:
> +               return false;
> +       }
> +}
> +
>  /*
>   * Decode an instruction for emulation.  The caller is responsible for handling
>   * code breakpoints.  Note, manually detecting code breakpoints is unnecessary
> @@ -9461,6 +9478,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>          * injecting single-step #DBs.
>          */
>         if (emulation_type & EMULTYPE_SKIP) {
> +               if ((emulation_type & EMULTYPE_SKIP_SOFT_INT) &&

This needs to be (emulation_type & EMULTYPE_SKIP_SOFT_INT) == EMULTYPE_SKIP_SOFT_INT
(either that or making EMULTYPE_SKIP_SOFT_INT not include
EMULTYPE_SKIP).

Did you intend to send this out as a patch, or should I send a v2 based
on this?

Thanks,
Omar

> +                   !is_soft_int_instruction(ctxt, vcpu->arch.exception.vector))
> +                       return 0;
> +
>                 if (ctxt->mode != X86EMUL_MODE_PROT64)
>                         ctxt->eip = (u32)ctxt->_eip;
>                 else

  reply	other threads:[~2025-10-28  6:44 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-27 23:06 [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced Omar Sandoval
2025-10-28  0:24 ` Sean Christopherson
2025-10-28  6:44   ` Omar Sandoval [this message]
2025-10-29  0:18     ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQBmME40vhf-lh7R@telecaster \
    --to=osandov@osandov.com \
    --cc=gourry@gourry.net \
    --cc=kernel-team@fb.com \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.