From: Omar Sandoval <osandov@osandov.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
kvm@vger.kernel.org, Gregory Price <gourry@gourry.net>,
kernel-team@fb.com
Subject: Re: [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced
Date: Mon, 27 Oct 2025 23:44:00 -0700 [thread overview]
Message-ID: <aQBmME40vhf-lh7R@telecaster> (raw)
In-Reply-To: <aQANT9rvO9FMmmkG@google.com>
On Mon, Oct 27, 2025 at 05:24:47PM -0700, Sean Christopherson wrote:
> "INT3/INTO", because the code handles both.
>
> On Mon, Oct 27, 2025, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> >
> > We've been seeing guest VM kernel panics with "Oops: int3" coming from
>
> No pronouns please, "we" relies too much on context and that context can be lost
> over time, or might be interpreted differently by readers, etc.
>
> > static branch checks. I found that the reported RIP is one instruction
> > after where it's supposed to be, and I determined that when this
> > happens, the RIP was injected by __svm_skip_emulated_instruction() on
> > the host when a nested page fault was hit while vectoring an int3.
>
> I definitely appreciate the gory details, but please capture just the technical
> details of the bug and fix. A play-by-play of the debugging process is useful
> (and interesting!) for bug reports and cover letters, but it's mostly noise for
> future readers that usually care about what was changed and/or what the code is
> doing.
>
> > Static branches use the smp_text_poke mechanism, which works by
> > temporarily inserting an int3, then overwriting it. In the meantime,
> > smp_text_poke_int3_handler() relies on getting an accurate RIP.
> >
> > This temporary int3 from smp_text_poke is triggering the exact scenario
> > described in the fixes commit: "if the guest executes INTn (no KVM
> > injection), an exit occurs while vectoring the INTn, and the guest
> > modifies the code stream while the exit is being handled, KVM will
> > compute the incorrect next_rip due to "skipping" the wrong instruction."
> >
> > I'm able to reproduce this almost instantly by patching the guest kernel
> > to hammer on a static branch [1] while a drgn script [2] on the host
> > constantly swaps out the memory containing the guest's Task State
> > Segment.
>
> I would actually just say "TSS". Normally I'm all for spelling out acronyms on
> their first use, but in this case I think TSS is more obviously a "thing" than
> Task State Segment, e.g. I had a momentary brain fart and was wondering what you
> meant by swapping out a segment :-).
>
> > The fixes commit also suggests a workaround: "A future enhancement to
> > make this less awful would be for KVM to detect that the decoded
> > instruction is not the correct INTn and drop the to-be-injected soft
> > event." This implements that workaround.
>
> Please focus on the actual change. Again, I love the detail, but that was a _lot_
> to get through just to see "Sorry, but the princess is in another castle." :-D
Fair enough.
> E.g. I think this captures everything in about the same word count, but with a
> stronger focus on what the patch is actually doing.
>
> When re-injecting an soft interrupt from an INT3, INT0, or (select) INTn
> instruction, discard the exception and retry the instruction if the code
> stream is changed (e.g. by a different vCPU) between when the CPU executes
> the instruction and when KVM decodes the instruction to get the next RIP.
>
> As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject
> INT3/INTO instead of retrying the instruction"), failure to verify the
> correct INTn instruction was decoded can effectively clobber guest state
> due to decoding the wrong instruction and thus specifying the wrong next
> RIP.
>
> The bug most often manifests as "Oops: int3" panics on static branch
> checks when running Linux guests. Linux's static branch patching uses
> he kernel's "text poke" code patching mechanism. To modify code while
> other CPUs may be executing said code, Linux (temporarily) replaces the
> first byte of the original instruction with an int3 (opcode 0xcc), then
> patches in the new code stream except for the first byte, and finally
> replaces the int3 the first byte of the new code stream. If a CPU hits
> the int3, i.e. executes the code while it's being modified, the guest
> kernel will gracefully handle the resulting #BP, e.g. by looping until
> the int3 is replaced, by emulating the new instruction, etc.
>
> E.g. the bug reproduces almost instantly by hacking the guest kernel to
> provide a convenient static branch[1], while running a drgn script[2] on
> the host to constantly swap out the memory containing the guest's TSS.
>
> > [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
> > [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
> >
> > Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
>
> Cc: stable@vger.kernel.org
>
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > Based on Linus's current tree.
> >
> > arch/x86/kvm/svm/svm.c | 40 ++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 36 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 153c12dbf3eb..4d72c308b50b 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -271,8 +271,29 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> >
> > }
> >
> > +static bool emulated_instruction_matches_vector(struct kvm_vcpu *vcpu,
> > + unsigned int vector)
> > +{
> > + switch (vector) {
> > + case BP_VECTOR:
> > + return vcpu->arch.emulate_ctxt->b == 0xcc ||
> > + (vcpu->arch.emulate_ctxt->b == 0xcd &&
> > + vcpu->arch.emulate_ctxt->src.val == BP_VECTOR);
> > + case OF_VECTOR:
> > + return vcpu->arch.emulate_ctxt->b == 0xce;
>
> Hmm, unless I'm missing something, this should handle 0xcd for both.
Yup, I forgot about int 4.
> > + default:
> > + return false;
> > + }
> > +}
> > +
> > +/*
> > + * If vector != 0, then this skips the instruction only if the instruction could
> > + * generate an interrupt with that vector. If not, then it fails, indicating
> > + * that the instruction should be retried.
> > + */
> > static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> > - bool commit_side_effects)
> > + bool commit_side_effects,
> > + unsigned int vector)
> > {
> > struct vcpu_svm *svm = to_svm(vcpu);
> > unsigned long old_rflags;
> > @@ -293,8 +314,18 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> > if (unlikely(!commit_side_effects))
> > old_rflags = svm->vmcb->save.rflags;
> >
> > - if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> > + if (vector == 0) {
> > + if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> > + return 0;
> > + } else if (x86_decode_emulated_instruction(vcpu, EMULTYPE_SKIP,
> > + NULL,
> > + 0) != EMULATION_OK ||
>
> I think I would rather handle this in kvm_emulate_instruction(), not in the SVM
> code. x86_decode_emulated_instruction() really shouldn't be exposed outside of
> x86.c, the use in gp_interception() is all kinds of gross. :-/
>
> The best idea I can come up with is to add an EMULTYPE_SKIP_SOFT_INT, and use that
> to communicate to the kvm_emulate_instruction() that it should verify the
> to-be-skipped instruction. I don't love the implicit passing of the vector via
> vcpu->arch.exception.vector, but all of this code is quite heinous, so I won't
> loose sleep over it.
>
> Compile-tested only, but something like this?
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2bfae1cfa514..eca4704e3934 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2154,6 +2154,7 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
> #define EMULTYPE_PF (1 << 6)
> #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
> #define EMULTYPE_WRITE_PF_TO_SP (1 << 8)
> +#define EMULTYPE_SKIP_SOFT_INT ((1 << 9) | EMULTYPE_SKIP)
>
> static inline bool kvm_can_emulate_event_vectoring(int emul_type)
> {
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index f14709a511aa..82e97f2f8635 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -272,6 +272,7 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> }
>
> static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> + int emul_type,
> bool commit_side_effects)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> @@ -293,7 +294,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> if (unlikely(!commit_side_effects))
> old_rflags = svm->vmcb->save.rflags;
>
> - if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
> + if (!kvm_emulate_instruction(vcpu, emul_type))
> return 0;
>
> if (unlikely(!commit_side_effects))
> @@ -311,7 +312,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
>
> static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> {
> - return __svm_skip_emulated_instruction(vcpu, true);
> + return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true);
> }
>
> static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
> @@ -331,7 +332,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
> * in use, the skip must not commit any side effects such as clearing
> * the interrupt shadow or RFLAGS.RF.
> */
> - if (!__svm_skip_emulated_instruction(vcpu, !nrips))
> + if (!__svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP_SOFT_INT, !nrips))
> return -EIO;
>
> rip = kvm_rip_read(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 593fccc9cf1c..500f9b7f564e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9351,6 +9351,23 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
> return false;
> }
>
> +static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt, u8 vector)
> +{
> + if (WARN_ON_ONCE(vector != BP_VECTOR && vector != OF_VECTOR))
> + return false;
This warning triggers when called from the svm_inject_irq() path since
that case uses vcpu->arch.interrupt.nr. I can't tell whether it'd be
okay to set vcpu->arch.exception.vector = vcpu->arch.interrupt.nr in
that case, or if we need yet another emultype. (I believe my patch had
the same problem, but silently.)
> + switch (ctxt->b) {
> + case 0xcc:
> + return vector == BP_VECTOR;
> + case 0xcd:
> + return vector == ctxt->src.val;
> + case 0xce:
> + return vector == OF_VECTOR;
> + default:
> + return false;
> + }
> +}
> +
> /*
> * Decode an instruction for emulation. The caller is responsible for handling
> * code breakpoints. Note, manually detecting code breakpoints is unnecessary
> @@ -9461,6 +9478,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> * injecting single-step #DBs.
> */
> if (emulation_type & EMULTYPE_SKIP) {
> + if ((emulation_type & EMULTYPE_SKIP_SOFT_INT) &&
This needs to be (emulation_type & EMULTYPE_SKIP_SOFT_INT) == EMULTYPE_SKIP_SOFT_INT
(either that or making EMULTYPE_SKIP_SOFT_INT not include
EMULTYPE_SKIP).
Did you intend to send this out as a patch, or should I send a v2 based
on this?
Thanks,
Omar
> + !is_soft_int_instruction(ctxt, vcpu->arch.exception.vector))
> + return 0;
> +
> if (ctxt->mode != X86EMUL_MODE_PROT64)
> ctxt->eip = (u32)ctxt->_eip;
> else
next prev parent reply other threads:[~2025-10-28 6:44 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-27 23:06 [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced Omar Sandoval
2025-10-28 0:24 ` Sean Christopherson
2025-10-28 6:44 ` Omar Sandoval [this message]
2025-10-29 0:18 ` Sean Christopherson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aQBmME40vhf-lh7R@telecaster \
--to=osandov@osandov.com \
--cc=gourry@gourry.net \
--cc=kernel-team@fb.com \
--cc=kvm@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox