From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pf1-f176.google.com (mail-pf1-f176.google.com [209.85.210.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 206AA1C5D46 for ; Tue, 28 Oct 2025 06:44:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.176 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761633845; cv=none; b=uu1lGNSuRhICgB1MqWd42vVM2lj+/E++JCgeTcbX4wZMlNjOp2paZStYl5OA/VO5tU/MQrCZoExEIpLjSf+6vAvXADlmf288SbkZRacCKu7FKfQbf38IFUR54BbEI3wJmhVLYPzQbVWbqMz8EKCYTz4TgFBs8WkrV0yyTZrLwlM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761633845; c=relaxed/simple; bh=zDF/Y9XBJA9RVlnGEARbuvgklIvOOmTItV9BhF1FT8c=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=jV3bvtq5UOjKJ+BK8WiBYxeLLN9h1TiMj/uP+/d41gdqFpzJJaXAgQE7Hsd1HI9M9dqit11msp1rAl99G6HhSY9avLFZEPLwjviILbFbweUCbSuaA2b5fFmFuGSU0Fjf9d3uZlv3teZ5hmjMtTnaew20s7JG7lXvYypYRWoojbo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=osandov.com; spf=none smtp.mailfrom=osandov.com; dkim=pass (2048-bit key) header.d=osandov-com.20230601.gappssmtp.com header.i=@osandov-com.20230601.gappssmtp.com header.b=NPjAbHiG; arc=none smtp.client-ip=209.85.210.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=osandov.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=osandov.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20230601.gappssmtp.com header.i=@osandov-com.20230601.gappssmtp.com header.b="NPjAbHiG" Received: by mail-pf1-f176.google.com with SMTP id d2e1a72fcca58-7a27f2469b8so427317b3a.0 for ; Mon, 27 Oct 2025 23:44:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20230601.gappssmtp.com; s=20230601; t=1761633842; x=1762238642; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=dukmnIaQsneJf8PdY7kP8xyXxID9jl8eMskGfgfqgQ4=; b=NPjAbHiG3NliJ1pwL1waMAAf8Ro40fW8kA9y/B2mrZzD5Ac4d9gMsO123u2kPHOHWl kGqmfoE936fFrNCdnh1Bj6P4I1Tjj8IfrzmuOUdqptZqkM7Fgk0m2cL+F36mGJnSCaAe nwvAGGKhruyWConbo3FTOQNwSCDH9kzPa/HNMpEeq2aaxG4P3f4qUw+xqMOGaeOVQaQI um/cceWMXJtXQ/+DILUQWs/knKTa9YRj5b3SeHBMDuWLc+hwhCdgqSsvA3fZNqM/bw30 tKPtH/URvBCpfkZqqAYIvve+AT2GS+N0NiljtxTo5w0+qaqsxNkZJbe1yBdz5ulYfPHu wcog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761633842; x=1762238642; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=dukmnIaQsneJf8PdY7kP8xyXxID9jl8eMskGfgfqgQ4=; b=F3LAhDPrBAOwxl3JTRuktKTy6nnxXy8DreGmx9At03ZekyHBc1DAo15VYbNQS19BEy Y3hESphg0OAO75/l67VEVHLsJSFLPF8aDmwODdzpCEPropSQBS2ZifIlp/SSRajEiZ4T Ns9LM7OpypGjQHwUG9njqFIBaaF7Z2JpFX5VH2lyjCI2ISooFKfLtSTszZbJ55qEjYQn tziaLz6JO6/PxXWuyFJbFFcP1GKaeH8uwcyak6hmdsJAP12N/viImdABLwIeYvP4JibL FTLxTqjTC20zfFmVWTfUwf0o0TCGykfq4X9rzyIpZduez1GPNCLVrfyVr4GRBS3iNeNJ X0Gg== X-Forwarded-Encrypted: i=1; AJvYcCWvDnf9oKvHNa4RyRGIncBG7lUBq7dN5Cm/fx5UMmvFxTb5sasRutNWwJwWUWmm8oVH+pQ=@vger.kernel.org X-Gm-Message-State: AOJu0YzBRSshY0+KYRXCb/ZW+oJhDTDqzNHPgiVhujNXD2JG8X2jYoQd sHamiSiFh86CPmngGjpAgyqDSSjSMXBfZhkMAqLsN9kqk3oFbNIxCz1cqS5z/uGGCx0= X-Gm-Gg: ASbGnctoiWfOjQaDye9ldUS1eArODI0Aq+43xH9Mawry2Zl7PRRWAtLZDSzmiLWYzp8 Au+3wzXgIA6PT7eR6tsCiAdQOdnfo4TrI7Blc/89WnGQtpVTIQaqmQ9pUfxBQL7O40SWZ4BFrhB 3ixLbN3y+8bo+PE6npK10WV9XH+ZJkS2qoM2g/erFRFLrrJW8DfmVpFMReEeassWJ2inGMAxk6R L6Fh80Nox3rlCR2fAa+6JzPjHHa7hhKumthpTH5dplacVrX8afQ6HGh7RGsr4GyNKtRv2LEf1BZ A/S9ezrUkq/LF1HMwjw8JKM5G677emrWqk0+miJ1SjY5g+8W37g3APv4UpvL1xulqQTTWat3d03 z/w12hIEigZ4z0mDGDp6RVyMR9JUfbHYnyXCp+6a64H3EBwOpy4dSxMfA9w== X-Google-Smtp-Source: AGHT+IH9dLRnTDwqIBGOhTLlI8reFrNWC2bubrxKI8j9aumrWpki3wiUkYg+1I5B7yWm2IZ7ePZhZQ== X-Received: by 2002:a05:6a20:5484:b0:341:730a:ef54 with SMTP id adf61e73a8af0-344d19b7a43mr1643572637.1.1761633842225; Mon, 27 Oct 2025 23:44:02 -0700 (PDT) Received: from telecaster ([2620:10d:c090:400::5:1183]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-7a41404987esm10175301b3a.36.2025.10.27.23.44.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 27 Oct 2025 23:44:01 -0700 (PDT) Date: Mon, 27 Oct 2025 23:44:00 -0700 From: Omar Sandoval To: Sean Christopherson Cc: Paolo Bonzini , kvm@vger.kernel.org, Gregory Price , kernel-team@fb.com Subject: Re: [PATCH] KVM: SVM: Don't skip unrelated instruction if INT3 is replaced Message-ID: References: <71043b76fc073af0fb27493a8e8d7f38c3c782c0.1761606191.git.osandov@fb.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Mon, Oct 27, 2025 at 05:24:47PM -0700, Sean Christopherson wrote: > "INT3/INTO", because the code handles both. > > On Mon, Oct 27, 2025, Omar Sandoval wrote: > > From: Omar Sandoval > > > > We've been seeing guest VM kernel panics with "Oops: int3" coming from > > No pronouns please, "we" relies too much on context and that context can be lost > over time, or might be interpreted differently by readers, etc. > > > static branch checks. I found that the reported RIP is one instruction > > after where it's supposed to be, and I determined that when this > > happens, the RIP was injected by __svm_skip_emulated_instruction() on > > the host when a nested page fault was hit while vectoring an int3. > > I definitely appreciate the gory details, but please capture just the technical > details of the bug and fix. A play-by-play of the debugging process is useful > (and interesting!) for bug reports and cover letters, but it's mostly noise for > future readers that usually care about what was changed and/or what the code is > doing. > > > Static branches use the smp_text_poke mechanism, which works by > > temporarily inserting an int3, then overwriting it. In the meantime, > > smp_text_poke_int3_handler() relies on getting an accurate RIP. > > > > This temporary int3 from smp_text_poke is triggering the exact scenario > > described in the fixes commit: "if the guest executes INTn (no KVM > > injection), an exit occurs while vectoring the INTn, and the guest > > modifies the code stream while the exit is being handled, KVM will > > compute the incorrect next_rip due to "skipping" the wrong instruction." > > > > I'm able to reproduce this almost instantly by patching the guest kernel > > to hammer on a static branch [1] while a drgn script [2] on the host > > constantly swaps out the memory containing the guest's Task State > > Segment. > > I would actually just say "TSS". Normally I'm all for spelling out acronyms on > their first use, but in this case I think TSS is more obviously a "thing" than > Task State Segment, e.g. I had a momentary brain fart and was wondering what you > meant by swapping out a segment :-). > > > The fixes commit also suggests a workaround: "A future enhancement to > > make this less awful would be for KVM to detect that the decoded > > instruction is not the correct INTn and drop the to-be-injected soft > > event." This implements that workaround. > > Please focus on the actual change. Again, I love the detail, but that was a _lot_ > to get through just to see "Sorry, but the princess is in another castle." :-D Fair enough. > E.g. I think this captures everything in about the same word count, but with a > stronger focus on what the patch is actually doing. > > When re-injecting an soft interrupt from an INT3, INT0, or (select) INTn > instruction, discard the exception and retry the instruction if the code > stream is changed (e.g. by a different vCPU) between when the CPU executes > the instruction and when KVM decodes the instruction to get the next RIP. > > As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject > INT3/INTO instead of retrying the instruction"), failure to verify the > correct INTn instruction was decoded can effectively clobber guest state > due to decoding the wrong instruction and thus specifying the wrong next > RIP. > > The bug most often manifests as "Oops: int3" panics on static branch > checks when running Linux guests. Linux's static branch patching uses > he kernel's "text poke" code patching mechanism. To modify code while > other CPUs may be executing said code, Linux (temporarily) replaces the > first byte of the original instruction with an int3 (opcode 0xcc), then > patches in the new code stream except for the first byte, and finally > replaces the int3 the first byte of the new code stream. If a CPU hits > the int3, i.e. executes the code while it's being modified, the guest > kernel will gracefully handle the resulting #BP, e.g. by looping until > the int3 is replaced, by emulating the new instruction, etc. > > E.g. the bug reproduces almost instantly by hacking the guest kernel to > provide a convenient static branch[1], while running a drgn script[2] on > the host to constantly swap out the memory containing the guest's TSS. > > > [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a > > [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b > > > > Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") > > Cc: stable@vger.kernel.org > > > Signed-off-by: Omar Sandoval > > --- > > Based on Linus's current tree. > > > > arch/x86/kvm/svm/svm.c | 40 ++++++++++++++++++++++++++++++++++++---- > > 1 file changed, 36 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > > index 153c12dbf3eb..4d72c308b50b 100644 > > --- a/arch/x86/kvm/svm/svm.c > > +++ b/arch/x86/kvm/svm/svm.c > > @@ -271,8 +271,29 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) > > > > } > > > > +static bool emulated_instruction_matches_vector(struct kvm_vcpu *vcpu, > > + unsigned int vector) > > +{ > > + switch (vector) { > > + case BP_VECTOR: > > + return vcpu->arch.emulate_ctxt->b == 0xcc || > > + (vcpu->arch.emulate_ctxt->b == 0xcd && > > + vcpu->arch.emulate_ctxt->src.val == BP_VECTOR); > > + case OF_VECTOR: > > + return vcpu->arch.emulate_ctxt->b == 0xce; > > Hmm, unless I'm missing something, this should handle 0xcd for both. Yup, I forgot about int 4. > > + default: > > + return false; > > + } > > +} > > + > > +/* > > + * If vector != 0, then this skips the instruction only if the instruction could > > + * generate an interrupt with that vector. If not, then it fails, indicating > > + * that the instruction should be retried. > > + */ > > static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu, > > - bool commit_side_effects) > > + bool commit_side_effects, > > + unsigned int vector) > > { > > struct vcpu_svm *svm = to_svm(vcpu); > > unsigned long old_rflags; > > @@ -293,8 +314,18 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu, > > if (unlikely(!commit_side_effects)) > > old_rflags = svm->vmcb->save.rflags; > > > > - if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP)) > > + if (vector == 0) { > > + if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP)) > > + return 0; > > + } else if (x86_decode_emulated_instruction(vcpu, EMULTYPE_SKIP, > > + NULL, > > + 0) != EMULATION_OK || > > I think I would rather handle this in kvm_emulate_instruction(), not in the SVM > code. x86_decode_emulated_instruction() really shouldn't be exposed outside of > x86.c, the use in gp_interception() is all kinds of gross. :-/ > > The best idea I can come up with is to add an EMULTYPE_SKIP_SOFT_INT, and use that > to communicate to the kvm_emulate_instruction() that it should verify the > to-be-skipped instruction. I don't love the implicit passing of the vector via > vcpu->arch.exception.vector, but all of this code is quite heinous, so I won't > loose sleep over it. > > Compile-tested only, but something like this? > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 2bfae1cfa514..eca4704e3934 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -2154,6 +2154,7 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu); > #define EMULTYPE_PF (1 << 6) > #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7) > #define EMULTYPE_WRITE_PF_TO_SP (1 << 8) > +#define EMULTYPE_SKIP_SOFT_INT ((1 << 9) | EMULTYPE_SKIP) > > static inline bool kvm_can_emulate_event_vectoring(int emul_type) > { > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index f14709a511aa..82e97f2f8635 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -272,6 +272,7 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) > } > > static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu, > + int emul_type, > bool commit_side_effects) > { > struct vcpu_svm *svm = to_svm(vcpu); > @@ -293,7 +294,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu, > if (unlikely(!commit_side_effects)) > old_rflags = svm->vmcb->save.rflags; > > - if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP)) > + if (!kvm_emulate_instruction(vcpu, emul_type)) > return 0; > > if (unlikely(!commit_side_effects)) > @@ -311,7 +312,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu, > > static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu) > { > - return __svm_skip_emulated_instruction(vcpu, true); > + return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true); > } > > static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu) > @@ -331,7 +332,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu) > * in use, the skip must not commit any side effects such as clearing > * the interrupt shadow or RFLAGS.RF. > */ > - if (!__svm_skip_emulated_instruction(vcpu, !nrips)) > + if (!__svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP_SOFT_INT, !nrips)) > return -EIO; > > rip = kvm_rip_read(vcpu); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 593fccc9cf1c..500f9b7f564e 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -9351,6 +9351,23 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt) > return false; > } > > +static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt, u8 vector) > +{ > + if (WARN_ON_ONCE(vector != BP_VECTOR && vector != OF_VECTOR)) > + return false; This warning triggers when called from the svm_inject_irq() path since that case uses vcpu->arch.interrupt.nr. I can't tell whether it'd be okay to set vcpu->arch.exception.vector = vcpu->arch.interrupt.nr in that case, or if we need yet another emultype. (I believe my patch had the same problem, but silently.) > + switch (ctxt->b) { > + case 0xcc: > + return vector == BP_VECTOR; > + case 0xcd: > + return vector == ctxt->src.val; > + case 0xce: > + return vector == OF_VECTOR; > + default: > + return false; > + } > +} > + > /* > * Decode an instruction for emulation. The caller is responsible for handling > * code breakpoints. Note, manually detecting code breakpoints is unnecessary > @@ -9461,6 +9478,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > * injecting single-step #DBs. > */ > if (emulation_type & EMULTYPE_SKIP) { > + if ((emulation_type & EMULTYPE_SKIP_SOFT_INT) && This needs to be (emulation_type & EMULTYPE_SKIP_SOFT_INT) == EMULTYPE_SKIP_SOFT_INT (either that or making EMULTYPE_SKIP_SOFT_INT not include EMULTYPE_SKIP). Did you intend to send this out as a patch, or should I send a v2 based on this? Thanks, Omar > + !is_soft_int_instruction(ctxt, vcpu->arch.exception.vector)) > + return 0; > + > if (ctxt->mode != X86EMUL_MODE_PROT64) > ctxt->eip = (u32)ctxt->_eip; > else