Date: Fri, 27 Feb 2026 08:06:21 -0800
In-Reply-To: <20260227011306.3111731-4-yosry@kernel.org>
References: <20260227011306.3111731-1-yosry@kernel.org> <20260227011306.3111731-4-yosry@kernel.org>
Subject: Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
From: Sean Christopherson
To: Yosry Ahmed
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> On KVM_SET_GUEST_DEBUG, if a #DB or #BP is injected with
> KVM_GUESTDBG_INJECT_DB or KVM_GUESTDBG_INJECT_BP, KVM fails with -EBUSY
> if there is an existing pending exception. This was introduced in
> commit 4f926bf29186 ("KVM: x86: Polish exception injection via
> KVM_SET_GUEST_DEBUG") to avoid a warning in kvm_queue_exception(),
> presumably to avoid overriding a pending exception.
> 
> This added another (arguably nice) property, if there's a pending
> exception, KVM_SET_GUEST_DEBUG cannot cause a #DF or triple fault.
> However, if an exception is injected, KVM_SET_GUEST_DEBUG will cause
> a #DF or triple fault in the guest, as kvm_multiple_exception() combines
> them.

First off, this patch looks good irrespective of nested crud. Disallowing injection of #DB/#BP while there's already an injected exception aligns with architectural behavior; KVM needs to finish delivering the exception, and thus "complete" the instruction, before queueing a new exception.

As for nested, I _was_ going to say that, assuming the original motivation for this patch is to avoid triggering a nested VM-Exit while nested_run_pending=1, this is incomplete. E.g. if the #BP or #DB itself is being intercepted by L1, then queueing the exception will incorrectly deliver the nested VM-Exit before nested VM-Enter completes. I.e. I _thought_ we would also need:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..fdbd272027ed 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12526,6 +12526,9 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 	if (vcpu->arch.guest_state_protected)
 		return -EINVAL;
 
+	if (vcpu->arch.nested_run_pending)
+		return -EBUSY;
+
 	vcpu_load(vcpu);
 
 	if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {

But that isn't actually the case, because {svm,vmx}_check_nested_events() block exceptions and VM-Exits while nested_run_pending=1.

Off-list, I had rejected modifying nested_vmx_triple_fault() to drop the triple fault like so:

@@ -5191,6 +5191,9 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 
 static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
 {
+	if (to_vmx(vcpu)->nested.nested_run_pending)
+		return;
+
 	kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
 	nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
 }

because "That would largely defeat the entire purpose of the WARN. The whole point is to detect bugs where KVM synthesizes a VM-Exit before completing nested VM-Enter. It should be impossible for nested_{vmx,svm}_triple_fault() to be called with nested_run_pending=true."
Given the above, that argument doesn't hold up at first glance, because blocking the triple fault if nested_run_pending=1 would be consistent with how other "synchronous" events are handled by nVMX and nSVM. But after staring at this a bit, unless I'm forgetting something (entirely possible), I actually think {svm,vmx}_check_nested_events() are wrong, because I stand behind my quoted statement: it should be impossible for KVM to synthesize a synchronous event before completing nested VM-Enter.

Blocking pending and injected exceptions was added by bfcf83b1444d ("KVM: nVMX: Fix trying to cancel vmlauch/vmresume"), and unfortunately neither the changelog nor Lore[1] provides any details as to what exactly prompted the fix. I _suspect_ it was either syzkaller induced, or the manifestation of other bugs in KVM's exception handling.

Finally getting to the point: in theory, I _think_ KVM should actually WARN on all cases where it temporarily blocks a synchronous event due to a pending nested VM-Enter (see the below diff for the basic gist). However, that would open the floodgates to syzkaller, because KVM_SET_VCPU_EVENTS can obviously stuff events, KVM_X86_SET_MCE can queue a #MC, etc.

I _think_ we might be able to get away with rejecting KVM_SET_VCPU_EVENTS if nested_run_pending=1 without breaking userspace? Google's VMM is insane and does KVM_SET_VCPU_EVENTS before KVM_SET_NESTED_STATE, but in real usage, i.e. outside of selftests, (I hope) no VMM will restore into a "live" vCPU.

So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS, KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* following up with the below WARN-spree, or (b) adding a separate flag, e.g. nested_run_in_progress or so, that is set with nested_run_pending but cleared on an exit to userspace, and then WARNing on _that_, so that we can detect KVM bugs (the whole point of the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.

I think I'm leaning toward (b)? Except for KVM_SET_GUEST_DEBUG, where userspace is trying to interpose on the guest, restricting ioctls doesn't really add any value in practice. Yeah, in theory it could _maybe_ prevent userspace from shooting itself in the foot, but practically speaking, if userspace is restoring state into a vCPU with nested_run_pending=1, it's either playing on expert mode or is already completely broken.

My only hesitation with (b) is that KVM wouldn't be entirely consistent, since vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did something stupid with nested_run_pending=1" case. So from that perspective, part of me wants to get greedy and try for (a).
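To make (b) a bit more concrete, the rough shape I have in mind is below. Completely untested sketch; the flag name and the exact set/clear sites are just placeholders, and note that KVM_SET_NESTED_STATE deliberately would NOT set the new flag, i.e. only a KVM-initiated VM-Enter arms the WARN:

	/*
	 * Set alongside nested_run_pending when KVM itself emulates
	 * VMLAUNCH/VMRUN, e.g. in nested_vmx_run(), but not when userspace
	 * restores nested state.
	 */
	vmx->nested.nested_run_pending = 1;
	vmx->nested.nested_run_in_progress = 1;

	/*
	 * Cleared whenever the vCPU exits to userspace, in addition to the
	 * existing clearing of nested_run_pending after a successful VM-Enter.
	 */
	vmx->nested.nested_run_in_progress = 0;

	/*
	 * And then {svm,vmx}_check_nested_events() WARN on the new flag, so
	 * the splat fires only on KVM bugs, not on userspace stuffing events
	 * between KVM_RUN calls, e.g.:
	 */
	if (vcpu->arch.exception.pending) {
		if (block_nested_exceptions) {
			WARN_ON_ONCE(to_vmx(vcpu)->nested.nested_run_in_progress);
			return -EBUSY;
		}
		goto no_vmexit;
	}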
[1] https://lore.kernel.org/all/?q=%22KVM:%20nVMX:%20Fix%20trying%20to%20cancel%20vmlauch%22

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd..d624e4db704a 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1605,14 +1605,14 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
 	}
 
 	if (vcpu->arch.exception_vmexit.pending) {
-		if (block_nested_exceptions)
+		if (WARN_ON_ONCE(block_nested_exceptions))
 			return -EBUSY;
 		nested_svm_inject_exception_vmexit(vcpu);
 		return 0;
 	}
 
 	if (vcpu->arch.exception.pending) {
-		if (block_nested_exceptions)
+		if (WARN_ON_ONCE(block_nested_exceptions))
 			return -EBUSY;
 		return 0;
 	}
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 248635da6766..a223c5e86188 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4336,7 +4336,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	 */
 	if (vcpu->arch.exception_vmexit.pending &&
 	    !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
-		if (block_nested_exceptions)
+		if (WARN_ON_ONCE(block_nested_exceptions))
 			return -EBUSY;
 
 		nested_vmx_inject_exception_vmexit(vcpu);
@@ -4345,13 +4345,13 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 
 	if (vcpu->arch.exception.pending &&
 	    !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
-		if (block_nested_exceptions)
+		if (WARN_ON_ONCE(block_nested_exceptions))
 			return -EBUSY;
 		goto no_vmexit;
 	}
 
 	if (vmx->nested.mtf_pending) {
-		if (block_nested_events)
+		if (WARN_ON_ONCE(block_nested_events))
 			return -EBUSY;
 		nested_vmx_update_pending_dbg(vcpu);
 		nested_vmx_vmexit(vcpu, EXIT_REASON_MONITOR_TRAP_FLAG, 0, 0);
@@ -4359,7 +4359,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	}
 
 	if (vcpu->arch.exception_vmexit.pending) {
-		if (block_nested_exceptions)
+		if (WARN_ON_ONCE(block_nested_exceptions))
 			return -EBUSY;
 
 		nested_vmx_inject_exception_vmexit(vcpu);
@@ -4367,7 +4367,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	}
 
 	if (vcpu->arch.exception.pending) {
-		if (block_nested_exceptions)
+		if (WARN_ON_ONCE(block_nested_exceptions))
 			return -EBUSY;
 		goto no_vmexit;
 	}
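And for completeness, the ioctl side of (a) would just be more of the same -EBUSY checks as the KVM_SET_GUEST_DEBUG snippet above, e.g. something like the below (again untested, and riffing on the same vcpu->arch.nested_run_pending lookup as that snippet, which is itself only a sketch):

static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
					      struct kvm_vcpu_events *events)
{
	/* Don't let userspace stuff events while a nested VM-Enter is incomplete. */
	if (vcpu->arch.nested_run_pending)
		return -EBUSY;
	...

with an equivalent check in kvm_vcpu_ioctl_x86_set_mce() to cover KVM_X86_SET_MCE.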