From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f74.google.com (mail-pj1-f74.google.com [209.85.216.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 915FC31B123 for ; Tue, 3 Feb 2026 18:03:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.74 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770141820; cv=none; b=BWb9HvSMviLX4PazA1nERqHyz2TYCkE6MddvOhJZ5YTIH+LsGfmETIJ4hMOpYr3+/hdRchhtN3kEkVP1wGmMyeLgn32PDirdPV7Ea6B4eTeQU6MQfnEDM6MKCzPBPtLXWjhdAjhVppMA5vS+RuOLFndO3VqSEpwPofJ/RsTEnLs= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770141820; c=relaxed/simple; bh=pM2rS/12lrlG+2AZ8m3xfZwPCId4Ey85d1MGnj91250=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=axE3bmC9ol2FdsIeZ5JVqOzmlM6y2+UG3H9ipTfbsO/cQXErmIs1WMOAi/fJqUV0Vv419v+bs0omWpbV3Nb1hVtEXsJxNuD34qJcecfRBV9UwszG/O0azFK1jMngI2RRj2i0+QVDLQ8FbGrMMxqrJQIiu/Veb461H4qW+9PsAAI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=2/HkZ3/b; arc=none smtp.client-ip=209.85.216.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="2/HkZ3/b" Received: by mail-pj1-f74.google.com with SMTP id 98e67ed59e1d1-354490889b6so7387441a91.3 for ; Tue, 03 Feb 2026 10:03:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1770141817; x=1770746617; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6XX4eVdJLaAaJXlWOLoKv/kWDfHyar969ebQM03BmoQ=; b=2/HkZ3/bBkvSf9OQIzBeIooL8LYGPDMJcqK3F5zP6CSLdutZ2MtD0Xuvj3FoKKeJlb q3I98fx1UVUbqMC07t7Fm74soz5S3k8lIiu8Z7tWzwsEqlqvYcgnmUD03x8Wsm9+u59w F/1ekuUtn9PUDnHK8up221WSE8oeFNDXUbAl1qCVckUj8QMuzh9xFDpQE//ZYxEEZq5+ OSh/AuWEQl6hLsirUwA153PJuyYKJ+hN7cLM5JxCIsGDkmkkLqsdMW2s0wOY0/qvyATH Q7dr3Nb67bgQeNenqRQyn1dSCCFJOhP8jrk8BMvPxl9bJm8yC/MvaalAgu3S66zp3n+9 tXSA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1770141817; x=1770746617; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6XX4eVdJLaAaJXlWOLoKv/kWDfHyar969ebQM03BmoQ=; b=kb2BOw9jbcUigmbZzsJPcZsibID9MDLf+zg37oqqfC+7LQ45//dItYsEg6tmus2hr6 SsP6AoVwKOFxXlZoTTPyuuAQFjDWDmPZwfBt93I2SLAA7g6frjn8TUSMz5EMF3d9hWKn T4+8BxyMwLQbaf57GoGeizKS4eynoCOuuxj3MyGiNCUy/zPIapsIF7zmpqJpGaUhmSG9 pezvpAHRZwdK83M6IWdKz+u6mtgi2FVN5HmA5p2/qdfZRKATk2//tzXAM6hjZHeOizH9 kB0P+XTWXf0YR4BOkdASHSsFyk2ImSB8zTd4s86CvU/ANZP7IqFx1/QYDjnc6hXtatFV mH0g== X-Forwarded-Encrypted: i=1; AJvYcCUIgB7I/V+LWP6NSIcwssR/PBVc5bFCamZbWlLdwKpTmmpr87BO+tYWh1bm5TdB+2nV5jPiUbj8pKQQGVk=@vger.kernel.org X-Gm-Message-State: AOJu0Yz0x3Xy/7Ib4owd6v4YedVeThKHT4j975Kd5/jIK5ZiuzJ36WqS ZHVkS85uH+jdQuRU9Qr/qdkgKnr3N/rNuekYZTgyJJHaIAuNYvKt4Gx/EzctX4XEHaxohjApV+p wRjYi2A== X-Received: from pjsg5.prod.google.com ([2002:a17:90a:7145:b0:352:d19a:6739]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a17:90b:2752:b0:340:ff7d:c26 with SMTP id 98e67ed59e1d1-354870da124mr148942a91.16.1770141816880; Tue, 03 Feb 2026 10:03:36 -0800 (PST) Date: Tue, 3 Feb 2026 10:03:35 -0800 In-Reply-To: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260203011320.1314791-1-yosry.ahmed@linux.dev> Message-ID: Subject: Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT From: Sean Christopherson To: Yosry Ahmed Cc: Paolo Bonzini , kvm@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="us-ascii" On Tue, Feb 03, 2026, Yosry Ahmed wrote: > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote: > > On Tue, Feb 03, 2026, Yosry Ahmed wrote: > > /* > > * If L2 is active, defer delivery of the payload until the > > * exception is actually injected to avoid clobbering state if > > * L1 wants to intercept the exception (the architectural state > > * is NOT updated if the exeption is morphed to a VM-Exit). > > */ > > It's not only about exceptions being morphed to a VM-Exit though, is it? > KVM should not update the payload (e.g. CR2) if a #PF is pending but was > not injected, because from L1's perspective CR2 was updated but > exit_int_info won't reflect a #PF. Right? Right, but that's got nothing to do with L2 being active. Take nested completely out of the picture, and the above statement holds true as well. "If a #PF is pending but was not injected, then the guest shouldn't see a change in CR2". > > But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending > > VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a > > VM-Exit. > > > > Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that > > commit. Because invoking kvm_deliver_exception_payload() when the exception was > > morphed to a VM-Exit is wrong. Oh, wait, this is the !exception_payload_enabled > > case. So never mind, that's simply an unfixable bug, as the second comment alludes > > to. > > Hmm for the #PF case I think delivering the payload is always wrong if > it was morphed to a VM-Exit, regardless of exception_payload_enabled, > because the payload should never reach CR2, right? Right. > Spoiler alert, there's another problem there. Even if the exception did > not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the > delivered payload is lost :/ > > > > > /* > > * KVM's ABI only allows for one exception to be migrated. Luckily, > > * the only time there can be two queued exceptions is if there's a > > * non-exiting _injected_ exception, and a pending exiting exception. > > * In that case, ignore the VM-Exiting exception as it's an extension > > * of the injected exception. > > */ > > if (vcpu->arch.exception_vmexit.pending && > > !vcpu->arch.exception.pending && > > !vcpu->arch.exception.injected) > > ex = &vcpu->arch.exception_vmexit; > > else > > ex = &vcpu->arch.exception; > > > > /* > > * In guest mode, payload delivery should be deferred if the exception > > * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 > > * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, > > * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not > > * propagate the payload and so it cannot be safely deferred. Deliver > > * the payload if the capability hasn't been requested. > > */ > > if (!vcpu->kvm->arch.exception_payload_enabled && > > ex->pending && ex->has_payload) > > kvm_deliver_exception_payload(vcpu, ex); > > > > So yeah, I _think_ we could drop the is_guest_mode() check. However, even better > > would be to drop this call *entirely*, i.e. > > Hmm I don't think so, because as I mentioned above, KVM shouldn't update > CR2 until the exception is actually injected, right? Me confused. What you're saying is what I'm suggesting: don't update CR2 until KVM actually stuffs the #PF into the VMCS/VMCB. Oh, or are you talking about the first sentence? If so, strike that from the record, I was wrong (see below). > It would actually be great to drop the is_guest_mode() check here but > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS > and KVM_SET_SREGS goes away, and I *think* we can drop the > kvm_deliver_exception_payload() call in > kvm_vcpu_ioctl_x86_get_vcpu_events(). > > The only problem would be CR2 getting updated without a fault being > reflected in the vmcb12's exit_int_info AFAICT. No, that particular case is a non-issue, because the code immediately above has already verified that KVM will *not* morph the #PF to a nested VM-Exit. Note, the "queue:" path is just for non-contributory exceptions and doesn't change the VM-Exit change anyways. /* * If the exception is destined for L2, morph it to a VM-Exit if L1 * wants to intercept the exception. */ if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) { <==== kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code, has_payload, payload); return; } if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) { queue: vcpu->arch.exception.pending = true; vcpu->arch.exception.injected = false; vcpu->arch.exception.has_error_code = has_error; vcpu->arch.exception.vector = nr; vcpu->arch.exception.error_code = error_code; vcpu->arch.exception.has_payload = has_payload; vcpu->arch.exception.payload = payload; if (!is_guest_mode(vcpu)) kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception); return; } > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index b0112c515584..00a39c95631d 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, > > vcpu->arch.exception.error_code = error_code; > > vcpu->arch.exception.has_payload = has_payload; > > vcpu->arch.exception.payload = payload; > > - if (!is_guest_mode(vcpu)) > > - kvm_deliver_exception_payload(vcpu, > > - &vcpu->arch.exception); > > return; > > } > > > > Because KVM really shouldn't update CR2 until the excpetion is actually injected > > (or the state is at risk of being lost because exception_payload_enabled==false). > > E.g. in the (extremely) unlikely (and probably impossible to configure reliably) > > scenario that userspace deliberately drops a pending exception, arch state shouldn't > > be updated. > > I think if we drop it there might be a problem. With > exception_payload_enabled==false, pending exceptions becomes injected > after save/restore. If the payload is not delivered here, then after > restore we have an injected event with no payload. > > I guess the kvm_deliver_exception_payload() call in > kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it > only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS. > Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT. Drat, QEMU does KVM_GET_VCPU_EVENTS before KVM_GET_SREGS{,2}, so I was hopeful we wouldn't actually break anyone, but Firecracker at least gets sregs before events. Of course Firecracker is already broken due to not enabling KVM_CAP_EXCEPTION_PAYLOAD... Ugh, and it's not just KVM_GET_SREGS, it's also KVM_GET_DEBUGREGS thanks to DR6. If we're being _super_ pedantic, then delivering the payload anywhere but injection or KVM_GET_VCPU_EVENTS is "wrong" (well, "more wrong", since any behavior without KVM_CAP_EXCEPTION_PAYLOAD is inherently wrong). E.g. very hypothetically, userspace that saves/restores sregs but not vCPU events would see an unexpected CR2 change. *sigh* Ewwwwwwww. And we *definitely* don't want to drop the is_guest_mode() check, because nested_vmx_update_pending_dbg() relies on the payload to still be valid. Ah, right, and that's also why commit a06230b62b89 ("KVM: x86: Deliver exception payload on KVM_GET_VCPU_EVENTS") from MTF series deferred the update: KVM needs to keep the #DB pending so that a VM-Exit that occurs before the #DB is injected gets recorded in vmcs12. So, with all of that in mind, I believe the best we can do is fully defer delivery of the exception until it's actually injected, and then apply the quirk to the relevant GET APIs. --- arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++----------------- 1 file changed, 39 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b0112c515584..e000521dfc8b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, vcpu->arch.exception.error_code = error_code; vcpu->arch.exception.has_payload = has_payload; vcpu->arch.exception.payload = payload; - if (!is_guest_mode(vcpu)) - kvm_deliver_exception_payload(vcpu, - &vcpu->arch.exception); return; } @@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu, return 0; } -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, - struct kvm_vcpu_events *events) +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu) { - struct kvm_queued_exception *ex; - - process_nmi(vcpu); - -#ifdef CONFIG_KVM_SMM - if (kvm_check_request(KVM_REQ_SMI, vcpu)) - process_smi(vcpu); -#endif - /* * KVM's ABI only allows for one exception to be migrated. Luckily, * the only time there can be two queued exceptions is if there's a @@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, if (vcpu->arch.exception_vmexit.pending && !vcpu->arch.exception.pending && !vcpu->arch.exception.injected) - ex = &vcpu->arch.exception_vmexit; - else - ex = &vcpu->arch.exception; + return &vcpu->arch.exception_vmexit; + + return &vcpu->arch.exception; +} + +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu) +{ + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); /* - * In guest mode, payload delivery should be deferred if the exception - * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 - * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, - * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not - * propagate the payload and so it cannot be safely deferred. Deliver - * the payload if the capability hasn't been requested. + * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver + * the pending exception payload when userspace saves *any* vCPU state + * that interacts with exception payloads to avoid breaking userspace. + * + * Architecturally, KVM must not deliver an exception payload until the + * exception is actually injected, e.g. to avoid losing pending #DB + * information (which VMX tracks in the VMCS), and to avoid clobbering + * state if the exception is never injected for whatever reason. But + * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or + * may not propagate the payload across save+restore, and so KVM can't + * safely defer delivery of the payload. */ if (!vcpu->kvm->arch.exception_payload_enabled && ex->pending && ex->has_payload) kvm_deliver_exception_payload(vcpu, ex); +} + +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, + struct kvm_vcpu_events *events) +{ + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); + + process_nmi(vcpu); + +#ifdef CONFIG_KVM_SMM + if (kvm_check_request(KVM_REQ_SMI, vcpu)) + process_smi(vcpu); +#endif + + kvm_handle_exception_payload_quirk(vcpu); memset(events, 0, sizeof(*events)); @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, vcpu->arch.guest_state_protected) return -EINVAL; + kvm_handle_exception_payload_quirk(vcpu); + memset(dbgregs, 0, sizeof(*dbgregs)); BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); @@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) if (vcpu->arch.guest_state_protected) goto skip_protected_regs; + kvm_handle_exception_payload_quirk(vcpu); + kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS); kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS); kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES); base-commit: 55671237401edd1ec59276b852b9361cc170915b --