* [PATCH 1/2] KVM: nVMX: Move vTPR vs. TPR Threshold consistency check into "normal" checks
2026-06-03 22:34 [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3 Sean Christopherson
@ 2026-06-03 22:34 ` Sean Christopherson
2026-06-03 22:34 ` [PATCH 2/2] KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is disabled Sean Christopherson
2026-06-04 6:23 ` [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3 Jim Mattson
2 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2026-06-03 22:34 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson
Move the off-by-default consistency check for vmcs12.tpr_threshold vs.
the virtual APIC vTPR into the "normal" controls checks, as waiting until
KVM has loaded some amount of state is unnecessary and actively dangerous.
Specifically, failure to unwind vmcs01.GUEST_CR3 to KVM's value when EPT
is disabled results in KVM running L1 with an L1-controlled CR3, not with
KVM's CR3!
Alternatively, KVM could simply reset the MMU to force a reload of
vmcs01.GUEST_CR3, but the _only_ reason the check was shoved into a "late"
flow was to wait until the vmcs12 pages were retrieved. Rather than build
up more crusty code, simply access vTPR using a regular guest memory access
(performance isn't a concern). To circumvent the restrictions that led to
KVM deferring nested_get_vmcs12_pages(), (a) use a VM-scoped API to read
guest memory so that it always hits non-SMM memslots (for RSM), and (b)
skip the check (since its off-by-default anyways) when the vCPU doesn't
want to run, i.e. when userspace is restoring/stuffing state.
Fixes: 1100e4910ad2 ("KVM: nVMX: Add an off-by-default module param to WARN on missed consistency checks")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/nested.c | 65 +++++++++++++++++----------------------
1 file changed, 28 insertions(+), 37 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b2c851cc7d5c..039e234e7d2b 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -582,6 +582,8 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
static int nested_vmx_check_tpr_shadow_controls(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
{
+ u32 vtpr;
+
if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW))
return 0;
@@ -591,6 +593,32 @@ static int nested_vmx_check_tpr_shadow_controls(struct kvm_vcpu *vcpu,
if (CC(!nested_cpu_has_vid(vmcs12) && vmcs12->tpr_threshold >> 4))
return -EINVAL;
+ /*
+ * Do the illegal vTPR vs. TPR Threshold consistency check if and only
+ * if KVM is configured to WARN on missed consistency checks, otherwise
+ * it's a waste of time. KVM needs to rely on hardware to fully detect
+ * an illegal combination due to the vTPR being writable by L1 at all
+ * times (it's an in-memory value, not a VMCS field). I.e. even if the
+ * check passes now, it might fail at the actual VM-Enter.
+ *
+ * Keying off the module param also allows treating an invalid vAPIC
+ * page as a consistency check failure without increasing the risk of
+ * breaking a "real" VM.
+ *
+ * Note! Deliberately use the VM-scoped API when reading guest memory,
+ * to ensure the read doesn't hit SMRAM when restoring L2 state on RSM,
+ * and only perform the check when in KVM_RUN, to avoid a false failure
+ * if userspace hasn't yet configured memslots during state restore.
+ */
+ if (warn_on_missed_cc && vcpu->wants_to_run &&
+ nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) &&
+ !nested_cpu_has_vid(vmcs12) &&
+ !nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) &&
+ (CC(kvm_read_guest(vcpu->kvm, vmcs12->virtual_apic_page_addr + APIC_TASKPRI,
+ &vtpr, sizeof(vtpr))) ||
+ CC((vmcs12->tpr_threshold & GENMASK(3, 0)) > ((vtpr >> 4) & GENMASK(3, 0)))))
+ return -EINVAL;
+
return 0;
}
@@ -3115,38 +3143,6 @@ static int nested_vmx_check_controls(struct kvm_vcpu *vcpu,
return 0;
}
-static int nested_vmx_check_controls_late(struct kvm_vcpu *vcpu,
- struct vmcs12 *vmcs12)
-{
- void *vapic = to_vmx(vcpu)->nested.virtual_apic_map.hva;
- u32 vtpr = vapic ? (*(u32 *)(vapic + APIC_TASKPRI)) >> 4 : 0;
-
- /*
- * Don't bother with the consistency checks if KVM isn't configured to
- * WARN on missed consistency checks, as KVM needs to rely on hardware
- * to fully detect an illegal vTPR vs. TRP Threshold combination due to
- * the vTPR being writable by L1 at all times (it's an in-memory value,
- * not a VMCS field). I.e. even if the check passes now, it might fail
- * at the actual VM-Enter.
- *
- * Keying off the module param also allows treating an invalid vAPIC
- * mapping as a consistency check failure without increasing the risk
- * of breaking a "real" VM.
- */
- if (!warn_on_missed_cc)
- return 0;
-
- if ((exec_controls_get(to_vmx(vcpu)) & CPU_BASED_TPR_SHADOW) &&
- nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) &&
- !nested_cpu_has_vid(vmcs12) &&
- !nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) &&
- (CC(!vapic) ||
- CC((vmcs12->tpr_threshold & GENMASK(3, 0)) > (vtpr & GENMASK(3, 0)))))
- return -EINVAL;
-
- return 0;
-}
-
static int nested_vmx_check_address_space_size(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
{
@@ -3696,11 +3692,6 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
return NVMX_VMENTRY_KVM_INTERNAL_ERROR;
}
- if (nested_vmx_check_controls_late(vcpu, vmcs12)) {
- vmx_switch_vmcs(vcpu, &vmx->vmcs01);
- return NVMX_VMENTRY_VMFAIL;
- }
-
if (nested_vmx_check_guest_state(vcpu, vmcs12,
&entry_failure_code)) {
exit_reason.basic = EXIT_REASON_INVALID_STATE;
--
2.54.0.1032.g2f8565e1d1-goog
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH 2/2] KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is disabled
2026-06-03 22:34 [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3 Sean Christopherson
2026-06-03 22:34 ` [PATCH 1/2] KVM: nVMX: Move vTPR vs. TPR Threshold consistency check into "normal" checks Sean Christopherson
@ 2026-06-03 22:34 ` Sean Christopherson
2026-06-03 22:50 ` sashiko-bot
2026-06-04 6:23 ` [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3 Jim Mattson
2 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2026-06-03 22:34 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson
Add a dedicated field in "struct nested_vmx" to track L1's pre-VM-Enter CR3
instead of using vmcs01.GUEST_CR3, which isn't anywhere near as safe as the
comment purports it to be. E.g. in addition to the warn_on_missed_cc bug
(that was fixed by relocating the consistency check), if getting vmcs12
pages (during actual nested VM-Entry) fails and EPT is disabled (in KVM),
KVM will return control to userspace with vmcs01.GUEST_CR3 holding a guest-
controlled value.
Alternatively, KVM could force a reload of vmcs01.GUEST_CR3 by resetting
the MMU context in the error path, but as above, the safety of the vmcs01
approach is extremely questionable, e.g. it took all of ~4 months for the
code to break.
Fixes: 671ddc700fd0 ("KVM: nVMX: Don't leak L1 MMIO regions to L2")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/nested.c | 21 ++++++++-------------
arch/x86/kvm/vmx/vmx.h | 7 +++++++
2 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 039e234e7d2b..772b8090d06a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3668,19 +3668,14 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
&vmx->nested.pre_vmenter_ssp_tbl);
/*
- * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled. In the
- * event of a "late" VM-Fail, i.e. a VM-Fail detected by hardware but
- * not KVM, KVM must unwind its software model to the pre-VM-Entry host
- * state. When EPT is disabled, GUEST_CR3 holds KVM's shadow CR3, not
- * L1's "real" CR3, which causes nested_vmx_restore_host_state() to
- * corrupt vcpu->arch.cr3. Stuffing vmcs01.GUEST_CR3 results in the
- * unwind naturally setting arch.cr3 to the correct value. Smashing
- * vmcs01.GUEST_CR3 is safe because nested VM-Exits, and the unwind,
- * reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is guaranteed to be
- * overwritten with a shadow CR3 prior to re-entering L1.
+ * Stash L1's CR3, so that in the event of a "late" VM-Fail, i.e. a
+ * VM-Fail detected by hardware but not KVM, KVM can unwind its
+ * software model to the pre-VM-Entry host state. When EPT is
+ * disabled, GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3,
+ * and so simply restoring from vmcs01.GUEST_CR3 would corrupt
+ * vcpu->arch.cr3.
*/
- if (!enable_ept)
- vmcs_writel(GUEST_CR3, vcpu->arch.cr3);
+ vmx->nested.pre_vmenter_cr3 = vcpu->arch.cr3;
vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02);
@@ -4992,7 +4987,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
nested_ept_uninit_mmu_context(vcpu);
- vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+ vcpu->arch.cr3 = vmx->nested.pre_vmenter_cr3;
kvm_register_mark_available(vcpu, VCPU_REG_CR3);
/*
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index de9de0d2016c..dc8517f15bc4 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -159,6 +159,13 @@ struct nested_vmx {
bool has_preemption_timer_deadline;
bool preemption_timer_expired;
+ /*
+ * Used to restore L1's CR3 if hardware detects a VM-Fail Consistency
+ * Check that KVM does not, in which case KVM needs to unwind CR3 back
+ * to its pre-VM-Enter state, NOT to vmcs01.HOST_CR3.
+ */
+ unsigned long pre_vmenter_cr3;
+
/*
* Used to snapshot MSRs that are conditionally loaded on VM-Enter in
* order to propagate the guest's pre-VM-Enter value into vmcs02. For
--
2.54.0.1032.g2f8565e1d1-goog
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH 2/2] KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is disabled
2026-06-03 22:34 ` [PATCH 2/2] KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is disabled Sean Christopherson
@ 2026-06-03 22:50 ` sashiko-bot
2026-06-03 22:57 ` Sean Christopherson
0 siblings, 1 reply; 8+ messages in thread
From: sashiko-bot @ 2026-06-03 22:50 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Reading `vcpu->arch.cr3` directly captures a stale CR3 value when EPT is enabled, corrupting L1's CR3 on late VM-Fail.
--
commit bc5d55e01f501ccb793937fda67ddf44633f6fda
Author: Sean Christopherson <seanjc@google.com>
KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is disabled
This commit adds a dedicated field in struct nested_vmx to track L1's
pre-VM-Enter CR3, avoiding the use of vmcs01.GUEST_CR3. It also modifies
nested_vmx_enter_non_root_mode() to save the CR3 value, and
nested_vmx_restore_host_state() to restore it.
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3651,19 +3651,14 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> &vmx->nested.pre_vmenter_ssp_tbl);
>
> /*
> - * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled. In the
> - * event of a "late" VM-Fail, i.e. a VM-Fail detected by hardware but
> - * not KVM, KVM must unwind its software model to the pre-VM-Entry host
> - * state. When EPT is disabled, GUEST_CR3 holds KVM's shadow CR3, not
> - * L1's "real" CR3, which causes nested_vmx_restore_host_state() to
> - * corrupt vcpu->arch.cr3. Stuffing vmcs01.GUEST_CR3 results in the
> - * unwind naturally setting arch.cr3 to the correct value. Smashing
> - * vmcs01.GUEST_CR3 is safe because nested VM-Exits, and the unwind,
> - * reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is guaranteed to be
> - * overwritten with a shadow CR3 prior to re-entering L1.
> + * Stash L1's CR3, so that in the event of a "late" VM-Fail, i.e. a
> + * VM-Fail detected by hardware but not KVM, KVM can unwind its
> + * software model to the pre-VM-Entry host state. When EPT is
> + * disabled, GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3,
> + * and so simply restoring from vmcs01.GUEST_CR3 would corrupt
> + * vcpu->arch.cr3.
> */
> - if (!enable_ept)
> - vmcs_writel(GUEST_CR3, vcpu->arch.cr3);
> + vmx->nested.pre_vmenter_cr3 = vcpu->arch.cr3;
[Severity: High]
Does directly reading vcpu->arch.cr3 here capture a stale value when EPT is
enabled?
Because KVM does not intercept L1 CR3 writes when EPT is enabled, L1 can
update its CR3 without KVM immediately tracking it. On VM-Exit (such as for
VMLAUNCH), KVM invalidates the CR3 register cache by clearing the
VCPU_REG_CR3 availability bit.
If nested_vmx_enter_non_root_mode() directly assigns vmx->nested.pre_vmenter_cr3
from vcpu->arch.cr3 instead of using kvm_read_cr3(vcpu), it seems this would
bypass the cache refresh and stash a stale CR3 value.
>
> vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02);
>
[ ... ]
> @@ -4975,7 +4970,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
>
> nested_ept_uninit_mmu_context(vcpu);
> - vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> + vcpu->arch.cr3 = vmx->nested.pre_vmenter_cr3;
[Severity: High]
If pre_vmenter_cr3 stashed a stale value, will restoring it here corrupt L1's
CR3 on a late VM-Fail?
If a late VM-Fail occurs (detected by hardware but not KVM), writing this
potentially stale pre_vmenter_cr3 back into KVM's state (vcpu->arch.cr3) and
marking it as available could cause L1 to resume with an outdated CR3, which
might lead to guest memory corruption or a triple fault.
> kvm_register_mark_available(vcpu, VCPU_REG_CR3);
>
> /*
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260603223418.1720035-1-seanjc@google.com?part=2
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is disabled
2026-06-03 22:50 ` sashiko-bot
@ 2026-06-03 22:57 ` Sean Christopherson
0 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2026-06-03 22:57 UTC (permalink / raw)
To: sashiko-reviews; +Cc: kvm
On Wed, Jun 03, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -3651,19 +3651,14 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> > &vmx->nested.pre_vmenter_ssp_tbl);
> >
> > /*
> > - * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled. In the
> > - * event of a "late" VM-Fail, i.e. a VM-Fail detected by hardware but
> > - * not KVM, KVM must unwind its software model to the pre-VM-Entry host
> > - * state. When EPT is disabled, GUEST_CR3 holds KVM's shadow CR3, not
> > - * L1's "real" CR3, which causes nested_vmx_restore_host_state() to
> > - * corrupt vcpu->arch.cr3. Stuffing vmcs01.GUEST_CR3 results in the
> > - * unwind naturally setting arch.cr3 to the correct value. Smashing
> > - * vmcs01.GUEST_CR3 is safe because nested VM-Exits, and the unwind,
> > - * reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is guaranteed to be
> > - * overwritten with a shadow CR3 prior to re-entering L1.
> > + * Stash L1's CR3, so that in the event of a "late" VM-Fail, i.e. a
> > + * VM-Fail detected by hardware but not KVM, KVM can unwind its
> > + * software model to the pre-VM-Entry host state. When EPT is
> > + * disabled, GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3,
> > + * and so simply restoring from vmcs01.GUEST_CR3 would corrupt
> > + * vcpu->arch.cr3.
> > */
> > - if (!enable_ept)
> > - vmcs_writel(GUEST_CR3, vcpu->arch.cr3);
> > + vmx->nested.pre_vmenter_cr3 = vcpu->arch.cr3;
>
> [Severity: High]
> Does directly reading vcpu->arch.cr3 here capture a stale value when EPT is
> enabled?
>
> Because KVM does not intercept L1 CR3 writes when EPT is enabled, L1 can
> update its CR3 without KVM immediately tracking it. On VM-Exit (such as for
> VMLAUNCH), KVM invalidates the CR3 register cache by clearing the
> VCPU_REG_CR3 availability bit.
>
> If nested_vmx_enter_non_root_mode() directly assigns vmx->nested.pre_vmenter_cr3
> from vcpu->arch.cr3 instead of using kvm_read_cr3(vcpu), it seems this would
> bypass the cache refresh and stash a stale CR3 value.
Gah, yes.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3
2026-06-03 22:34 [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3 Sean Christopherson
2026-06-03 22:34 ` [PATCH 1/2] KVM: nVMX: Move vTPR vs. TPR Threshold consistency check into "normal" checks Sean Christopherson
2026-06-03 22:34 ` [PATCH 2/2] KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is disabled Sean Christopherson
@ 2026-06-04 6:23 ` Jim Mattson
2026-06-04 13:15 ` Sean Christopherson
2 siblings, 1 reply; 8+ messages in thread
From: Jim Mattson @ 2026-06-04 6:23 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel
On Wed, Jun 3, 2026 at 3:34 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Fix two bugs where KVM can run L2 with a guest-controlled CR3. The underlying
> flaw dates back to commit f087a02941fe ("KVM: nVMX: Stash L1's CR3 in
> vmcs01.GUEST_CR3 on nested entry w/o EPT"). Past me claimed:
>
> Smashing vmcs01.GUEST_CR3 is safe because nested VM-Exits, and the unwind,
> reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is guaranteed to be overwritten with
> a shadow CR3 prior to re-entering L1.
>
> which was and is true, _if_ a nested VM-Exit or the unwind is reached. If KVM
> fails directly, vmcs01.GUEST_CR3 will be left pointing at L1's actual CR3, i.e.
> KVM will run with legacy shadow paging a guest-controlled CR3, which is... not
> good.
>
> Note, the vTPR fix will cause KVM-Unit-Test's TPR Threshold test to fail when
> run with warn_on_missed_cc=1:
>
> FAIL: Use TPR shadow enabled: virtual-APIC address = 8000000: vmlaunch succeeds
> FAIL: Use TPR shadow enabled: virtual-APIC address = 10000000: vmlaunch succeeds
> ...
> FAIL: Use TPR shadow enabled: virtual-APIC address = ffffffffff000: vmlaunch succeeds
>
> due to KVM actually accessing the virtual APIC page. Given that I'm pretty
> sure I'm the only person that runs with warn_on_missed_cc=1, and that IMO this
> is firmly a test bug (e.g. on bare metal I'm pretty sure there's guarantee
> VMLAUNCH will succeed), I don't see any reason to try and make it "work" in
> KVM. I might try to figure out a way to make the KUT testcase play nice, but
> for now I'll just ignore the failures in my test runs.
IIRC, the tests in question are confirming PCI bus error semantics.
Why would the VMLAUNCH not succeed on bare metal?
> Sean Christopherson (2):
> KVM: nVMX: Move vTPR vs. TPR Threshold consistency check into "normal"
> checks
> KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is
> disabled
>
> arch/x86/kvm/vmx/nested.c | 86 ++++++++++++++++-----------------------
> arch/x86/kvm/vmx/vmx.h | 7 ++++
> 2 files changed, 43 insertions(+), 50 deletions(-)
>
>
> base-commit: d4bfaa66fa171089b9b9fb2dc17af9245f2b9b34
> --
> 2.54.0.1032.g2f8565e1d1-goog
>
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3
2026-06-04 6:23 ` [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3 Jim Mattson
@ 2026-06-04 13:15 ` Sean Christopherson
2026-06-04 14:13 ` Jim Mattson
0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2026-06-04 13:15 UTC (permalink / raw)
To: Jim Mattson; +Cc: Paolo Bonzini, kvm, linux-kernel
On Wed, Jun 03, 2026, Jim Mattson wrote:
> On Wed, Jun 3, 2026 at 3:34 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Fix two bugs where KVM can run L2 with a guest-controlled CR3. The underlying
> > flaw dates back to commit f087a02941fe ("KVM: nVMX: Stash L1's CR3 in
> > vmcs01.GUEST_CR3 on nested entry w/o EPT"). Past me claimed:
> >
> > Smashing vmcs01.GUEST_CR3 is safe because nested VM-Exits, and the unwind,
> > reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is guaranteed to be overwritten with
> > a shadow CR3 prior to re-entering L1.
> >
> > which was and is true, _if_ a nested VM-Exit or the unwind is reached. If KVM
> > fails directly, vmcs01.GUEST_CR3 will be left pointing at L1's actual CR3, i.e.
> > KVM will run with legacy shadow paging a guest-controlled CR3, which is... not
> > good.
> >
> > Note, the vTPR fix will cause KVM-Unit-Test's TPR Threshold test to fail when
> > run with warn_on_missed_cc=1:
> >
> > FAIL: Use TPR shadow enabled: virtual-APIC address = 8000000: vmlaunch succeeds
> > FAIL: Use TPR shadow enabled: virtual-APIC address = 10000000: vmlaunch succeeds
> > ...
> > FAIL: Use TPR shadow enabled: virtual-APIC address = ffffffffff000: vmlaunch succeeds
> >
> > due to KVM actually accessing the virtual APIC page. Given that I'm pretty
> > sure I'm the only person that runs with warn_on_missed_cc=1, and that IMO this
> > is firmly a test bug (e.g. on bare metal I'm pretty sure there's guarantee
> > VMLAUNCH will succeed), I don't see any reason to try and make it "work" in
> > KVM. I might try to figure out a way to make the KUT testcase play nice, but
> > for now I'll just ignore the failures in my test runs.
>
> IIRC, the tests in question are confirming PCI bus error semantics.
> Why would the VMLAUNCH not succeed on bare metal?
Well, I was assuming it would fail because it couldn't actually guarantee PCI Bus
Errors, as it could very well stumble into actual device memory. But the test
leaves TPR_THRESHOLD as '0', and so regardless of what value the CPU gets back,
consistency check will still pass.
But! I'm pretty sure the test would generate #MCs, not PCI bus errors. The
SDM very, very strongly implies that the reads will use WB:
Bits 53:50 report the memory type that should be used for the VMCS, for
data structures referenced by pointers in the VMCS (I/O bitmaps,
virtual-APIC page, MSR areas for VMX transitions), and for the MSEG header.
^^^^^^^^^^^^^^^^^
If software needs to access these data structures (e.g., to modify the
contents of the MSR bitmaps), it can configure the paging structures to map
them into the linear-address space. If it does so, it should establish
mappings that use the memory type reported bits 53:50 in this MSR.
As of this writing, all processors that support VMX operation indicate the
write-back type. The values used are given in Table A-1.
And _that_ will definitely cause problems, especially if the read hits device
memory.
That said, KVM's de facto ABI is that VMX instructions get PCI Bus Error semantics
on accesses KVM can't handle, and it's just as easy to skip the consistency check.
Since a read of 0xff guarantees the vTPR >= TPR_THRESHOLD, the check will pass
regardless of TPR_THRESHOLD.
So, other than my stubbornness :-D, there's no reason to deliberately fail the
check if KVM can't read memory. I'll go with this for v2:
gpa_t vtpr_gpa = vmcs12->virtual_apic_page_addr + APIC_TASKPRI;
u32 vtpr;
if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW))
return 0;
if (CC(!page_address_valid(vcpu, vmcs12->virtual_apic_page_addr)))
return -EINVAL;
if (CC(!nested_cpu_has_vid(vmcs12) && vmcs12->tpr_threshold >> 4))
return -EINVAL;
/*
* Do the illegal vTPR vs. TPR Threshold consistency check if and only
* if KVM is configured to WARN on missed consistency checks, otherwise
* it's a waste of time. KVM needs to rely on hardware to fully detect
* an illegal combination due to the vTPR being writable by L1 at all
* times (it's an in-memory value, not a VMCS field). I.e. even if the
* check passes now, it might fail at the actual VM-Enter.
*
* If reading guest memory fails, skip the check as KVM's de facto ABI
* for VMX instruction accesses to non-existent memory is to provide
* PCI Bus Error semantics (reads return 0xFFs), in which case the vTPR
* is guaranteed to greater than or equal to the threshold.
*
* Note! Deliberately use the VM-scoped API when reading guest memory,
* to ensure the read doesn't hit SMRAM when restoring L2 state on RSM,
* and only perform the check when in KVM_RUN, to avoid a false failure
* if userspace hasn't yet configured memslots during state restore.
*/
if (warn_on_missed_cc && vcpu->wants_to_run &&
nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) &&
!nested_cpu_has_vid(vmcs12) &&
!nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) &&
!kvm_read_guest(vcpu->kvm, vtpr_gpa, &vtpr, sizeof(vtpr)) &&
CC((vmcs12->tpr_threshold & GENMASK(3, 0)) > ((vtpr >> 4) & GENMASK(3, 0))))
return -EINVAL;
return 0;
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [PATCH 0/2] KVM: nVMX: Fix ept=n bugs where KVM runs L2 with guest CR3
2026-06-04 13:15 ` Sean Christopherson
@ 2026-06-04 14:13 ` Jim Mattson
0 siblings, 0 replies; 8+ messages in thread
From: Jim Mattson @ 2026-06-04 14:13 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel
On Thu, Jun 4, 2026 at 6:15 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jun 03, 2026, Jim Mattson wrote:
> > ...
> > IIRC, the tests in question are confirming PCI bus error semantics.
> > Why would the VMLAUNCH not succeed on bare metal?
>
> Well, I was assuming it would fail because it couldn't actually guarantee PCI Bus
> Errors, as it could very well stumble into actual device memory. But the test
> leaves TPR_THRESHOLD as '0', and so regardless of what value the CPU gets back,
> consistency check will still pass.
>
> But! I'm pretty sure the test would generate #MCs, not PCI bus errors. The
> SDM very, very strongly implies that the reads will use WB:
>
> Bits 53:50 report the memory type that should be used for the VMCS, for
> data structures referenced by pointers in the VMCS (I/O bitmaps,
> virtual-APIC page, MSR areas for VMX transitions), and for the MSEG header.
> ^^^^^^^^^^^^^^^^^
>
> If software needs to access these data structures (e.g., to modify the
> contents of the MSR bitmaps), it can configure the paging structures to map
> them into the linear-address space. If it does so, it should establish
> mappings that use the memory type reported bits 53:50 in this MSR.
>
> As of this writing, all processors that support VMX operation indicate the
> write-back type. The values used are given in Table A-1.
>
> And _that_ will definitely cause problems, especially if the read hits device
> memory.
Good point. However, "problems" here are machine checks, right? Not VM
entry with invalid control field(s).
> That said, KVM's de facto ABI is that VMX instructions get PCI Bus Error semantics
> on accesses KVM can't handle, and it's just as easy to skip the consistency check.
> Since a read of 0xff guarantees the vTPR >= TPR_THRESHOLD, the check will pass
> regardless of TPR_THRESHOLD.
Hmmm...Is that an erratum? :)
> So, other than my stubbornness :-D, there's no reason to deliberately fail the
> check if KVM can't read memory. I'll go with this for v2:
>
> gpa_t vtpr_gpa = vmcs12->virtual_apic_page_addr + APIC_TASKPRI;
> u32 vtpr;
>
> if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW))
> return 0;
>
> if (CC(!page_address_valid(vcpu, vmcs12->virtual_apic_page_addr)))
> return -EINVAL;
>
> if (CC(!nested_cpu_has_vid(vmcs12) && vmcs12->tpr_threshold >> 4))
> return -EINVAL;
>
> /*
> * Do the illegal vTPR vs. TPR Threshold consistency check if and only
> * if KVM is configured to WARN on missed consistency checks, otherwise
> * it's a waste of time. KVM needs to rely on hardware to fully detect
> * an illegal combination due to the vTPR being writable by L1 at all
> * times (it's an in-memory value, not a VMCS field). I.e. even if the
> * check passes now, it might fail at the actual VM-Enter.
> *
> * If reading guest memory fails, skip the check as KVM's de facto ABI
> * for VMX instruction accesses to non-existent memory is to provide
> * PCI Bus Error semantics (reads return 0xFFs), in which case the vTPR
> * is guaranteed to greater than or equal to the threshold.
> *
> * Note! Deliberately use the VM-scoped API when reading guest memory,
> * to ensure the read doesn't hit SMRAM when restoring L2 state on RSM,
> * and only perform the check when in KVM_RUN, to avoid a false failure
> * if userspace hasn't yet configured memslots during state restore.
> */
I really wish that KVM dictated the save and restore sequences. :/
> if (warn_on_missed_cc && vcpu->wants_to_run &&
> nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) &&
> !nested_cpu_has_vid(vmcs12) &&
> !nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) &&
> !kvm_read_guest(vcpu->kvm, vtpr_gpa, &vtpr, sizeof(vtpr)) &&
> CC((vmcs12->tpr_threshold & GENMASK(3, 0)) > ((vtpr >> 4) & GENMASK(3, 0))))
> return -EINVAL;
>
> return 0;
^ permalink raw reply [flat|nested] 8+ messages in thread