* [RFC PATCH 0/1] KVM: VMX: restore host CR2 after VM exit
@ 2026-04-22 17:49 Konstantin Khorenko
2026-04-22 17:50 ` [RFC PATCH 1/1] " Konstantin Khorenko
2026-04-22 18:56 ` [RFC PATCH 0/1] " Sean Christopherson
0 siblings, 2 replies; 3+ messages in thread
From: Konstantin Khorenko @ 2026-04-22 17:49 UTC
To: Sean Christopherson, Paolo Bonzini, kvm
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H . Peter Anvin, x86, linux-kernel, Pavel Tikhomirov
Hi,
this is an RFC for a small change to arch/x86/kvm/vmx/vmx.c that
restores the host CR2 register after every VM exit, overwriting the
guest's CR2 value that Intel VMX intentionally leaves in the hardware
register.
Where we first saw the problem:
The panic that triggered this investigation was not observed on a
mainline kernel. It came from a RHEL9-based downstream Virtuozzo
kernel running a nested KVM workload: the host where the panic
happened is itself an L1 KVM guest of an outer hypervisor, and it runs
its own L2 KVM guests on top.
In the crash dump, four cascading kernel #PF oopses all reported the
same CR2 value, while the instructions at the reported faulting RIPs
could not have produced that fault:
add $0x8,%rsp (no memory access at all)
lea ...,%r12 (no memory access at all)
mov %rax,0x10(%rsp) (write to local stack, not to the reported CR2)
mov %rax,(%rbp) (write to local stack, not to the reported CR2)
All four oopses happened inside the L1 host itself: the original fault
plus three further faults taken inside the oops-reporting code
(dump_pagetable() -> copy_from_kernel_nofault(), vt_console_print() ->
lf(), vsnprintf() in the "Modules linked in" path).
They are not extra levels of guest nesting; the nesting stack in this
setup is just two deep (outer hypervisor, then this L1 host running its
own L2 guests).
The reported CR2 did not correspond to any register or operand in the
surrounding code, but it had the form of a kernel virtual address
belonging to an inner KVM guest that had run on that CPU.
That matched the well-known property of VMX: after a VM exit, the
hardware CR2 register still holds the guest's CR2 - KVM only copies it
into vcpu->arch.cr2, it does not restore the host value. In a real #PF
the CPU would overwrite CR2 with the new faulting address; the fact that
CR2 stays the same across all four oopses indicates that at least three
of them reached the #PF reporting path without the CPU having actually
updated CR2.
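To make the sequence concrete, here is a minimal userspace sketch of the
behaviour described above. It is a model, not kernel code: hw_cr2,
struct model_vcpu and the model_*() functions are hypothetical stand-ins
for the hardware register, vcpu->arch.cr2 and the VMX entry/exit path,
and the guest "run" is reduced to a single write of a fault address.

```c
#include <stdint.h>

/* Userspace model only: hw_cr2 stands in for the physical CR2 register. */
static uint64_t hw_cr2;

struct model_vcpu {
	uint64_t cr2;	/* plays the role of vcpu->arch.cr2 */
};

/* Hypothetical stand-in for the guest running and taking its own #PF. */
static void model_guest_run(uint64_t guest_fault_addr)
{
	hw_cr2 = guest_fault_addr;
}

/* Mirrors the entry/exit sequence: load guest CR2, run, save guest CR2. */
static void model_enter_exit(struct model_vcpu *vcpu, uint64_t guest_fault_addr)
{
	if (vcpu->cr2 != hw_cr2)
		hw_cr2 = vcpu->cr2;

	model_guest_run(guest_fault_addr);

	vcpu->cr2 = hw_cr2;
	/* Nothing writes the pre-entry value back into hw_cr2. */
}

/* What a host-side reader of CR2 sees if no fresh host #PF intervened. */
static uint64_t model_cr2_seen_by_host(uint64_t host_cr2_before,
				       uint64_t guest_fault_addr)
{
	struct model_vcpu vcpu = { .cr2 = 0 };

	hw_cr2 = host_cr2_before;
	model_enter_exit(&vcpu, guest_fault_addr);
	return hw_cr2;
}
```

In this model, model_cr2_seen_by_host() always returns the guest's fault
address rather than host_cr2_before, which is the situation the oops
headers in the dump are consistent with.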
Why this mattered in practice:
What actually panicked this L1 host kernel was not the original fault
on its own - in isolation it would most likely have been a single,
locally contained oops. The fatal part was that the oops reporter
itself tripped over a CR2 pointing into L2-guest memory:
show_fault_info() / dump_pagetable() treated the stale guest CR2 as a
host faulting address and walked the L1 host's page tables for it,
which raised further #PFs inside the reporting code and escalated a
would-be local oops into a full panic of this L1 host. On top of
that, every oops header in the log carried that same guest-derived
CR2, which makes this class of crashes effectively undiagnosable from
the dump alone - the "faulting address" printed next to the RIP has
nothing to do with what the RIP was actually doing. Restoring the
host CR2 after VM exit removes both effects at the source.
Why this is an RFC rather than a straight PATCH:
The mechanical fact (VMX leaves the guest CR2 in the hardware register
after VM exit, and the rest of the kernel treats CR2 as "address of
the last host #PF") is easy to verify from the source. What I cannot
pin down from that one dump is which exact delivery path brought a #PF
handler into play with the CPU not having updated CR2 on that run.
The plausible candidates include:
- corner cases of outer-hypervisor event injection into this host;
- NMI/MCE entries racing with oops reporting;
- crash/__show_regs() invoked from contexts other than a freshly
taken #PF, where die()/oops code reads CR2 as if it were fresh.
All of these stop mattering the moment the host CR2 stops being a
guest-controlled value after a VM exit. The patch targets the
weakest link directly: the "CR2 on the host == address of the last
host #PF" invariant should hold across VM entry/exit on VMX, and
today it does not.
AMD SVM does not need this patch because CR2 lives in the VMCB save area
and the CPU handles host/guest CR2 automatically; KVM's SVM code only
ever touches svm->vmcb->save.cr2. I am happy to add a brief comment to
that effect in svm.c if it would help prevent a similar "optimization"
from being introduced there.
Verification against mainline:
To make sure the issue is not already fixed somewhere in mainline, I
checked the current tree:
v7.0-12635-g6596a02b20788 (base-commit of this patch)
In arch/x86/kvm/vmx/vmx.c::vmx_vcpu_enter_exit(), the code still
reads:
    if (vcpu->arch.cr2 != native_read_cr2())
            native_write_cr2(vcpu->arch.cr2);
    vmx->fail = __vmx_vcpu_run(vmx, ...);
    vcpu->arch.cr2 = native_read_cr2();
    /* host CR2 is not restored */
A full-tree search confirms that vmx_vcpu_enter_exit() is the only
place in arch/x86/kvm/vmx that touches the hardware CR2 register:
vmenter.S does not touch it, and arch/x86/kvm/svm never accesses it.
So the same latent behaviour is present in mainline, regardless of
whether this exact crash has been reported against a mainline kernel.
Patch properties:
- Hot path impact: one extra register compare in the common case,
one extra MOV to CR2 under unlikely() when the guest modified CR2.
- Stays within the existing noinstr region. native_read_cr2() and
native_write_cr2() are plain inline asm with no instrumentation,
so noinstr constraints are preserved.
- Not a security fix for a user-triggerable issue per se, but it
removes a class of confusing "kernel CR2 points into guest memory"
oops reports and hardens the CR2 invariant for the whole kernel.
Thanks,
Konstantin
Konstantin Khorenko (1):
KVM: VMX: restore host CR2 after VM exit
arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
base-commit: 6596a02b207886e9e00bb0161c7fd59fea53c081
--
2.43.0
* [RFC PATCH 1/1] KVM: VMX: restore host CR2 after VM exit
From: Konstantin Khorenko @ 2026-04-22 17:50 UTC
To: Sean Christopherson, Paolo Bonzini, kvm
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H . Peter Anvin, x86, linux-kernel, Pavel Tikhomirov
On Intel VMX, CR2 is not part of the VMCS guest/host state area. The
CPU does not save or restore it automatically across VM transitions,
so KVM manages it in software: before VM entry it writes vcpu->arch.cr2
into the hardware register if it differs from the current value, and
after VM exit it reads the hardware register back into vcpu->arch.cr2.
The host CR2 is intentionally left clobbered by the guest after VM
exit, as an optimization: the expectation is that the next host page
fault will overwrite it before anything else looks at it.
That expectation is fragile. The rest of the kernel treats CR2 as if it
always held the address of the last host #PF:
- exc_page_fault() reads it at the very start of #PF handling, before
any instruction could have updated it.
- __show_regs() reads and prints it from die()/oops/crash paths.
Any flow that reaches a #PF handler, or that reads CR2 in an oops or
crash context, without the CPU having just taken a real host #PF, will
observe the guest's CR2 instead of the host's.
On nested setups the stale guest CR2 left in the hardware register
has the form of a kernel virtual address in the inner guest's address
space, which overlaps 1:1 with the outer-guest kernel layout. That
makes the stale value visually indistinguishable from a plausible
outer-guest fault address, which can lead to confusing oops reports
whose CR2 has no relation to the reported faulting RIP.
Fix: save the host CR2 before VM entry into a local variable. After
VM exit, compare the already-read vcpu->arch.cr2 against the saved
host value, and write the host CR2 back if the guest modified it.
In the common case where the guest did not touch CR2 this is a single
register compare with no write; the restore is placed under unlikely()
because most VM-entry/exit cycles do not involve a guest CR2 write.
The change stays within the existing noinstr region;
native_read_cr2()/native_write_cr2() are plain inline asm with no
instrumentation.
This brings VMX in line with the CR2 invariant the rest of the kernel
already relies on.
AMD SVM is not affected. On SVM, CR2 is part of the VMCB save area
and the CPU saves and restores host and guest CR2 automatically on
VMRUN and #VMEXIT. KVM's SVM code only accesses svm->vmcb->save.cr2
and never touches the hardware CR2 register.
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
---
arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef145..dd441b90dfd4a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7458,6 +7458,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 					unsigned int flags)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	unsigned long host_cr2;
 	guest_state_enter_irqoff();
@@ -7465,13 +7466,25 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	vmx_disable_fb_clear(vmx);
-	if (vcpu->arch.cr2 != native_read_cr2())
+	host_cr2 = native_read_cr2();
+	if (vcpu->arch.cr2 != host_cr2)
 		native_write_cr2(vcpu->arch.cr2);
 	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
 				   flags);
 	vcpu->arch.cr2 = native_read_cr2();
+
+	/*
+	 * Restore host CR2 if the guest modified it. The rest of the
+	 * kernel relies on CR2 holding the address of the last host
+	 * #PF; leaving the guest value there can mislead any code path
+	 * that reads CR2 without the CPU having just taken a real host
+	 * #PF (exc_page_fault(), __show_regs() from oops/crash paths,
+	 * NMI/MCE report, nested-virt corner cases, etc.).
+	 */
+	if (unlikely(vcpu->arch.cr2 != host_cr2))
+		native_write_cr2(host_cr2);
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 	vmx->idt_vectoring_info = 0;
--
2.43.0
* Re: [RFC PATCH 0/1] KVM: VMX: restore host CR2 after VM exit
From: Sean Christopherson @ 2026-04-22 18:56 UTC
To: Konstantin Khorenko
Cc: Paolo Bonzini, kvm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, x86, linux-kernel, Pavel Tikhomirov
On Wed, Apr 22, 2026, Konstantin Khorenko wrote:
> All four oopses happened inside the L1 host itself: the original fault
> plus three further faults taken inside the oops-reporting code
> (dump_pagetable() -> copy_from_kernel_nofault(), vt_console_print() ->
> lf(), vsnprintf() in the "Modules linked in" path).
> They are not extra levels of guest nesting; the nesting stack in this
> setup is just two deep (outer hypervisor, then this L1 host running its
> own L2 guests).
...
> The mechanical fact (VMX leaves the guest CR2 in the hardware register
> after VM exit, and the rest of the kernel treats CR2 as "address of
> the last host #PF") is easy to verify from the source. What I cannot
> pin down from that one dump is which exact delivery path brought a #PF
> handler into play with the CPU not having updated CR2 on that run.
> The plausible candidates include:
>
> - corner cases of outer-hypervisor event injection into this host;
> - NMI/MCE entries racing with oops reporting;
> - crash/__show_regs() invoked from contexts other than a freshly
> taken #PF, where die()/oops code reads CR2 as if it were fresh.
>
> All of these stop mattering the moment the host CR2 stops being a
> guest-controlled value after a VM exit. The patch targets the
> weakest link directly: the "CR2 on the host == address of the last
> host #PF" invariant should hold across VM entry/exit on VMX, and
> today it does not.
And it never will (barring a hardware/ucode change). This flaw is impossible to
completely fix on Intel. The best we can do is "restore" host CR2 within a few
instructions of VM-Exit. Intel doesn't provide a GIF equivalent, and so NMIs
can't be blocked in the entry/exit path. E.g. the kernel already needs to be
prepared to handle NMIs with guest CR2 loaded since VMX doesn't provide a way
to block NMIs.
More importantly, I just don't see the point; the host CR2 is _guaranteed_ to be
stale. KVM obviously doesn't do VM-Enter from #PF context. It'll probably be
less garbage than guest CR2, but it's still garbage.
I appreciate that seeing a bogus CR2 can make debug difficult, but IMO, the
benefit of making KVM moderately less painful on rare occasions where all hell
breaks loose isn't worth the cost of the extra CR2 writes. And practically
speaking, the kernel _must_ be hardened against bogus CR2 values when dealing
with OOPSes and panics, because pretty much by definition something has gone
sideways and so CR2 can't be assumed to be benign.
> Patch properties:
>
> - Hot path impact: one extra register compare in the common case,
> one extra MOV to CR2 under unlikely() when the guest modified CR2.
That's not unlikely. The odds of guest CR2 matching host CR2 are basically zero.
In practice, this likely adds two extra CR2 writes on the majority of entry/exit
transitions.
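For illustration, the write-count argument can be checked with a rough
userspace model. This is a sketch under simplifying assumptions, not a
measurement: model_count_cr2_writes() is a hypothetical helper, the host
is assumed to take no #PF between cycles, and the guest is assumed to
leave its CR2 value unchanged while it runs.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model only. Counts simulated CR2 writes over repeated
 * entry/exit cycles, with and without the proposed host CR2 restore.
 */
static unsigned int model_count_cr2_writes(unsigned int cycles,
					   uint64_t host_cr2,
					   uint64_t guest_cr2,
					   bool restore_host)
{
	uint64_t hw_cr2 = host_cr2;	/* simulated hardware register */
	uint64_t vcpu_cr2 = guest_cr2;	/* plays the role of vcpu->arch.cr2 */
	unsigned int writes = 0;

	for (unsigned int i = 0; i < cycles; i++) {
		uint64_t saved_host = hw_cr2;

		/* Pre-entry: load the guest value only if it differs. */
		if (vcpu_cr2 != hw_cr2) {
			hw_cr2 = vcpu_cr2;
			writes++;
		}

		/* Guest runs here; in this model it leaves CR2 unchanged. */

		/* Post-exit: save the guest value. */
		vcpu_cr2 = hw_cr2;

		/* Proposed restore: write the host value back if it differs. */
		if (restore_host && vcpu_cr2 != saved_host) {
			hw_cr2 = saved_host;
			writes++;
		}
	}
	return writes;
}
```

With distinct host and guest values, 1000 cycles give 1 write in this
model without the restore (the guest value simply stays loaded) and 2000
with it: one write at exit to restore the host value, and one at the next
entry to reload the guest value.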
> - Stays within the existing noinstr region. native_read_cr2() and
> native_write_cr2() are plain inline asm with no instrumentation,
> so noinstr constraints are preserved.
>
> - Not a security fix for a user-triggerable issue per se, but it
> removes a class of confusing "kernel CR2 points into guest memory"
> oops reports and hardens the CR2 invariant for the whole kernel.