public inbox for kvm@vger.kernel.org
* [RFC PATCH 0/1] KVM: VMX: restore host CR2 after VM exit
@ 2026-04-22 17:49 Konstantin Khorenko
  2026-04-22 17:50 ` [RFC PATCH 1/1] " Konstantin Khorenko
  2026-04-22 18:56 ` [RFC PATCH 0/1] " Sean Christopherson
  0 siblings, 2 replies; 3+ messages in thread
From: Konstantin Khorenko @ 2026-04-22 17:49 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, kvm
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H . Peter Anvin, x86, linux-kernel, Pavel Tikhomirov

Hi,

this is an RFC for a small change to arch/x86/kvm/vmx/vmx.c that
restores the host CR2 register after every VM exit, overwriting the
guest's CR2 value that Intel VMX intentionally leaves in the hardware
register.


Where we first saw the problem:

The panic that triggered this investigation was not observed on a
mainline kernel.  It came from a RHEL9-based downstream Virtuozzo
kernel running a nested KVM workload: the host where the panic
happened is itself an L1 KVM guest of an outer hypervisor, and it runs
its own L2 KVM guests on top.

In the crash dump, four cascading kernel #PF oopses all reported the
same CR2 value, while the instructions at the reported faulting RIPs
could not have produced that fault:

  add $0x8,%rsp          (no memory access at all)
  lea ...,%r12           (no memory access at all)
  mov %rax,0x10(%rsp)    (write to local stack, not to the reported CR2)
  mov %rax,(%rbp)        (write to local stack, not to the reported CR2)

All four oopses happened inside the L1 host itself: the original fault
plus three further faults taken inside the oops-reporting code
(dump_pagetable() -> copy_from_kernel_nofault(), vt_console_print() ->
lf(), vsnprintf() in the "Modules linked in" path).
They are not extra levels of guest nesting; the nesting stack in this
setup is just two deep (outer hypervisor, then this L1 host running its
own L2 guests).

The reported CR2 did not correspond to any register or operand in the
surrounding code, but it had the form of a kernel virtual address
belonging to an inner KVM guest that had run on that CPU.

That matched a well-known property of VMX: after a VM exit, the
hardware CR2 register still holds the guest's CR2.  KVM only copies it
into vcpu->arch.cr2; it does not restore the host value.  In a real #PF
the CPU would overwrite CR2 with the new faulting address, so the fact
that CR2 stays the same across all four oopses indicates that at least
three of them reached the #PF reporting path without the CPU having
actually updated CR2.
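
As a rough userspace illustration of that property (a model only, not
kernel code and not taken from the patch), the sketch below treats a
plain variable as the CR2 register.  Every name in it (hw_cr2,
model_vcpu, model_enter_exit() and so on) is hypothetical, and the
addresses used with it are made up.  It only shows that a host-side
reader of CR2 sees the guest's value once the guest has run.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the physical CR2 register (assumption). */
static uint64_t hw_cr2;

/* Hypothetical miniature of the per-vCPU state; models vcpu->arch.cr2. */
struct model_vcpu {
	uint64_t cr2;
};

static uint64_t model_read_cr2(void) { return hw_cr2; }
static void model_write_cr2(uint64_t val) { hw_cr2 = val; }

/* Models the guest running and taking a #PF of its own at guest_fault. */
static void model_guest_run(uint64_t guest_fault)
{
	hw_cr2 = guest_fault;
}

/* Load the guest CR2, run the guest, then save the guest CR2. */
static void model_enter_exit(struct model_vcpu *vcpu, uint64_t guest_fault)
{
	if (vcpu->cr2 != model_read_cr2())
		model_write_cr2(vcpu->cr2);

	model_guest_run(guest_fault);

	vcpu->cr2 = model_read_cr2();
	/* Nothing here puts the host's previous CR2 value back. */
}

/* Returns what a host-side reader of CR2 sees after one entry/exit. */
static uint64_t model_host_cr2_after_exit(uint64_t host_cr2,
					  uint64_t guest_fault)
{
	struct model_vcpu vcpu = { .cr2 = 0 };

	model_write_cr2(host_cr2);	/* address of the last host #PF */
	model_enter_exit(&vcpu, guest_fault);
	assert(vcpu.cr2 == guest_fault);	/* guest value was saved */
	return model_read_cr2();
}
```

Calling model_host_cr2_after_exit() with a host value of 0x1000 and a
guest fault address of 0xffff888012345678 returns the guest address,
which is the situation the oops headers above reflect.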


Why this mattered in practice:

What actually panicked this L1 host kernel was not the original fault
on its own - in isolation it would most likely have been a single,
locally contained oops.  The fatal part was that the oops reporter
itself tripped over a CR2 pointing into L2-guest memory:
show_fault_info() / dump_pagetable() treated the stale guest CR2 as a
host faulting address and walked the L1 host's page tables for it,
which raised further #PFs inside the reporting code and escalated a
would-be local oops into a full panic of this L1 host.  On top of
that, every oops header in the log carried that same guest-derived
CR2, which makes this class of crashes effectively undiagnosable from
the dump alone - the "faulting address" printed next to the RIP has
nothing to do with what the RIP was actually doing.  Restoring the
host CR2 after VM exit removes both effects at the source.


Why this is an RFC rather than a straight PATCH:

The mechanical fact (VMX leaves the guest CR2 in the hardware register
after VM exit, and the rest of the kernel treats CR2 as "address of
the last host #PF") is easy to verify from the source.  What I cannot
pin down from that one dump is which exact delivery path brought a #PF
handler into play with the CPU not having updated CR2 on that run.
The plausible candidates include:

  - corner cases of outer-hypervisor event injection into this host;
  - NMI/MCE entries racing with oops reporting;
  - crash/__show_regs() invoked from contexts other than a freshly
    taken #PF, where die()/oops code reads CR2 as if it were fresh.

All of these stop mattering the moment the host CR2 stops being a
guest-controlled value after a VM exit.  The patch targets the
weakest link directly: the "CR2 on the host == address of the last
host #PF" invariant should hold across VM entry/exit on VMX, and
today it does not.

AMD SVM does not need this patch because CR2 lives in the VMCB save area
and the CPU handles host/guest CR2 automatically; KVM's SVM code only
ever touches svm->vmcb->save.cr2.  I am happy to add a brief comment to
that effect in svm.c if it would help prevent a similar "optimization"
from being introduced there.


Verification against mainline:

To make sure the issue is not already fixed somewhere in mainline, I
checked the current mainline tree:

  v7.0-12635-g6596a02b20788  (base-commit of this patch)

In arch/x86/kvm/vmx/vmx.c::vmx_vcpu_enter_exit(), the code still
reads:

        if (vcpu->arch.cr2 != native_read_cr2())
                native_write_cr2(vcpu->arch.cr2);

        vmx->fail = __vmx_vcpu_run(vmx, ...);

        vcpu->arch.cr2 = native_read_cr2();
        /* host CR2 is not restored */

A full-tree search confirms that vmx_vcpu_enter_exit() is the only
place in arch/x86/kvm/vmx that touches the hardware CR2 register:
vmenter.S does not touch it, and arch/x86/kvm/svm never accesses it.
So the same latent behaviour is present in mainline, regardless of
whether this exact crash has been reported against a mainline kernel.
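
For illustration only, here is a userspace model of what "restoring
the host CR2" around that sequence could look like.  It is a sketch
under stated assumptions, not the code from patch 1/1: hw_cr2 is a
plain variable standing in for the register, all model_* names and the
host_cr2 local are hypothetical, and __builtin_expect (a GCC/Clang
builtin) stands in for the kernel's unlikely().

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the physical CR2 register (assumption). */
static uint64_t hw_cr2;

/* Hypothetical miniature of the per-vCPU state; models vcpu->arch.cr2. */
struct model_vcpu {
	uint64_t cr2;
};

static uint64_t model_read_cr2(void) { return hw_cr2; }
static void model_write_cr2(uint64_t val) { hw_cr2 = val; }

/* Models the guest running and taking a #PF of its own at guest_fault. */
static void model_guest_run(uint64_t guest_fault)
{
	hw_cr2 = guest_fault;
}

/* Same sequence as the excerpt, plus a save/restore of the host value. */
static void model_enter_exit_restore(struct model_vcpu *vcpu,
				     uint64_t guest_fault)
{
	uint64_t host_cr2 = model_read_cr2();	/* hypothetical local */

	if (vcpu->cr2 != host_cr2)
		model_write_cr2(vcpu->cr2);

	model_guest_run(guest_fault);

	vcpu->cr2 = model_read_cr2();

	/* Write the host value back only if the register now differs. */
	if (__builtin_expect(vcpu->cr2 != host_cr2, 0))
		model_write_cr2(host_cr2);
}

/*
 * Runs one modelled entry/exit. Returns the CR2 value visible to the
 * host afterwards and stores the saved guest value in *guest_cr2_out.
 */
static uint64_t model_host_cr2_after_restore(uint64_t host_cr2,
					     uint64_t guest_fault,
					     uint64_t *guest_cr2_out)
{
	struct model_vcpu vcpu = { .cr2 = 0 };

	model_write_cr2(host_cr2);	/* address of the last host #PF */
	model_enter_exit_restore(&vcpu, guest_fault);
	assert(vcpu.cr2 == guest_fault);	/* guest value still saved */
	*guest_cr2_out = vcpu.cr2;
	return model_read_cr2();
}
```

In this model the guest's value is still captured in the per-vCPU
field, while the host-visible register ends up holding the value it
had before entry.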


Patch properties:

  - Hot path impact: one extra register compare in the common case,
    one extra MOV to CR2 under unlikely() when the guest modified CR2.

  - Stays within the existing noinstr region.  native_read_cr2() and
    native_write_cr2() are plain inline asm with no instrumentation,
    so noinstr constraints are preserved.

  - Not a security fix for a user-triggerable issue per se, but it
    removes a class of confusing "kernel CR2 points into guest memory"
    oops reports and hardens the CR2 invariant for the whole kernel.

Thanks,
Konstantin

Konstantin Khorenko (1):
  KVM: VMX: restore host CR2 after VM exit

 arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)


base-commit: 6596a02b207886e9e00bb0161c7fd59fea53c081
-- 
2.43.0

