* [Bug 216212] New: KVM does not handle nested guest enable PAE paging correctly when CR3 is not mapped in EPT
From: bugzilla-daemon @ 2022-07-07 2:38 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=216212
Bug ID: 216212
Summary: KVM does not handle nested guest enable PAE paging
correctly when CR3 is not mapped in EPT
Product: Virtualization
Version: unspecified
Kernel Version: 5.18.9
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: kvm
Assignee: virtualization_kvm@kernel-bugs.osdl.org
Reporter: ercli@ucdavis.edu
Regression: No
Created attachment 301352
--> https://bugzilla.kernel.org/attachment.cgi?id=301352&action=edit
LHV image used to reproduce this bug (lhv-231a25f7f.img)
CPU model: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Host kernel version: 5.18.9
Host kernel arch: x86_64
Guest: a 32-bit micro-hypervisor (called LHV), which runs a 32-bit guest
(called "nested guest").
QEMU command line: qemu-system-x86_64 -m 256M -smp 1 -cpu Haswell,vmx=yes
-enable-kvm -serial stdio -drive media=disk,file=lhv-231a25f7f.img,index=1
This bug still exists when using -machine kernel_irqchip=off.
This problem cannot be tested with -accel tcg, because the guest requires
nested virtualization.
How to reproduce:
1. Download lhv-231a25f7f.img (attached to this bug). The source code of this
LHV image is at
https://github.com/lxylxy123456/uberxmhf/tree/231a25f7f49589618be0faac77a39bc593a62758
2. Run the QEMU command line above
3. See "BAD" printed on the VGA screen at row 20, columns 0-2. The last line of
serial output is:
Fatal: Halting! Condition '0 && "Guest received #UD (incorrect behavior)"'
failed, line 26, file lhv-guest.c
Expected behavior (reproducible on real hardware and Bochs):
See "GOOD" printed on the VGA screen at row 21, columns 0-3. The last line of
serial output should be:
Fatal: Halting! Condition '0 && "hypervisor receives CR3 EPT (correct
behavior)"' failed, line 375, file lhv-vmx.c
Explanation:
In KVM terms, KVM is L0, LHV is L1, nested guest is L2.
LHV runs the nested guest with:
* EPT enabled.
* Unrestricted guest enabled.
* CR0 guest/host mask (VMCS encoding 0x6000) does NOT set the CR0.PG bit.
* Most of EPT is identity mapping, but the page pointed to by nested guest's
CR3 is not present in EPT.
* The nested guest uses PAE paging.
* The nested guest enables paging by setting CR0.PG.
When the nested guest enables paging, LHV should receive an EPT violation
(correct behavior), because enabling PAE paging requires the CPU to read the
PDPTEs from the page that CR3 points to. However, in KVM, the nested guest
receives a #GP exception, as if the MOV CR0 instruction had failed.
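To make the "enabling paging requires reading CR3" step concrete, here is a minimal sketch of which nested-guest physical addresses a PAE PDPTE load touches. With PAE paging, CR3 bits 31:5 give the base of a 32-byte-aligned table of four 8-byte PDPTEs. The helper names below are hypothetical and are not KVM's; they only illustrate the address arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical helper: guest frame number of the page that holds the PDPT.
 * This is the page LHV leaves unmapped in EPT in the reproducer. */
static uint64_t pdpt_gfn(uint32_t cr3)
{
    return (uint64_t)cr3 >> PAGE_SHIFT;
}

/* Hypothetical helper: guest-physical address of PDPTE number `index`.
 * CR3 bits 4:0 are not part of the table address in PAE mode. */
static uint64_t pdpte_gpa(uint32_t cr3, unsigned index)
{
    assert(index < 4);               /* PAE has exactly four PDPTEs */
    return (uint64_t)(cr3 & 0xffffffe0u) + 8u * index;
}
```

Every one of the four addresses returned by pdpte_gpa() lies in the page identified by pdpt_gfn(), so a missing EPT mapping for that single page is enough to make the PDPTE load fault.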
Likely stack trace and cause of this bug (Linux source code version is 5.18.9):
Stack trace:
handle_cr
  kvm_set_cr0
    load_pdptrs
      kvm_translate_gpa
  kvm_complete_insn_gp
    kvm_inject_gp
What happened:
* When nested guest sets CR0.PG, handle_cr() in KVM is called.
* handle_cr() calls handle_set_cr0().
* is_guest_mode(vcpu) is true, so kvm_set_cr0() is called.
* kvm_set_cr0() calls load_pdptrs().
* load_pdptrs() calls kvm_translate_gpa().
* Since LHV does not set the page for CR3 in EPT, kvm_translate_gpa() fails.
* load_pdptrs() returns 0.
* kvm_set_cr0() returns 1.
* handle_set_cr0() returns 1.
* handle_cr() passes the nonzero result to kvm_complete_insn_gp(), which
injects #GP into the nested guest.
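The chain of return values above can be modeled with a small standalone sketch. This is a toy model, not kernel code: the toy_* functions are hypothetical stand-ins for the KVM functions named in the list, and the only behavior they reproduce is the 0/1 return convention described in this report (the "nonzero on success" convention for the PDPTR load is an assumption of the sketch).

```c
#include <assert.h>
#include <stdbool.h>

enum outcome { RESUMED, INJECTED_GP };

/* Toy stand-in for kvm_translate_gpa(): fails when L1's EPT lacks the page. */
static bool toy_translate_gpa(bool cr3_page_in_ept)
{
    return cr3_page_in_ept;
}

/* Toy load_pdptrs(): 0 on failure as in the report, 1 on success (assumed). */
static int toy_load_pdptrs(bool cr3_page_in_ept)
{
    return toy_translate_gpa(cr3_page_in_ept) ? 1 : 0;
}

/* Toy kvm_set_cr0(): 0 on success, 1 on any failure. */
static int toy_set_cr0(bool cr3_page_in_ept)
{
    return toy_load_pdptrs(cr3_page_in_ept) ? 0 : 1;
}

/* Toy handle_cr(): any nonzero result becomes #GP; the reason is lost. */
static enum outcome toy_handle_cr(bool cr3_page_in_ept)
{
    int err = toy_set_cr0(cr3_page_in_ept);
    return err ? INJECTED_GP : RESUMED;
}
```

Because every layer collapses its result to 0 or 1, the top-level handler in this model cannot tell "the translation through L1's EPT failed" apart from "the value written to CR0 was invalid", which is why the nested guest sees #GP instead of LHV seeing an EPT violation.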
--
You may reply to this email to add a comment.
You are receiving this mail because:
You are watching the assignee of the bug.
* [Bug 216212] KVM does not handle nested guest enable PAE paging correctly when CR3 is not mapped in EPT
From: bugzilla-daemon @ 2022-07-07 15:24 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=216212
Eric Li (ercli@ucdavis.edu) changed:
What |Removed |Added
----------------------------------------------------------------------------
Hardware|All |Intel
* Re: [Bug 216212] New: KVM does not handle nested guest enable PAE paging correctly when CR3 is not mapped in EPT
From: Sean Christopherson @ 2022-07-08 0:07 UTC (permalink / raw)
To: bugzilla-daemon; +Cc: kvm
On Thu, Jul 07, 2022, bugzilla-daemon@kernel.org wrote:
> Likely stack trace and cause of this bug (Linux source code version is 5.18.9):
>
> Stack trace:
>
> handle_cr
> kvm_set_cr0
> load_pdptrs
> kvm_translate_gpa
Yeah, load_pdptrs() needs to call kvm_inject_emulated_page_fault() to inject a
TDP fault if translating the L2 GPA to an L1 GPA fails. That part is easy to fix,
but communicating up the stack that the instruction has already faulted is going
to be painful due to the use of kvm_complete_insn_gp(). Ugh, and the emulator gets
involved too.
Not that it makes things worse than they already are, but I'm pretty sure MOV CR3
(via the emulator) and MOV CR4 are also affected.
I suspect the least awful solution will be to use proper error codes instead of 0/1
so that kvm_complete_insn_gp() and friends can differentiate between "success",
"injected #GP", and "already exploded", but it's still going to require a lot of
churn.
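As a rough illustration of the three-way distinction suggested above, the following sketch uses a hypothetical enum and a hypothetical completion helper. None of these names exist in KVM; they only show how a caller could act differently on "success", "inject #GP", and "a fault was already delivered".

```c
#include <assert.h>

/* Hypothetical result codes; the names are illustrative only. */
enum cr_result {
    CR_OK = 0,           /* write succeeded */
    CR_INJECT_GP,        /* caller should inject #GP */
    CR_FAULT_DELIVERED,  /* a fault (e.g. a TDP fault to L1) is already queued */
};

enum action { ACT_SKIP_INSN, ACT_QUEUE_GP, ACT_NOTHING };

/* Hypothetical analogue of a completion helper that understands three states
 * instead of a plain 0/1 error flag. */
static enum action complete_insn(enum cr_result res)
{
    switch (res) {
    case CR_OK:
        return ACT_SKIP_INSN;   /* retire the instruction */
    case CR_INJECT_GP:
        return ACT_QUEUE_GP;    /* same as today's nonzero path */
    case CR_FAULT_DELIVERED:
        return ACT_NOTHING;     /* do not add #GP on top of the fault */
    }
    assert(!"unknown cr_result");
    return ACT_NOTHING;
}
```

The churn mentioned above would come from threading such a result type through every caller that currently returns 0 or 1.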
A more drastic, but maybe less painful (though as I type this out, it's becoming
ridiculously painful) alternative would be to not intercept CR0/CR4 paging bits
when running L2 and TDP is enabled, which would in theory allow KVM to drop
the call to kvm_translate_gpa(). load_pdptrs() would still be reachable via the
emulator, but I think iff the guest is playing TLB, so KVM could probably just
resume the guest in that case.
The primary reason KVM intercepts CR0/CR4 paging bits even when using TDP is so
that KVM doesn't need to refresh state to do software gva->gpa walks, e.g. to
correctly emulate memory accesses and reserved PTE bits. The argument for
intercepting is that changing paging modes is a rare guest operation, whereas
emulating some form of memory access is relatively common. And it's also simpler
in the sense that KVM can use common code for TDP and !TDP (shadow paging heavily
depends on caching paging state).
But emulating on behalf of L2 is quite rare, and having to deal with this bug
counters the "it's simpler" argument to some extent. I _think_ ensuring the nested
MMU is properly initialized could be solved by adding a nested_gva_to_gpa() wrapper
instead of directly wiring mmu->gva_to_gpa() to the correct helper.
The messier part would be handling intercepts. VMX would have to adjust
vmcs02.CRx_READ_SHADOW and resume the guest to deal with incidental interception,
e.g. if the guest toggles both CR0.CD and CR0.PG. SVM is all or nothing for
intercepts, but PAE under NPT isn't required to load PDPTRs at MOV CR, so we could
just drop that entire path for SVM+NPT. But that would rely on KVM correctly
handling L1 NPT faults during PDPTE accesses on behalf of L2, which of course KVM
doesn't get right.
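For readers less familiar with the VMX controls involved, here is a minimal sketch of the architectural rule for the CR0/CR4 guest/host mask and read shadow: a MOV to CR0 or CR4 exits if, for any bit set in the mask, the written value differs from the read shadow, and guest reads return shadow bits for masked positions. The function names are hypothetical; only the bit logic is meant to be taken from the architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_CD (1u << 30)
#define X86_CR0_PG (1u << 31)

/* True if a MOV to the control register would cause a VM exit. */
static bool mov_to_cr_exits(uint32_t mask, uint32_t read_shadow,
                            uint32_t new_val)
{
    return ((new_val ^ read_shadow) & mask) != 0;
}

/* Value the guest observes when it reads the control register. */
static uint32_t guest_visible_cr(uint32_t mask, uint32_t read_shadow,
                                 uint32_t real)
{
    return (read_shadow & mask) | (real & ~mask);
}
```

With only CR0.CD in the mask, a write that sets just CR0.PG does not exit, but a write that toggles both CR0.CD and CR0.PG does, which is the "incidental interception" case: the hypervisor would then have to update the read shadow and resume the guest.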
So yeah, maybe KVM can avoid some of the PAE pain in the long term if KVM stops
intercepting CR0/CR4 paging bits, but it's probably a bad idea for an immediate
fix.
* [Bug 216212] KVM does not handle nested guest enable PAE paging correctly when CR3 is not mapped in EPT
From: bugzilla-daemon @ 2022-07-08 0:07 UTC (permalink / raw)
To: kvm
https://bugzilla.kernel.org/show_bug.cgi?id=216212
--- Comment #1 from Sean Christopherson (seanjc@google.com) ---
On Thu, Jul 07, 2022, bugzilla-daemon@kernel.org wrote:
> Likely stack trace and cause of this bug (Linux source code version is
> 5.18.9):
>
> Stack trace:
>
> handle_cr
> kvm_set_cr0
> load_pdptrs
> kvm_translate_gpa
Yeah, load_pdptrs() needs to call kvm_inject_emulated_page_fault() to inject a
TDP fault if translating the L2 GPA to an L1 GPA fails. That part is easy to fix,
but communicating up the stack that the instruction has already faulted is going
to be painful due to the use of kvm_complete_insn_gp(). Ugh, and the emulator gets
involved too.
Not that it makes things worse than they already are, but I'm pretty sure MOV CR3
(via the emulator) and MOV CR4 are also affected.
I suspect the least awful solution will be to use proper error codes instead of 0/1
so that kvm_complete_insn_gp() and friends can differentiate between "success",
"injected #GP", and "already exploded", but it's still going to require a lot of
churn.
A more drastic, but maybe less painful (though as I type this out, it's becoming
ridiculously painful) alternative would be to not intercept CR0/CR4 paging bits
when running L2 and TDP is enabled, which would in theory allow KVM to drop
the call to kvm_translate_gpa(). load_pdptrs() would still be reachable via the
emulator, but I think iff the guest is playing TLB, so KVM could probably just
resume the guest in that case.
The primary reason KVM intercepts CR0/CR4 paging bits even when using TDP is so
that KVM doesn't need to refresh state to do software gva->gpa walks, e.g. to
correctly emulate memory accesses and reserved PTE bits. The argument for
intercepting is that changing paging modes is a rare guest operation, whereas
emulating some form of memory access is relatively common. And it's also simpler
in the sense that KVM can use common code for TDP and !TDP (shadow paging heavily
depends on caching paging state).
But emulating on behalf of L2 is quite rare, and having to deal with this bug
counters the "it's simpler" argument to some extent. I _think_ ensuring the nested
MMU is properly initialized could be solved by adding a nested_gva_to_gpa() wrapper
instead of directly wiring mmu->gva_to_gpa() to the correct helper.
The messier part would be handling intercepts. VMX would have to adjust
vmcs02.CRx_READ_SHADOW and resume the guest to deal with incidental interception,
e.g. if the guest toggles both CR0.CD and CR0.PG. SVM is all or nothing for
intercepts, but PAE under NPT isn't required to load PDPTRs at MOV CR, so we could
just drop that entire path for SVM+NPT. But that would rely on KVM correctly
handling L1 NPT faults during PDPTE accesses on behalf of L2, which of course KVM
doesn't get right.
So yeah, maybe KVM can avoid some of the PAE pain in the long term if KVM stops
intercepting CR0/CR4 paging bits, but it's probably a bad idea for an immediate
fix.