* [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
2026-06-09 3:31 ` Sean Christopherson
2026-06-04 16:07 ` [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild Paolo Bonzini
2026-06-04 16:07 ` [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail Paolo Bonzini
2 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
To: linux-kernel, kvm; +Cc: stable
Upon a VM-entry failure that is caught by the processor rather than
KVM, nested_vmx_restore_host_state() restores L1's CR3 but not the
PDPTRs. If shadow paging is used (enable_ept is false), the L2
PDPTRs loaded during the aborted entry attempt remain in
vcpu->arch.mmu->pdptrs[].
Note that the fact that the PDPTRs are stored in the MMU does not
save the day, because KVM only uses root_mmu if enable_ept is false.
To fix this, use nested_vmx_load_cr3() instead of open coding
just the load of vcpu->arch.cr3, in the same guise as
load_vmcs12_host_state(). nested_vmx_load_cr3() will mark the
register as dirty rather than available, but this is only a
very minor pessimization.
If EPT *is* in use, do not load the PDPTRs and rely solely on
ept_save_pdptrs() to reload them from VMCS01. When vmx_load_mmu_pgd()
runs on the next entry, the PDPTRs are available---meaning they are
not incorrectly reloaded from memory.
kvm_mmu_unload() is preserved to keep the paths from the old
kvm_mmu_reset_context(), but is actually unnecessary. It can
be removed as a separate patch.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/vmx/nested.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 4690a4d23709..d612a5d071fc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4947,6 +4947,7 @@ static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
{
+ enum vm_entry_failure_code ignored;
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_msr_entry g, h;
@@ -4984,20 +4985,19 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
nested_ept_uninit_mmu_context(vcpu);
- vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
- kvm_register_mark_available(vcpu, VCPU_REG_CR3);
/*
- * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
- * from vmcs01 (if necessary). The PDPTRs are not loaded on
- * VMFail, like everything else we just need to ensure our
- * software model is up-to-date.
+ * Now that nested EPT has been disabled, load the MMU's CR3 and
+ * possibly PDPTRs from vmcs01 (if necessary). This should not
+ * happen for VMFail, but we get here if the check was caught by
+ * the processor and therefore the guest CR3 was loaded prematurely.
*/
+ kvm_mmu_unload(vcpu);
+ if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
+ nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
if (enable_ept && is_pae_paging(vcpu))
ept_save_pdptrs(vcpu);
- kvm_mmu_reset_context(vcpu);
-
/*
* This nasty bit of open coding is a compromise between blindly
* loading L1's MSRs using the exit load lists (incorrect emulation
--
2.52.0
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
@ 2026-06-09 3:31 ` Sean Christopherson
0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2026-06-09 3:31 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: linux-kernel, kvm, stable
On Thu, Jun 04, 2026, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 4690a4d23709..d612a5d071fc 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4947,6 +4947,7 @@ static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
>
> static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> {
> + enum vm_entry_failure_code ignored;
> struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> struct vmx_msr_entry g, h;
> @@ -4984,20 +4985,19 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
>
> nested_ept_uninit_mmu_context(vcpu);
> - vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> - kvm_register_mark_available(vcpu, VCPU_REG_CR3);
>
> /*
> - * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> - * from vmcs01 (if necessary). The PDPTRs are not loaded on
> - * VMFail, like everything else we just need to ensure our
> - * software model is up-to-date.
> + * Now that nested EPT has been disabled, load the MMU's CR3 and
> + * possibly PDPTRs from vmcs01 (if necessary). This should not
> + * happen for VMFail, but we get here if the check was caught by
> + * the processor and therefore the guest CR3 was loaded prematurely.
> */
> + kvm_mmu_unload(vcpu);
> + if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
> + nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
This isn't quite correct either. I mean, none of this is architecturally correct,
but this is less correct than the other incorrect code here :-)
To do this "right", KVM should snapshot the PDPTRs and shove them into the MMU,
without touching guest memory.
On a very related topic, I have a patch to stash CR3 in software instead of
abusing vmcs01.GUEST_CR3, as KVM fails to restore vmcs01.GUEST_CR3 to its proper
state if nested_vmx_enter_non_root_mode() bails after clobbering vmcs01.GUEST_CR3,
but before loading guest state. We could probably do the same thing for PDPTRs?
https://lore.kernel.org/all/20260603223418.1720035-3-seanjc@google.com
> if (enable_ept && is_pae_paging(vcpu))
> ept_save_pdptrs(vcpu);
>
> - kvm_mmu_reset_context(vcpu);
> -
> /*
> * This nasty bit of open coding is a compromise between blindly
> * loading L1's MSRs using the exit load lists (incorrect emulation
> --
> 2.52.0
>
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild
2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
2026-06-04 16:07 ` [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail Paolo Bonzini
2 siblings, 0 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
To: linux-kernel, kvm
Upon changing CR3, the MMIO cache becomes invalid because the
GVA->GPA mapping has changed. However, kvm_load_new_pgd() calls
vcpu_clear_mmio_info() call only if the fast switch succeeded.
The early-return path instead leaves the root invalid; the next entry
then calls kvm_mmu_reload() and from there kvm_mmu_load().
kvm_mmu_load() calls kvm_mmu_sync_roots(), which clears the MMIO
cache, but one combination that falls through is root_role.direct==1,
i.e. CR0.PG=0, for which kvm_mmu_sync_roots() bails before reaching the
call to vcpu_clear_mmio_info().
That combination is barely reachable: a valid direct root is pretty much
always a fast-switch success because it does not check the PGD for a
match. The early return for a direct root thus requires the current root
to already be invalid, and kvm_mmu_unload() itself clears the MMIO cache.
That said, doing an independent clear in the style of kvm_mmu_new_pgd()
is more obviously correct and basically free, so harden it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/mmu/mmu.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f8aa7eda661e..6689c9f8ae16 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6138,6 +6138,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
if (r)
goto out;
+ vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
kvm_mmu_sync_roots(vcpu);
kvm_mmu_load_pgd(vcpu);
--
2.52.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail
2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
2026-06-04 16:07 ` [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
2 siblings, 0 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
To: linux-kernel, kvm
nested_vmx_restore_host_state() is following a similar scheme to
load_vmcs12_host_state() which does not need a kvm_mmu_unload().
So, does nested_vmx_restore_host_state() need it?
The answer is no. In the shadow case, kvm_init_mmu()
in nested_vmx_load_cr3() is enough to set a root_role
with guest_mode==0. kvm_mmu_new_pgd() then is now
able to reuse an old root. In the EPT case, root_mmu
still holds L1's valid root because L2 used guest_mmu.
Removing kvm_mmu_unload() thus is marginally more
efficient and it makes the two host state restore paths
identical.
The other thing that kvm_mmu_unload() does is clearing
the MMIO GVA cache. This was ensured previously by
calling vcpu_clear_mmio_info() from kvm_mmu_load()
rather than just kvm_mmu_new_pgd().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/vmx/nested.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d612a5d071fc..8b20a5eac1c9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4992,7 +4992,6 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
* happen for VMFail, but we get here if the check was caught by
* the processor and therefore the guest CR3 was loaded prematurely.
*/
- kvm_mmu_unload(vcpu);
if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
if (enable_ept && is_pae_paging(vcpu))
--
2.52.0
^ permalink raw reply related [flat|nested] 5+ messages in thread