[PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload

Kernel KVM virtualization development

* [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload
@ 2026-06-04 16:07 Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
  To: linux-kernel, kvm

Nested VMX tries to detect all possible VMFail cases before committing
to execute VMLAUNCH/VMRESUME, because the steps leading to the VMCS02
VM entry need to load various bits of the L2 processor state into
KVM's software model (vcpu->arch, essentially).

However, some cases are not caught up front, and some are even racy
because the data comes from memory rather than from the VMCS.  In
those cases the earlier load of L2 state needs to be unwound, and
nested_vmx_restore_host_state() exists for that purpose.  Sashiko
found a hole there: L1's CR3 is restored upon a VM-Entry failure,
but the PDPTRs are only restored if EPT is enabled.  If shadow paging
is used, the L2 PDPTRs from the aborted entry attempt remain in
vcpu->arch.root_mmu.pdptrs[].

Fix this by forcing use of nested_vmx_load_cr3(), in the same guise
as load_vmcs12_host_state().

The other two patches are respectively another minor
hardening/clarification, and a small optimization.

Paolo Bonzini (3):
  KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
  KVM: MMU: unconditionally clear MMIO cache on root rebuild
  KVM: nVMX: remove unnecessary unload on processor-detected VMFail

 arch/x86/kvm/mmu/mmu.c    |  1 +
 arch/x86/kvm/vmx/nested.c | 15 +++++++--------
 2 files changed, 8 insertions(+), 8 deletions(-)

-- 
2.52.0
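
The hole described in the cover letter can be illustrated with a small
userspace model.  This is only a sketch: struct toy_vcpu, struct
toy_guest_mem and all toy_* functions are hypothetical stand-ins, not
KVM code, and the "new" unwind mirrors patch 1 as posted (it rereads
the PDPTRs from guest memory).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_PDPTRS 4

/* Hypothetical stand-in for the software model under shadow paging. */
struct toy_vcpu {
	uint64_t cr3;
	uint64_t pdptrs[TOY_NR_PDPTRS];	/* one cache shared by L1 and L2 */
};

/* Hypothetical guest memory holding one PDPT per CR3 value. */
struct toy_guest_mem {
	uint64_t l1_cr3, l2_cr3;
	uint64_t l1_pdpt[TOY_NR_PDPTRS], l2_pdpt[TOY_NR_PDPTRS];
};

static const uint64_t *toy_read_pdpt(const struct toy_guest_mem *mem,
				     uint64_t cr3)
{
	if (cr3 == mem->l1_cr3)
		return mem->l1_pdpt;
	if (cr3 == mem->l2_cr3)
		return mem->l2_pdpt;
	return NULL;
}

/* Load CR3 and, if requested, the PDPTRs it points to. */
static int toy_load_cr3(struct toy_vcpu *vcpu, const struct toy_guest_mem *mem,
			uint64_t cr3, bool reload_pdptrs)
{
	if (reload_pdptrs) {
		const uint64_t *pdpt = toy_read_pdpt(mem, cr3);

		if (!pdpt)
			return -1;
		memcpy(vcpu->pdptrs, pdpt, sizeof(vcpu->pdptrs));
	}
	vcpu->cr3 = cr3;
	return 0;
}

/* Models the old unwind: only CR3 is put back. */
static void toy_unwind_cr3_only(struct toy_vcpu *vcpu, uint64_t l1_cr3)
{
	vcpu->cr3 = l1_cr3;
}

static void toy_demo_unwind(void)
{
	struct toy_guest_mem mem = {
		.l1_cr3 = 0x1000, .l2_cr3 = 0x2000,
		.l1_pdpt = { 0x11, 0x12, 0x13, 0x14 },
		.l2_pdpt = { 0x21, 0x22, 0x23, 0x24 },
	};
	struct toy_vcpu vcpu = { 0 };
	int r;

	r = toy_load_cr3(&vcpu, &mem, mem.l1_cr3, true);	/* L1 running */
	assert(r == 0);
	r = toy_load_cr3(&vcpu, &mem, mem.l2_cr3, true);	/* aborted entry */
	assert(r == 0);

	toy_unwind_cr3_only(&vcpu, mem.l1_cr3);
	assert(vcpu.cr3 == mem.l1_cr3);
	assert(vcpu.pdptrs[0] == 0x21);		/* stale L2 PDPTR survives */

	r = toy_load_cr3(&vcpu, &mem, mem.l1_cr3, true);	/* patched unwind */
	assert(r == 0);
	assert(vcpu.pdptrs[0] == 0x11);		/* L1 PDPTR is back */
}
```

Calling toy_demo_unwind() walks through the sequence: after the
CR3-only unwind the model holds L1's CR3 together with L2's PDPTRs,
and reloading through the common helper brings the two back in sync.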


* [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
  2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
  2026-06-09  3:31   ` Sean Christopherson
  2026-06-04 16:07 ` [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail Paolo Bonzini
  2 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: stable

Upon a VM-entry failure that is caught by the processor rather than
KVM, nested_vmx_restore_host_state() restores L1's CR3 but not the
PDPTRs.  If shadow paging is used (enable_ept is false), the L2
PDPTRs loaded during the aborted entry attempt remain in
vcpu->arch.mmu->pdptrs[].

Note that the fact that the PDPTRs are stored in the MMU does not
save the day: if enable_ept is false, KVM uses root_mmu for both L1
and L2, so the L2 values overwrite L1's.

To fix this, use nested_vmx_load_cr3() instead of open coding
just the load of vcpu->arch.cr3, in the same guise as
load_vmcs12_host_state().  nested_vmx_load_cr3() will mark the
register as dirty rather than available, but this is only a
very minor pessimization.

If EPT *is* in use, do not load the PDPTRs and rely solely on
ept_save_pdptrs() to reload them from VMCS01.  When vmx_load_mmu_pgd()
runs on the next entry, the PDPTRs are already marked available, so
they are not incorrectly reloaded from memory.

kvm_mmu_unload() is preserved to match what the old
kvm_mmu_reset_context() call did, but is actually unnecessary.  It
can be removed in a separate patch.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 4690a4d23709..d612a5d071fc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4947,6 +4947,7 @@ static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
 
 static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 {
+	enum vm_entry_failure_code ignored;
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmx_msr_entry g, h;
@@ -4984,20 +4985,19 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
 
 	nested_ept_uninit_mmu_context(vcpu);
-	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
-	kvm_register_mark_available(vcpu, VCPU_REG_CR3);
 
 	/*
-	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
-	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
-	 * VMFail, like everything else we just need to ensure our
-	 * software model is up-to-date.
+	 * Now that nested EPT has been disabled, load the MMU's CR3 and
+	 * possibly PDPTRs from vmcs01 (if necessary).  This should not
+	 * happen for VMFail, but we get here if the check was caught by
+	 * the processor and therefore the guest CR3 was loaded prematurely.
 	 */
+	kvm_mmu_unload(vcpu);
+	if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
+		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
 	if (enable_ept && is_pae_paging(vcpu))
 		ept_save_pdptrs(vcpu);
 
-	kvm_mmu_reset_context(vcpu);
-
 	/*
 	 * This nasty bit of open coding is a compromise between blindly
 	 * loading L1's MSRs using the exit load lists (incorrect emulation
-- 
2.52.0




* [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild
  2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail Paolo Bonzini
  2 siblings, 0 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
  To: linux-kernel, kvm

Upon changing CR3, the MMIO cache becomes invalid because the
GVA->GPA mapping has changed.  However, kvm_mmu_new_pgd() calls
vcpu_clear_mmio_info() only if the fast switch succeeded.
The early-return path instead leaves the root invalid; the next entry
then calls kvm_mmu_reload() and from there kvm_mmu_load().

kvm_mmu_load() calls kvm_mmu_sync_roots(), which clears the MMIO
cache, but one combination that falls through is root_role.direct==1,
i.e. CR0.PG=0, for which kvm_mmu_sync_roots() bails before reaching the
call to vcpu_clear_mmio_info().

That combination is barely reachable: a valid direct root is pretty much
always a fast-switch success, because the lookup does not check a direct
root's PGD for a match.  The early return for a direct root thus requires
the current root to already be invalid, and kvm_mmu_unload() itself clears
the MMIO cache.

That said, doing an independent clear in the style of kvm_mmu_new_pgd()
is more obviously correct and basically free, so harden it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f8aa7eda661e..6689c9f8ae16 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6138,6 +6138,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	if (r)
 		goto out;

+	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 	kvm_mmu_sync_roots(vcpu);

 	kvm_mmu_load_pgd(vcpu);
-- 
2.52.0
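
The reasoning in the commit message above can be sketched as a small
userspace model.  Everything here is a hypothetical simplification
(struct toy_mmu_vcpu and the toy_* functions are not KVM APIs); the
model forces the "barely reachable" combination purely to show which
path clears the cache.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant vCPU MMU state. */
struct toy_mmu_vcpu {
	bool root_valid;
	bool root_direct;	/* models root_role.direct */
	bool mmio_cache_valid;	/* models the cached MMIO GVA info */
};

/* Models kvm_mmu_new_pgd(): clear the cache only on a fast switch. */
static void toy_new_pgd(struct toy_mmu_vcpu *v, bool fast_switch_ok)
{
	if (!fast_switch_ok) {
		v->root_valid = false;	/* early return, root left invalid */
		return;
	}
	v->mmio_cache_valid = false;
}

/* Models kvm_mmu_sync_roots(): bails before the clear for direct roots. */
static void toy_sync_roots(struct toy_mmu_vcpu *v)
{
	if (v->root_direct)
		return;
	v->mmio_cache_valid = false;
}

/* Models kvm_mmu_load(); 'hardened' selects the behavior with the patch. */
static void toy_mmu_load(struct toy_mmu_vcpu *v, bool hardened)
{
	v->root_valid = true;
	if (hardened)
		v->mmio_cache_valid = false;	/* unconditional clear */
	toy_sync_roots(v);
}

static void toy_demo_mmio_cache(void)
{
	struct toy_mmu_vcpu v = {
		.root_valid = true,
		.root_direct = true,
		.mmio_cache_valid = true,
	};

	toy_new_pgd(&v, false);		/* fast switch fails */
	assert(!v.root_valid && v.mmio_cache_valid);

	toy_mmu_load(&v, false);	/* without the patch */
	assert(v.mmio_cache_valid);	/* direct root: nothing cleared it */

	toy_mmu_load(&v, true);		/* with the patch */
	assert(!v.mmio_cache_valid);
}
```

In the model, a non-direct root is already covered by toy_sync_roots();
the extra clear only changes the outcome for the direct case.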


* [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail
  2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
  2 siblings, 0 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
  To: linux-kernel, kvm

nested_vmx_restore_host_state() is following a similar scheme to
load_vmcs12_host_state() which does not need a kvm_mmu_unload().
So, does nested_vmx_restore_host_state() need it?

The answer is no.  In the shadow case, kvm_init_mmu()
in nested_vmx_load_cr3() is enough to set a root_role
with guest_mode==0.  kvm_mmu_new_pgd() is then
able to reuse an old root.  In the EPT case, root_mmu
still holds L1's valid root because L2 used guest_mmu.

Removing kvm_mmu_unload() thus is marginally more
efficient and it makes the two host state restore paths
identical.

The other thing that kvm_mmu_unload() does is clear
the MMIO GVA cache.  This is ensured by the previous
patch, which calls vcpu_clear_mmio_info() from
kvm_mmu_load() rather than just from kvm_mmu_new_pgd().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d612a5d071fc..8b20a5eac1c9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4992,7 +4992,6 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	 * happen for VMFail, but we get here if the check was caught by
 	 * the processor and therefore the guest CR3 was loaded prematurely.
 	 */
-	kvm_mmu_unload(vcpu);
 	if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
 		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
 	if (enable_ept && is_pae_paging(vcpu))
-- 
2.52.0


* Re: [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
  2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
@ 2026-06-09  3:31   ` Sean Christopherson
  0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2026-06-09  3:31 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, stable

On Thu, Jun 04, 2026, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 4690a4d23709..d612a5d071fc 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4947,6 +4947,7 @@ static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
>  
>  static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>  {
> +	enum vm_entry_failure_code ignored;
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct vmx_msr_entry g, h;
> @@ -4984,20 +4985,19 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>  	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
>  
>  	nested_ept_uninit_mmu_context(vcpu);
> -	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> -	kvm_register_mark_available(vcpu, VCPU_REG_CR3);
>  
>  	/*
> -	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> -	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> -	 * VMFail, like everything else we just need to ensure our
> -	 * software model is up-to-date.
> +	 * Now that nested EPT has been disabled, load the MMU's CR3 and
> +	 * possibly PDPTRs from vmcs01 (if necessary).  This should not
> +	 * happen for VMFail, but we get here if the check was caught by
> +	 * the processor and therefore the guest CR3 was loaded prematurely.
>  	 */
> +	kvm_mmu_unload(vcpu);
> +	if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
> +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);

This isn't quite correct either.  I mean, none of this is architecturally correct,
but this is less correct than the other incorrect code here :-)

To do this "right", KVM should snapshot the PDPTRs and shove them into the MMU,
without touching guest memory.

On a very related topic, I have a patch to stash CR3 in software instead of
abusing vmcs01.GUEST_CR3, as KVM fails to restore vmcs01.GUEST_CR3 to its proper
state if nested_vmx_enter_non_root_mode() bails after clobbering vmcs01.GUEST_CR3,
but before loading guest state.  We could probably do the same thing for PDPTRs?

https://lore.kernel.org/all/20260603223418.1720035-3-seanjc@google.com
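
As a rough illustration of the snapshot idea above (not the linked
patch, and not KVM code), the following userspace sketch stashes L1's
CR3 and PDPTRs in software before L2 state is loaded and restores them
on unwind without reading guest memory.  All names (struct
toy_l1_snapshot, struct toy_vcpu_state, toy_snapshot_l1(),
toy_restore_l1()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_SNAP_NR_PDPTRS 4

/* Hypothetical software stash of L1 paging state. */
struct toy_l1_snapshot {
	bool valid;
	uint64_t cr3;
	uint64_t pdptrs[TOY_SNAP_NR_PDPTRS];
};

/* Hypothetical stand-in for the software model of the vCPU. */
struct toy_vcpu_state {
	uint64_t cr3;
	uint64_t pdptrs[TOY_SNAP_NR_PDPTRS];
	struct toy_l1_snapshot l1_snap;
};

/* Take the snapshot before any L2 state is loaded. */
static void toy_snapshot_l1(struct toy_vcpu_state *v)
{
	v->l1_snap.cr3 = v->cr3;
	memcpy(v->l1_snap.pdptrs, v->pdptrs, sizeof(v->pdptrs));
	v->l1_snap.valid = true;
}

/* Restore from the snapshot; returns false if none was taken. */
static bool toy_restore_l1(struct toy_vcpu_state *v)
{
	if (!v->l1_snap.valid)
		return false;
	v->cr3 = v->l1_snap.cr3;
	memcpy(v->pdptrs, v->l1_snap.pdptrs, sizeof(v->pdptrs));
	v->l1_snap.valid = false;	/* consume the snapshot */
	return true;
}

static void toy_demo_snapshot(void)
{
	struct toy_vcpu_state v = {
		.cr3 = 0x1000,
		.pdptrs = { 0x11, 0x12, 0x13, 0x14 },
	};
	bool ok;

	toy_snapshot_l1(&v);

	/* The aborted entry attempt loads L2 state. */
	v.cr3 = 0x2000;
	v.pdptrs[0] = 0x21;

	ok = toy_restore_l1(&v);
	assert(ok);
	assert(v.cr3 == 0x1000 && v.pdptrs[0] == 0x11);

	ok = toy_restore_l1(&v);	/* nothing left to restore */
	assert(!ok);
}
```

Because the restore copies from the stash, the result does not depend
on what guest memory contains at unwind time, which is the property
the suggestion is after.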

>  	if (enable_ept && is_pae_paging(vcpu))
>  		ept_save_pdptrs(vcpu);
>  
> -	kvm_mmu_reset_context(vcpu);
> -
>  	/*
>  	 * This nasty bit of open coding is a compromise between blindly
>  	 * loading L1's MSRs using the exit load lists (incorrect emulation
> -- 
> 2.52.0
> 
> 

