[PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload


* [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload
@ 2026-06-04 16:07 Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
  To: linux-kernel, kvm

Nested VMX tries to detect all possible VMFail cases before committing
to execute VMLAUNCH/VMRESUME, because the steps leading to the VMCS02
VM entry need to load various bits of the L2 processor state into
KVM's software model (vcpu->arch, essentially).

However, some cases are not caught, and some checks are even racy,
because the data comes from memory rather than from the VMCS; in those
cases the earlier load of L2 state needs to be unwound, and
nested_vmx_restore_host_state() exists for that purpose.  Sashiko
found a hole there: upon a VM-Entry failure L1's CR3 is restored,
but the PDPTRs are only restored if EPT is enabled.  If shadow paging
is used, the L2 PDPTRs from the aborted entry attempt remain in
vcpu->arch.root_mmu.pdptrs.

Fix this by forcing use of nested_vmx_load_cr3(), as
load_vmcs12_host_state() already does.
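To make the hole concrete, here is a toy model in C of the state involved, under heavy simplifying assumptions: every name (toy_vcpu, toy_mmu, toy_load_pdptrs() and the two restore helpers) is hypothetical and none of it is KVM code, and an array passed by the caller stands in for the guest page that CR3 points to.  It only shows why rolling back CR3 without reloading the PDPTRs leaves L2's values behind when EPT is off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy model only: these types and helpers are hypothetical, not KVM's.
 * A caller-provided array stands in for the guest page that CR3 points
 * to, from which PAE paging takes its four PDPTEs.
 */
struct toy_mmu {
	uint64_t pdptrs[4];
};

struct toy_vcpu {
	uint64_t cr3;
	bool ept;			/* models enable_ept */
	struct toy_mmu root_mmu;	/* shared by L1 and L2 when !ept */
};

/* Copy the PDPTEs "pointed to" by CR3 into the MMU's cache. */
static void toy_load_pdptrs(struct toy_vcpu *v, const uint64_t pdptes[4])
{
	memcpy(v->root_mmu.pdptrs, pdptes, sizeof(v->root_mmu.pdptrs));
}

/* Old behaviour: only CR3 is rolled back on a processor-detected VMFail. */
static void toy_restore_cr3_only(struct toy_vcpu *v, uint64_t l1_cr3)
{
	v->cr3 = l1_cr3;
}

/*
 * Fixed behaviour: CR3 and, for shadow paging, the PDPTRs are reloaded
 * together.  With EPT the real code re-reads the PDPTRs from vmcs01
 * instead; that path is not modelled here.
 */
static void toy_restore_cr3_and_pdptrs(struct toy_vcpu *v, uint64_t l1_cr3,
				       const uint64_t l1_pdptes[4])
{
	v->cr3 = l1_cr3;
	if (!v->ept)
		toy_load_pdptrs(v, l1_pdptes);
}
```

In the sketch, an aborted entry that loaded L2's CR3 and PDPTEs and is then unwound with toy_restore_cr3_only() still has L2's first PDPTE cached, while toy_restore_cr3_and_pdptrs() brings back L1's.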

The other two patches are, respectively, a minor
hardening/clarification and a small optimization.

Paolo Bonzini (3):
  KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
  KVM: MMU: unconditionally clear MMIO cache on root rebuild
  KVM: nVMX: remove unnecessary unload on processor-detected VMFail

 arch/x86/kvm/mmu/mmu.c    |  1 +
 arch/x86/kvm/vmx/nested.c | 15 +++++++--------
 2 files changed, 8 insertions(+), 8 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
  2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
  2026-06-09  3:31   ` Sean Christopherson
  2026-06-04 16:07 ` [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail Paolo Bonzini
  2 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: stable

Upon a VM-entry failure that is caught by the processor rather than
KVM, nested_vmx_restore_host_state() restores L1's CR3 but not the
PDPTRs.  If shadow paging is used (enable_ept is false), the L2
PDPTRs loaded during the aborted entry attempt remain in
vcpu->arch.mmu->pdptrs[].

Note that storing the PDPTRs in the MMU does not save the day: when
enable_ept is false, L1 and L2 both use root_mmu, so L2's PDPTRs
overwrite L1's.

To fix this, use nested_vmx_load_cr3() instead of open coding
just the load of vcpu->arch.cr3, as load_vmcs12_host_state()
already does.  nested_vmx_load_cr3() marks the register as dirty
rather than available, but this is only a very minor
pessimization.

If EPT *is* in use, do not load the PDPTRs and rely solely on
ept_save_pdptrs() to reload them from vmcs01.  When vmx_load_mmu_pgd()
runs on the next entry, the PDPTRs are available, meaning they are
not incorrectly reloaded from memory.
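The available/dirty distinction can be sketched as a small register cache.  This is a simplified model loosely based on the idea behind KVM's regs_avail/regs_dirty bitmaps, not the real implementation: the toy_* names are hypothetical, and the vmcs[] array merely stands in for VMCS fields.  Marking a register available means the software copy is trusted as-is; marking it dirty additionally schedules a write-back before the next entry, which is plausibly the extra cost meant by "minor pessimization".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of an available/dirty register cache; names are
 * illustrative only.  vmcs[] stands in for the hardware-backed copy.
 */
enum toy_reg { TOY_REG_CR3, TOY_REG_PDPTR, TOY_NR_REGS };

struct toy_regcache {
	uint64_t val[TOY_NR_REGS];	/* software copy */
	uint64_t vmcs[TOY_NR_REGS];	/* backing copy (models VMCS fields) */
	uint32_t avail, dirty;		/* one bit per register */
	unsigned int vmreads, vmwrites;	/* cost counters */
};

static void toy_mark_available(struct toy_regcache *c, enum toy_reg r)
{
	c->avail |= 1u << r;
}

/* Dirty implies available: the software copy is authoritative. */
static void toy_mark_dirty(struct toy_regcache *c, enum toy_reg r)
{
	c->avail |= 1u << r;
	c->dirty |= 1u << r;
}

/* Reading an unavailable register fetches it from the backing copy. */
static uint64_t toy_read(struct toy_regcache *c, enum toy_reg r)
{
	if (!(c->avail & (1u << r))) {
		c->val[r] = c->vmcs[r];
		c->vmreads++;
		toy_mark_available(c, r);
	}
	return c->val[r];
}

/* Models the pre-entry flush: only dirty registers are written back. */
static void toy_flush_before_entry(struct toy_regcache *c)
{
	for (int r = 0; r < TOY_NR_REGS; r++) {
		if (c->dirty & (1u << r)) {
			c->vmcs[r] = c->val[r];
			c->vmwrites++;
		}
	}
	c->dirty = 0;
}
```

In this model, a register marked available costs nothing further, a dirty one costs one write-back at the next flush, and an unavailable one is fetched once on first read.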

kvm_mmu_unload() is kept to preserve the behavior of the old
kvm_mmu_reset_context() call, but it is actually unnecessary.  It
can be removed in a separate patch.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 4690a4d23709..d612a5d071fc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4947,6 +4947,7 @@ static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
 
 static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 {
+	enum vm_entry_failure_code ignored;
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmx_msr_entry g, h;
@@ -4984,20 +4985,19 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
 
 	nested_ept_uninit_mmu_context(vcpu);
-	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
-	kvm_register_mark_available(vcpu, VCPU_REG_CR3);
 
 	/*
-	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
-	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
-	 * VMFail, like everything else we just need to ensure our
-	 * software model is up-to-date.
+	 * Now that nested EPT has been disabled, load the MMU's CR3 and
+	 * possibly PDPTRs from vmcs01 (if necessary).  This should not
+	 * happen for VMFail, but we get here if the check was caught by
+	 * the processor and therefore the guest CR3 was loaded prematurely.
 	 */
+	kvm_mmu_unload(vcpu);
+	if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
+		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
 	if (enable_ept && is_pae_paging(vcpu))
 		ept_save_pdptrs(vcpu);
 
-	kvm_mmu_reset_context(vcpu);
-
 	/*
 	 * This nasty bit of open coding is a compromise between blindly
 	 * loading L1's MSRs using the exit load lists (incorrect emulation
-- 
2.52.0

* Re: [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
  2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
@ 2026-06-09  3:31   ` Sean Christopherson
  0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2026-06-09  3:31 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, stable

On Thu, Jun 04, 2026, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 4690a4d23709..d612a5d071fc 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4947,6 +4947,7 @@ static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
>  
>  static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>  {
> +	enum vm_entry_failure_code ignored;
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct vmx_msr_entry g, h;
> @@ -4984,20 +4985,19 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>  	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
>  
>  	nested_ept_uninit_mmu_context(vcpu);
> -	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> -	kvm_register_mark_available(vcpu, VCPU_REG_CR3);
>  
>  	/*
> -	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> -	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> -	 * VMFail, like everything else we just need to ensure our
> -	 * software model is up-to-date.
> +	 * Now that nested EPT has been disabled, load the MMU's CR3 and
> +	 * possibly PDPTRs from vmcs01 (if necessary).  This should not
> +	 * happen for VMFail, but we get here if the check was caught by
> +	 * the processor and therefore the guest CR3 was loaded prematurely.
>  	 */
> +	kvm_mmu_unload(vcpu);
> +	if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
> +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);

This isn't quite correct either.  I mean, none of this is architecturally correct,
but this is less correct than the other incorrect code here :-)

To do this "right", KVM should snapshot the PDPTRs and shove them into the MMU,
without touching guest memory.

On a very related topic, I have a patch to stash CR3 in software instead of
abusing vmcs01.GUEST_CR3, as KVM fails to restore vmcs01.GUEST_CR3 to its proper
state if nested_vmx_enter_non_root_mode() bails after clobbering vmcs01.GUEST_CR3,
but before loading guest state.  We could probably do the same thing for PDPTRs?

https://lore.kernel.org/all/20260603223418.1720035-3-seanjc@google.com

>  	if (enable_ept && is_pae_paging(vcpu))
>  		ept_save_pdptrs(vcpu);
>  
> -	kvm_mmu_reset_context(vcpu);
> -
>  	/*
>  	 * This nasty bit of open coding is a compromise between blindly
>  	 * loading L1's MSRs using the exit load lists (incorrect emulation
> -- 
> 2.52.0
> 
> 

* [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild
  2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail Paolo Bonzini
  2 siblings, 0 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
  To: linux-kernel, kvm

Upon changing CR3, the MMIO cache becomes invalid because the
GVA->GPA mapping may have changed.  However, kvm_mmu_new_pgd() calls
vcpu_clear_mmio_info() only if the fast switch succeeded.
The early-return path instead leaves the root invalid; the next entry
then calls kvm_mmu_reload() and from there kvm_mmu_load().

kvm_mmu_load() calls kvm_mmu_sync_roots(), which clears the MMIO
cache, but one combination falls through: root_role.direct==1 (e.g.
CR0.PG=0 with shadow paging), for which kvm_mmu_sync_roots() bails
out before reaching the call to vcpu_clear_mmio_info().

That combination is barely reachable: a valid direct root is pretty much
always a fast-switch success because it does not check the PGD for a
match.  The early return for a direct root thus requires the current root
to already be invalid, and kvm_mmu_unload() itself clears the MMIO cache.

That said, doing an independent clear in the style of kvm_mmu_new_pgd()
is more obviously correct and basically free, so harden it.
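The control flow described above can be modelled in a few lines.  All names here are hypothetical stand-ins (toy_new_pgd() for kvm_mmu_new_pgd(), toy_mmu_load() for kvm_mmu_load(), toy_sync_roots() for kvm_mmu_sync_roots()), and whether the fast switch succeeds is simply passed in as a flag, so the sketch does not capture how rarely the direct-root case is actually reachable.  The hardened flag models the added vcpu_clear_mmio_info() call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are KVM functions. */
struct toy_mmio_vcpu {
	bool root_valid;
	bool root_direct;	/* models mmu->root_role.direct */
	bool mmio_cached;	/* models a cached MMIO GVA->GPA translation */
};

static void toy_clear_mmio_info(struct toy_mmio_vcpu *v)
{
	v->mmio_cached = false;
}

/* As described for kvm_mmu_sync_roots(): bails early for direct roots. */
static void toy_sync_roots(struct toy_mmio_vcpu *v)
{
	if (v->root_direct)
		return;
	toy_clear_mmio_info(v);
	/* ...syncing of unsync shadow pages would follow here... */
}

/* "hardened" models the unconditional clear added by this patch. */
static void toy_mmu_load(struct toy_mmio_vcpu *v, bool hardened)
{
	v->root_valid = true;
	if (hardened)
		toy_clear_mmio_info(v);
	toy_sync_roots(v);
}

/* CR3 change: the cache is cleared only when the fast switch succeeds. */
static void toy_new_pgd(struct toy_mmio_vcpu *v, bool fast_switch_ok)
{
	if (!fast_switch_ok) {
		v->root_valid = false;	/* rebuilt on the next entry */
		return;
	}
	toy_clear_mmio_info(v);
}

/* Next VM entry: rebuild the root if it was left invalid. */
static void toy_reload(struct toy_mmio_vcpu *v, bool hardened)
{
	if (!v->root_valid)
		toy_mmu_load(v, hardened);
}
```

With a direct root and a failed fast switch, the stale entry survives the rebuild unless hardened is set; with an indirect root, toy_sync_roots() already clears it.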

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f8aa7eda661e..6689c9f8ae16 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6138,6 +6138,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	if (r)
 		goto out;

+	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 	kvm_mmu_sync_roots(vcpu);

 	kvm_mmu_load_pgd(vcpu);
-- 
2.52.0

* [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail
  2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
  2026-06-04 16:07 ` [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild Paolo Bonzini
@ 2026-06-04 16:07 ` Paolo Bonzini
  2 siblings, 0 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-06-04 16:07 UTC (permalink / raw)
  To: linux-kernel, kvm

nested_vmx_restore_host_state() now follows a scheme similar to
load_vmcs12_host_state(), which does not need kvm_mmu_unload().
So, does nested_vmx_restore_host_state() need it?

The answer is no.  In the shadow case, the kvm_init_mmu() call in
nested_vmx_load_cr3() is enough to set a root_role with
guest_mode==0, and kvm_mmu_new_pgd() is then able to reuse an old
root.  In the EPT case, root_mmu still holds L1's valid root because
L2 used guest_mmu.

Removing kvm_mmu_unload() is thus marginally more efficient, and it
makes the two host-state restore paths identical.
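The root-reuse argument can be sketched as a lookup over the current and cached roots.  This is a hypothetical simplification, not KVM's code: the role has only two fields, and the match rule (equal role, plus a matching PGD unless the root is direct) is an assumption loosely modelled on KVM's root-matching logic and on the description of fast PGD switches earlier in this series.  Because guest_mode is part of the role, a root built for L2 is not handed back to L1, while L1's cached root matches again once the role has guest_mode==0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, much-reduced root role; real roles carry many more bits. */
struct toy_role {
	bool direct;
	bool guest_mode;
};

struct toy_root {
	bool valid;
	uint64_t pgd;
	struct toy_role role;
};

static bool toy_role_eq(struct toy_role a, struct toy_role b)
{
	return a.direct == b.direct && a.guest_mode == b.guest_mode;
}

/* A root is reusable if valid, same role, and (direct or same PGD). */
static bool toy_root_usable(const struct toy_root *root, uint64_t pgd,
			    struct toy_role role)
{
	return root->valid && toy_role_eq(root->role, role) &&
	       (role.direct || root->pgd == pgd);
}

/* Return the index of a reusable root, or -1 if a new one is needed. */
static int toy_find_root(const struct toy_root *roots, int n, uint64_t pgd,
			 struct toy_role role)
{
	for (int i = 0; i < n; i++)
		if (toy_root_usable(&roots[i], pgd, role))
			return i;
	return -1;
}
```

Here, with an L2 root current and an L1 root cached, looking up L1's CR3 with guest_mode==0 finds the cached L1 root rather than the L2 one.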

The other thing that kvm_mmu_unload() does is clear the MMIO GVA
cache.  That is now ensured by the previous patch, which calls
vcpu_clear_mmio_info() from kvm_mmu_load() rather than only from
kvm_mmu_new_pgd().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d612a5d071fc..8b20a5eac1c9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4992,7 +4992,6 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	 * happen for VMFail, but we get here if the check was caught by
 	 * the processor and therefore the guest CR3 was loaded prematurely.
 	 */
-	kvm_mmu_unload(vcpu);
 	if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
 		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
 	if (enable_ept && is_pae_paging(vcpu))
-- 
2.52.0

Thread overview: 5+ messages
2026-06-04 16:07 [PATCH 0/3] KVM: harden and cleanup PDPTR load on forced L1 reload Paolo Bonzini
2026-06-04 16:07 ` [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail Paolo Bonzini
2026-06-09  3:31   ` Sean Christopherson
2026-06-04 16:07 ` [PATCH 2/3] KVM: MMU: unconditionally clear MMIO cache on root rebuild Paolo Bonzini
2026-06-04 16:07 ` [PATCH 3/3] KVM: nVMX: remove unnecessary unload on processor-detected VMFail Paolo Bonzini
