* [PATCH 0/4] KVM: x86: small MMU-adjacent cleanups
@ 2026-05-27 12:05 Paolo Bonzini
2026-05-27 12:05 ` [PATCH 1/4] KVM: x86: remove nested_mmu from mmu_is_nested() Paolo Bonzini
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Paolo Bonzini @ 2026-05-27 12:05 UTC (permalink / raw)
To: linux-kernel, kvm
The non-MMU changes/preliminary cleanups from my "separate kvm_mmu"
series; fixed, test, and extracted for inclusion in 7.2. The rest of
the series can be included later.
The meat here is in patches 2 and 3, which prepare for having a single
copy of the PDPTRs (in vcpu->arch) instead of two (in root_mmu and
nested_mmu). Please take your time to appreciate nested virtualization
as an endless source of en{joy,ploy}ment.
Thanks,
Paolo
Paolo Bonzini (4):
KVM: x86: remove nested_mmu from mmu_is_nested()
KVM: nVMX: remove unnecessary code in prepare_vmcs02
KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions
KVM: x86: check that kvm_handle_invpcid is only invoked with shadow
paging
arch/x86/kvm/kvm_cache_regs.h | 8 ++++++++
arch/x86/kvm/svm/nested.c | 27 ++++++++++++++++++---------
arch/x86/kvm/vmx/nested.c | 13 -------------
arch/x86/kvm/x86.c | 3 +++
arch/x86/kvm/x86.h | 2 +-
5 files changed, 30 insertions(+), 23 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/4] KVM: x86: remove nested_mmu from mmu_is_nested()
2026-05-27 12:05 [PATCH 0/4] KVM: x86: small MMU-adjacent cleanups Paolo Bonzini
@ 2026-05-27 12:05 ` Paolo Bonzini
2026-05-27 12:05 ` [PATCH 2/4] KVM: nVMX: remove unnecessary code in prepare_vmcs02 Paolo Bonzini
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2026-05-27 12:05 UTC (permalink / raw)
To: linux-kernel, kvm
nested_mmu is always stored into vcpu->arch.walk_mmu at the same time as
guest_mmu is stored into vcpu->arch.mmu. But nested_mmu is not even
a proper MMU, it is only used for page walking; plus the fact that
walk_mmu has to be switched at all is just an implementation detail.
In the end what matters here is whether the guest is using nested
page tables; vmx/nested.c and svm/nested.c check it to see if they
are in nEPT or nNPT context respectively. So switch to checking
root_mmu vs. guest_mmu, which is a more cogent test.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/x86.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 38a905fa86de..60ff064de12f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -290,7 +290,7 @@ static inline bool x86_exception_has_error_code(unsigned int vector)
static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
{
- return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
+ return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
}
static inline bool is_pae(struct kvm_vcpu *vcpu)
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/4] KVM: nVMX: remove unnecessary code in prepare_vmcs02
2026-05-27 12:05 [PATCH 0/4] KVM: x86: small MMU-adjacent cleanups Paolo Bonzini
2026-05-27 12:05 ` [PATCH 1/4] KVM: x86: remove nested_mmu from mmu_is_nested() Paolo Bonzini
@ 2026-05-27 12:05 ` Paolo Bonzini
2026-05-28 0:14 ` Sean Christopherson
2026-05-27 12:05 ` [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions Paolo Bonzini
2026-05-27 12:06 ` [PATCH 4/4] KVM: x86: check that kvm_handle_invpcid is only invoked with shadow paging Paolo Bonzini
3 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2026-05-27 12:05 UTC (permalink / raw)
To: linux-kernel, kvm
The late vmwrite of the PDPTRs in prepare_vmcs02() is redundant, because
every write it does was already performed by prepare_vmcs02_rare earlier
in the same prepare_vmcs02 call.
In particular:
- load_guest_pdptrs_vmcs12 == true implies that prepare_vmcs02_rare() ran
- the assignment to load_guest_pdptrs_vmcs12 matches the gate in
prepare_vmcs02_rare(), other than having one that checks
!hv_evmcs and the other !nested_vmx_is_evmptr12_valid(vmx)
- the condition for the late move is a strict subset of the one in
prepare_vmcs02_rare(), because the former checks
nested_cpu_has_ept(vmcs12), which implies enable_ept, and on top
it narrows it further by ANDing is_pae_paging(vcpu)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/vmx/nested.c | 13 -------------
1 file changed, 13 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c1be8ef882b8..f86a12fc29ec 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2721,14 +2721,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx);
- bool load_guest_pdptrs_vmcs12 = false;
if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) {
prepare_vmcs02_rare(vmx, vmcs12);
vmx->nested.dirty_vmcs12 = false;
-
- load_guest_pdptrs_vmcs12 = !nested_vmx_is_evmptr12_valid(vmx) ||
- !(evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1);
}
if (vcpu->arch.nested_run_pending &&
@@ -2831,15 +2827,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
if (enable_ept)
vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
- /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
- if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
- is_pae_paging(vcpu)) {
- vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
- vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
- vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
- vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
- }
-
if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)) &&
WARN_ON_ONCE(__kvm_emulate_msr_write(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions
2026-05-27 12:05 [PATCH 0/4] KVM: x86: small MMU-adjacent cleanups Paolo Bonzini
2026-05-27 12:05 ` [PATCH 1/4] KVM: x86: remove nested_mmu from mmu_is_nested() Paolo Bonzini
2026-05-27 12:05 ` [PATCH 2/4] KVM: nVMX: remove unnecessary code in prepare_vmcs02 Paolo Bonzini
@ 2026-05-27 12:05 ` Paolo Bonzini
2026-05-27 23:53 ` Sean Christopherson
2026-05-27 12:06 ` [PATCH 4/4] KVM: x86: check that kvm_handle_invpcid is only invoked with shadow paging Paolo Bonzini
3 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2026-05-27 12:05 UTC (permalink / raw)
To: linux-kernel, kvm
When L2 runs under nested NPT and uses PAE paging, KVM's cached PDPTRs
in mmu->pdptrs[] can hold stale or wrong values after nested
transitions and across migration restore, because both
nested_svm_load_cr3() and svm_get_nested_state_pages() only refresh
PDPTRs on the !nested_npt path.
The user-visible bug is on migration restore of an L2 running with nested
NPT and 32-bit PAE paging, if userspace uses KVM_SET_SREGS rather than
KVM_SET_SREGS2. In that case, load_pdptrs() leaves VCPU_EXREG_PDPTR
marked as available, and kvm_pdptr_read() will use a stale translation
that used L1 GPAs instead of L2 nGPAs. svm_get_nested_state_pages()
runs on first KVM_RUN but skips the refresh because nested_npt_enabled()
is true. The CPU itself reads L2's PDPTRs correctly from memory via
L1's NPT, but KVM-side walking of guest PAE page tables uses the bogus
cached values.
Unlike Intel's GUEST_PDPTR0..3 fields in the VMCS, SVM has no
VMCB-cached PDPTR state: the in-memory PDPTEs at the current CR3 are
the only source of truth, and svm_cache_reg(VCPU_EXREG_PDPTR) simply
reloads them from memory via load_pdptrs(). Clearing the avail
bit (and the dirty bit because !avail/dirty is invalid) to force
a reload when PDPTRs as needed fixes the bug.
Do the same for nested_svm_load_cr3()'s nested_npt branch, so that
the invariant "PDPTRs need reloading" is handled similarly for both
immediate and deferred loading.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/kvm_cache_regs.h | 8 ++++++++
arch/x86/kvm/svm/nested.c | 27 ++++++++++++++++++---------
2 files changed, 26 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 2ae492ad6412..6bae5db5a54e 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -77,6 +77,14 @@ static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
return test_bit(reg, vcpu->arch.regs_dirty);
}
+static inline void kvm_register_mark_for_reload(struct kvm_vcpu *vcpu,
+ enum kvm_reg reg)
+{
+ kvm_assert_register_caching_allowed(vcpu);
+ __clear_bit(reg, vcpu->arch.regs_avail);
+ __clear_bit(reg, vcpu->arch.regs_dirty);
+}
+
static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
enum kvm_reg reg)
{
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 3d1fd1776e19..aa5a1d8ea136 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -680,9 +680,12 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3)))
return -EINVAL;
- if (reload_pdptrs && !nested_npt && is_pae_paging(vcpu) &&
- CC(!load_pdptrs(vcpu, cr3)))
- return -EINVAL;
+ if (reload_pdptrs && is_pae_paging(vcpu)) {
+ if (nested_npt)
+ kvm_register_mark_for_reload(vcpu, VCPU_REG_PDPTR);
+ else if (CC(!load_pdptrs(vcpu, cr3)))
+ return -EINVAL;
+ }
vcpu->arch.cr3 = cr3;
@@ -2055,15 +2058,21 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
if (WARN_ON(!is_guest_mode(vcpu)))
return true;
- if (!vcpu->arch.pdptrs_from_userspace &&
- !nested_npt_enabled(to_svm(vcpu)) && is_pae_paging(vcpu))
+ if (is_pae_paging(vcpu)) {
/*
- * Reload the guest's PDPTRs since after a migration
- * the guest CR3 might be restored prior to setting the nested
- * state which can lead to a load of wrong PDPTRs.
+ * After migration, CR3 may have been restored before
+ * KVM_SET_NESTED_STATE, so the PDPTR load into mmu->pdptrs[]
+ * may have treated CR3 as an L1 GPA. For nNPT, drop the
+ * cache so the next access reloads them with the proper
+ * nGPA translation. For !nNPT, reload eagerly unless userspace
+ * already supplied authoritative PDPTRs via KVM_SET_SREGS2.
*/
- if (CC(!load_pdptrs(vcpu, vcpu->arch.cr3)))
+ if (nested_npt_enabled(to_svm(vcpu)))
+ kvm_register_mark_for_reload(vcpu, VCPU_REG_PDPTR);
+ else if (!vcpu->arch.pdptrs_from_userspace &&
+ CC(!load_pdptrs(vcpu, vcpu->arch.cr3)))
return false;
+ }
if (!nested_svm_merge_msrpm(vcpu)) {
vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 4/4] KVM: x86: check that kvm_handle_invpcid is only invoked with shadow paging
2026-05-27 12:05 [PATCH 0/4] KVM: x86: small MMU-adjacent cleanups Paolo Bonzini
` (2 preceding siblings ...)
2026-05-27 12:05 ` [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions Paolo Bonzini
@ 2026-05-27 12:06 ` Paolo Bonzini
3 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2026-05-27 12:06 UTC (permalink / raw)
To: linux-kernel, kvm
This is true for both Intel and AMD. On Intel, "enable INVPCID" is
set unconditionally if supported, but the vmexit is triggered by the
"INVLPG exiting" control which is disabled by enable_ept. On AMD, KVM
can intercept INVPCID if NPT is enabled but only in order to inject #UD
in the guest.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/x86.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fc0924389398..1913efef6c39 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14350,6 +14350,9 @@ int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva)
return 1;
}
+ if (WARN_ON_ONCE(tdp_enabled))
+ return 0;
+
pcid_enabled = kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE);
switch (type) {
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions
2026-05-27 12:05 ` [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions Paolo Bonzini
@ 2026-05-27 23:53 ` Sean Christopherson
2026-05-28 8:33 ` Paolo Bonzini
0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-05-27 23:53 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: linux-kernel, kvm
On Wed, May 27, 2026, Paolo Bonzini wrote:
> When L2 runs under nested NPT and uses PAE paging, KVM's cached PDPTRs
> in mmu->pdptrs[] can hold stale or wrong values after nested
> transitions and across migration restore, because both
> nested_svm_load_cr3() and svm_get_nested_state_pages() only refresh
> PDPTRs on the !nested_npt path.
>
> The user-visible bug is on migration restore of an L2 running with nested
> NPT and 32-bit PAE paging, if userspace uses KVM_SET_SREGS rather than
> KVM_SET_SREGS2. In that case, load_pdptrs() leaves VCPU_EXREG_PDPTR
> marked as available, and kvm_pdptr_read() will use a stale translation
> that used L1 GPAs instead of L2 nGPAs. svm_get_nested_state_pages()
> runs on first KVM_RUN but skips the refresh because nested_npt_enabled()
> is true. The CPU itself reads L2's PDPTRs correctly from memory via
> L1's NPT, but KVM-side walking of guest PAE page tables uses the bogus
> cached values.
>
> Unlike Intel's GUEST_PDPTR0..3 fields in the VMCS, SVM has no
> VMCB-cached PDPTR state: the in-memory PDPTEs at the current CR3 are
> the only source of truth, and svm_cache_reg(VCPU_EXREG_PDPTR) simply
> reloads them from memory via load_pdptrs(). Clearing the avail
> bit (and the dirty bit because !avail/dirty is invalid) to force
> a reload when PDPTRs as needed fixes the bug.
>
> Do the same for nested_svm_load_cr3()'s nested_npt branch, so that
> the invariant "PDPTRs need reloading" is handled similarly for both
> immediate and deferred loading.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> arch/x86/kvm/kvm_cache_regs.h | 8 ++++++++
> arch/x86/kvm/svm/nested.c | 27 ++++++++++++++++++---------
> 2 files changed, 26 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 2ae492ad6412..6bae5db5a54e 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -77,6 +77,14 @@ static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
> return test_bit(reg, vcpu->arch.regs_dirty);
> }
>
> +static inline void kvm_register_mark_for_reload(struct kvm_vcpu *vcpu,
> + enum kvm_reg reg)
I would rather use kvm_clear_available_registers() than add yet another API,
which isn't even a good fit here since SVM never expects the PDPTRs to be dirty.
Though I think it's a moot point, because nSVM should be clearing *all*
lazy-loaded registers. It just so happens that PDPTRs are the only such "register".
I haven't checked to see if this would actually be correct, I'm just mimicking
the nVMX code. But conceptually, I think we want something like so:
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e74fcde6155e..0c6ab00766b1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1303,6 +1303,8 @@ void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
{
svm->current_vmcb = target_vmcb;
svm->vmcb = target_vmcb->ptr;
+
+ kvm_clear_available_registers(&svm->vcpu, SVM_REGS_LAZY_LOAD_SET);
}
static int svm_vcpu_precreate(struct kvm *kvm)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/4] KVM: nVMX: remove unnecessary code in prepare_vmcs02
2026-05-27 12:05 ` [PATCH 2/4] KVM: nVMX: remove unnecessary code in prepare_vmcs02 Paolo Bonzini
@ 2026-05-28 0:14 ` Sean Christopherson
2026-05-28 7:56 ` Paolo Bonzini
0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-05-28 0:14 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: linux-kernel, kvm
On Wed, May 27, 2026, Paolo Bonzini wrote:
> The late vmwrite of the PDPTRs in prepare_vmcs02() is redundant, because
> every write it does was already performed by prepare_vmcs02_rare earlier
> in the same prepare_vmcs02 call.
>
> In particular:
>
> - load_guest_pdptrs_vmcs12 == true implies that prepare_vmcs02_rare() ran
>
> - the assignment to load_guest_pdptrs_vmcs12 matches the gate in
> prepare_vmcs02_rare(), other than having one that checks
> !hv_evmcs and the other !nested_vmx_is_evmptr12_valid(vmx)
>
> - the condition for the late move is a strict subset of the one in
> prepare_vmcs02_rare(), because the former checks
> nested_cpu_has_ept(vmcs12), which implies enable_ept, and on top
> it narrows it further by ANDing is_pae_paging(vcpu)
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> arch/x86/kvm/vmx/nested.c | 13 -------------
> 1 file changed, 13 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c1be8ef882b8..f86a12fc29ec 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2721,14 +2721,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx);
> - bool load_guest_pdptrs_vmcs12 = false;
>
> if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) {
> prepare_vmcs02_rare(vmx, vmcs12);
> vmx->nested.dirty_vmcs12 = false;
> -
> - load_guest_pdptrs_vmcs12 = !nested_vmx_is_evmptr12_valid(vmx) ||
> - !(evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1);
> }
>
> if (vcpu->arch.nested_run_pending &&
> @@ -2831,15 +2827,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> if (enable_ept)
> vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
>
> - /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
> - if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
> - is_pae_paging(vcpu)) {
> - vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
> - vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
> - vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
> - vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
Hah! Fixing your own mistakes, but making it look like it was my fault. :-D
In the original version[1] of what ended up being commit c7554efc8335 ("KVM: nVMX:
Copy PDPTRs to/from vmcs12 only when necessary"), the writes in what is now
prepare_vmcs02_rare() (and I at least am still amused by the meat pun[2]) were
removed. When the mega-collection of optimizations was posted[3], the removal
of that code got dropped.
AFAICT, it was completely unintenional. Maybe a rebase goof?
I don't care terribly about the performance, and I 100% agree the extra complexity
to do the "late" loading is ugly, but I still think I'd prefer to keep the "late"
loading so that there's symmetry with the vmcs02 => vmcs12 sync, and so that we
don't try to optimize the prepare_vmcs02_rare() code and consume incorrect state
via is_pae_paging().
Alternatively, expand the comment in prepare_vmcs02_rare()?
[1] https://lore.kernel.org/all/20190507160640.4812-16-sean.j.christopherson@intel.com
[2] https://lore.kernel.org/all/525f4ee1-c111-aaa0-2dcd-6c5ce26e3088@redhat.com
[3] https://lore.kernel.org/all/1560445409-17363-31-git-send-email-pbonzini@redhat.com
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/4] KVM: nVMX: remove unnecessary code in prepare_vmcs02
2026-05-28 0:14 ` Sean Christopherson
@ 2026-05-28 7:56 ` Paolo Bonzini
0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2026-05-28 7:56 UTC (permalink / raw)
To: Sean Christopherson; +Cc: linux-kernel, kvm
On 5/28/26 02:14, Sean Christopherson wrote:
> Hah! Fixing your own mistakes, but making it look like it was my fault. :-D
>
> In the original version[1] of what ended up being commit c7554efc8335 ("KVM: nVMX:
> Copy PDPTRs to/from vmcs12 only when necessary"), the writes in what is now
> prepare_vmcs02_rare() (and I at least am still amused by the meat pun[2]) were
> removed. When the mega-collection of optimizations was posted[3], the removal
> of that code got dropped.
>
> AFAICT, it was completely unintenional. Maybe a rebase goof?
Yes, probably.
> I don't care terribly about the performance, and I 100% agree the extra complexity
> to do the "late" loading is ugly, but I still think I'd prefer to keep the "late"
> loading so that there's symmetry with the vmcs02 => vmcs12 sync, and so that we
> don't try to optimize the prepare_vmcs02_rare() code and consume incorrect state
> via is_pae_paging().
>
> Alternatively, expand the comment in prepare_vmcs02_rare()?
Ok, I will keep the late loading.
Paolo
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions
2026-05-27 23:53 ` Sean Christopherson
@ 2026-05-28 8:33 ` Paolo Bonzini
2026-05-28 18:33 ` Jim Mattson
0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2026-05-28 8:33 UTC (permalink / raw)
To: Sean Christopherson; +Cc: linux-kernel, kvm
On 5/28/26 01:53, Sean Christopherson wrote:
> I would rather use kvm_clear_available_registers() than add yet another API,
> which isn't even a good fit here since SVM never expects the PDPTRs to be dirty.
>
> Though I think it's a moot point, because nSVM should be clearing *all*
> lazy-loaded registers. It just so happens that PDPTRs are the only such "register".
>
> I haven't checked to see if this would actually be correct, I'm just mimicking
> the nVMX code. But conceptually, I think we want something like so:
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index e74fcde6155e..0c6ab00766b1 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1303,6 +1303,8 @@ void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
> {
> svm->current_vmcb = target_vmcb;
> svm->vmcb = target_vmcb->ptr;
> +
> + kvm_clear_available_registers(&svm->vcpu, SVM_REGS_LAZY_LOAD_SET);
> }
>
> static int svm_vcpu_precreate(struct kvm *kvm)
I actually started with something like that. nSVM should indeed be
clearing all other lazy-loaded registers too (except there isn't any).
I ended up doing it the other way for two reasons, both boiling down to
PDPTRs being weird.
First, the PDPTRs are cached processor state but not a field of the
VMCB; changing the VMCB should have no effect on them. For SVM, changes
to their cache state are purely a result of writes to CR3 or CR4.PAE.
Thus it seemed more explicit to single out the PAE+nNPT case when
loading CR3:
if (nested_npt)
kvm_register_mark_for_reload
else if (!load_pdptrs(...))
return -EINVAL;
instead of the slightly more opaque
if (!nested_npt && !load_pdptrs(...))
return -EINVAL;
Second, with nNPT it's not even entirely correct to cache them at all.
Strictly speaking they should never even be marked available, that way
you'd go through load_pdptrs() on every PAE memory access. Thus
kvm_register_mark_for_reload() could also be tied to the CR4 write, but
in any case not to the VMCB switch.
I'd rather keep it like this but yeah, it can be done either way. If
there were other registers in the lazy load set, relying on
svm_switch_vmcb() would make sense but you'd want to add a comment above
the two instances of "if (!nested_npt && !load_pdptrs())".
Paolo
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions
2026-05-28 8:33 ` Paolo Bonzini
@ 2026-05-28 18:33 ` Jim Mattson
0 siblings, 0 replies; 10+ messages in thread
From: Jim Mattson @ 2026-05-28 18:33 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Sean Christopherson, linux-kernel, kvm
On Thu, May 28, 2026 at 1:44 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 5/28/26 01:53, Sean Christopherson wrote:
> > I would rather use kvm_clear_available_registers() than add yet another API,
> > which isn't even a good fit here since SVM never expects the PDPTRs to be dirty.
> >
> > Though I think it's a moot point, because nSVM should be clearing *all*
> > lazy-loaded registers. It just so happens that PDPTRs are the only such "register".
> >
> > I haven't checked to see if this would actually be correct, I'm just mimicking
> > the nVMX code. But conceptually, I think we want something like so:
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index e74fcde6155e..0c6ab00766b1 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -1303,6 +1303,8 @@ void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
> > {
> > svm->current_vmcb = target_vmcb;
> > svm->vmcb = target_vmcb->ptr;
> > +
> > + kvm_clear_available_registers(&svm->vcpu, SVM_REGS_LAZY_LOAD_SET);
> > }
> >
> > static int svm_vcpu_precreate(struct kvm *kvm)
>
> I actually started with something like that. nSVM should indeed be
> clearing all other lazy-loaded registers too (except there isn't any).
> I ended up doing it the other way for two reasons, both boiling down to
> PDPTRs being weird.
>
> First, the PDPTRs are cached processor state but not a field of the
> VMCB; changing the VMCB should have no effect on them. For SVM, changes
> to their cache state are purely a result of writes to CR3 or CR4.PAE.
Huh?
The SDM says, "The behavior of PAE mode in a nested-paging guest
differs slightly from the behavior of (host-only) legacy PAE mode, in
that the guest’s four PDPEs are not loaded into the processor at the
time CR3 is written. Instead, the PDPEs are accessed on demand as part
of a table walk. This has the side-effect that illegal bit
combinations in the PDPEs are not signaled at the time that CR3 is
written, but instead when the faulty PDPE is accessed as part of a
table walk.
So, they are only cached as part of a partial walk, just like the
entries at any other level. And, unlike Intel, changes to the PDPTRs
in memory *may* be visible in a future page walk without changing CR3.
> Thus it seemed more explicit to single out the PAE+nNPT case when
> loading CR3:
>
> if (nested_npt)
> kvm_register_mark_for_reload
> else if (!load_pdptrs(...))
> return -EINVAL;
>
> instead of the slightly more opaque
>
> if (!nested_npt && !load_pdptrs(...))
> return -EINVAL;
>
> Second, with nNPT it's not even entirely correct to cache them at all.
> Strictly speaking they should never even be marked available, that way
> you'd go through load_pdptrs() on every PAE memory access. Thus
> kvm_register_mark_for_reload() could also be tied to the CR4 write, but
> in any case not to the VMCB switch.
>
> I'd rather keep it like this but yeah, it can be done either way. If
> there were other registers in the lazy load set, relying on
> svm_switch_vmcb() would make sense but you'd want to add a comment above
> the two instances of "if (!nested_npt && !load_pdptrs())".
>
> Paolo
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-05-28 18:33 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-27 12:05 [PATCH 0/4] KVM: x86: small MMU-adjacent cleanups Paolo Bonzini
2026-05-27 12:05 ` [PATCH 1/4] KVM: x86: remove nested_mmu from mmu_is_nested() Paolo Bonzini
2026-05-27 12:05 ` [PATCH 2/4] KVM: nVMX: remove unnecessary code in prepare_vmcs02 Paolo Bonzini
2026-05-28 0:14 ` Sean Christopherson
2026-05-28 7:56 ` Paolo Bonzini
2026-05-27 12:05 ` [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions Paolo Bonzini
2026-05-27 23:53 ` Sean Christopherson
2026-05-28 8:33 ` Paolo Bonzini
2026-05-28 18:33 ` Jim Mattson
2026-05-27 12:06 ` [PATCH 4/4] KVM: x86: check that kvm_handle_invpcid is only invoked with shadow paging Paolo Bonzini
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox