* [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
@ 2026-02-19 0:22 Sean Christopherson
2026-02-19 0:23 ` Sean Christopherson
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Sean Christopherson @ 2026-02-19 0:22 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Rick Edgecombe, Yosry Ahmed, Yan Zhao
Track the mask of guest physical address bits that can actually be mapped
by a given MMU instance that utilizes TDP, and either exit to userspace
with -EFAULT or go straight to emulation without creating an SPTE (for
emulated MMIO) if KVM can't map the address. Attempting to create an SPTE
can cause KVM to drop the unmappable bits, and thus install a bad SPTE.
E.g. when starting a walk, the TDP MMU will round the GFN based on the
root level, and drop the upper bits.
Exit with -EFAULT in the unlikely scenario userspace is misbehaving and
created a memslot that can't be addressed, e.g. if userspace installed
memory above the guest.MAXPHYADDR defined in CPUID, as there's nothing KVM
can do to make forward progress, and there _is_ a memslot for the address.
For emulated MMIO, KVM can at least kick the bad address out to userspace
via a normal MMIO exit.
The flaw has existed for a very long time, and was exposed by commit
988da7820206 ("KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults")
thanks to a syzkaller program that prefaults memory at GPA 0x1000000000000
and then faults in memory at GPA 0x0 (the extra-large GPA gets wrapped to
'0').
WARNING: arch/x86/kvm/mmu/tdp_mmu.c:1183 at kvm_tdp_mmu_map+0x5c3/0xa30 [kvm], CPU#125: syz.5.22/18468
CPU: 125 UID: 0 PID: 18468 Comm: syz.5.22 Tainted: G S W 6.19.0-smp--23879af241d6-next #57 NONE
Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
Hardware name: Google Izumi-EMR/izumi, BIOS 0.20250917.0-0 09/17/2025
RIP: 0010:kvm_tdp_mmu_map+0x5c3/0xa30 [kvm]
Call Trace:
<TASK>
kvm_tdp_page_fault+0x107/0x140 [kvm]
kvm_mmu_do_page_fault+0x121/0x200 [kvm]
kvm_arch_vcpu_pre_fault_memory+0x18c/0x230 [kvm]
kvm_vcpu_pre_fault_memory+0x116/0x1e0 [kvm]
kvm_vcpu_ioctl+0x3a5/0x6b0 [kvm]
__se_sys_ioctl+0x6d/0xb0
do_syscall_64+0x8d/0x900
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
In practice, the flaw is benign (other than the new WARN) as it only
affects guests that ignore guest.MAXPHYADDR (e.g. on CPUs with 52-bit
physical addresses but only 4-level paging) or guests being run by a
misbehaving userspace VMM (e.g. a VMM that ignored allow_smaller_maxphyaddr
or is pre-faulting bad addresses).
For non-TDP shadow paging, always clear the unmappable mask, as the flaw
affects only GPA-based walks. For 32-bit paging, 64-bit virtual addresses
simply don't exist. Even when software can shove a 64-bit address
somewhere, e.g. into SYSENTER_EIP, the value is architecturally truncated
before it reaches the page table walker. And for 64-bit paging, KVM's use
of 4-level vs. 5-level paging is tied to the guest's CR4.LA57, i.e. KVM
won't observe a 57-bit virtual address with a 4-level MMU.
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 6 +++++
arch/x86/kvm/mmu/mmu.c | 42 +++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ff07c45e3c73..43b9777b896d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -493,6 +493,12 @@ struct kvm_mmu {
*/
u8 permissions[16];
+ /*
+ * Mask of address bits that KVM can't map with this MMU given the root
+ * level, e.g. 5-level EPT/NPT only consume bits 51:0.
+ */
+ gpa_t unmappable_mask;
+
u64 *pae_root;
u64 *pml4_root;
u64 *pml5_root;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3911ac9bddfd..2dc9a297e6ed 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3540,6 +3540,14 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
return RET_PF_EMULATE;
+ /*
+ * Similarly, if KVM can't map the faulting address, don't attempt to
+ * install a SPTE because KVM will effectively truncate the address
+ * when walking KVM's page tables.
+ */
+ if (unlikely(fault->addr & vcpu->arch.mmu->unmappable_mask))
+ return RET_PF_EMULATE;
+
return RET_PF_CONTINUE;
}
@@ -4681,6 +4689,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
return RET_PF_RETRY;
}
+ if (fault->addr & vcpu->arch.mmu->unmappable_mask) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
/*
* Don't map L1's APIC access page into L2, KVM doesn't support
@@ -5772,6 +5785,29 @@ u8 kvm_mmu_get_max_tdp_level(void)
return tdp_root_level ? tdp_root_level : max_tdp_level;
}
+static void reset_tdp_unmappable_mask(struct kvm_mmu *mmu)
+{
+ int max_addr_bit;
+
+ switch (mmu->root_role.level) {
+ case PT64_ROOT_5LEVEL:
+ max_addr_bit = 52;
+ break;
+ case PT64_ROOT_4LEVEL:
+ max_addr_bit = 48;
+ break;
+ case PT32E_ROOT_LEVEL:
+ max_addr_bit = 32;
+ break;
+ default:
+ WARN_ONCE(1, "Unhandled root level %u\n", mmu->root_role.level);
+ mmu->unmappable_mask = 0;
+ return;
+ }
+
+ mmu->unmappable_mask = rsvd_bits(max_addr_bit, 63);
+}
+
static union kvm_mmu_page_role
kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
union kvm_cpu_role cpu_role)
@@ -5816,6 +5852,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
else
context->gva_to_gpa = paging32_gva_to_gpa;
+ reset_tdp_unmappable_mask(context);
reset_guest_paging_metadata(vcpu, context);
reset_tdp_shadow_zero_bits_mask(context);
}
@@ -5889,6 +5926,8 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
root_role.passthrough = 1;
shadow_mmu_init_context(vcpu, context, cpu_role, root_role);
+ reset_tdp_unmappable_mask(context);
+
kvm_mmu_new_pgd(vcpu, nested_cr3);
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_init_shadow_npt_mmu);
@@ -5939,6 +5978,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
update_permission_bitmask(context, true);
context->pkru_mask = 0;
+ reset_tdp_unmappable_mask(context);
reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
reset_ept_shadow_zero_bits_mask(context, execonly);
}
@@ -5954,6 +5994,8 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
kvm_init_shadow_mmu(vcpu, cpu_role);
+ context->unmappable_mask = 0;
+
context->get_guest_pgd = get_guest_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
--
2.53.0.345.g96ddfc5eaa-goog
^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-19  0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
@ 2026-02-19  0:23 ` Sean Christopherson
  [not found] ` <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2026-02-19  0:23 UTC (permalink / raw)
  To: Paolo Bonzini, kvm, linux-kernel, Rick Edgecombe, Yosry Ahmed, Yan Zhao

On Wed, Feb 18, 2026, Sean Christopherson wrote:
> Track the mask of guest physical address bits that can actually be mapped
> by a given MMU instance that utilizes TDP, and either exit to userspace
> with -EFAULT or go straight to emulation without creating an SPTE (for
> emulated MMIO) if KVM can't map the address.
>
> [...]
>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Here's the full syzkaller reproducer (found on a manual run internally, so
syzbot didn't do the work for us).  FWIW, I don't love this approach, but I
couldn't come up with anything better.
ioctl$KVM_SET_LAPIC(0xffffffffffffffff, 0x4400ae8f, &(0x7f0000000100)={"b46474f815e8d5535f0887c44335cc824dc6121bc72a77f532ff5dad4d643a9cab29d2310e04be14eb26c0af4985fe45e3b3b0680b3ec92725d74b9716e0f7c3119a2c9a0ae65ff4772e2e12733cb013c4308fe40863480747c0a7ddb9361b1578015ca1bb2c1677ebae096f08345476f567443842946ed946434c75916d1db83fe305920de65bfaf9bd940672216846cb16b8ae67cd3affc61375381f91b3b9f1cc5e38cafe5239aee71dcd481fbe1ecd2547ffbaad4469a74697c28fb9beefa6a5d736712a55eb9110c2cf7964062ba8cbc1c038e84f0f5db7fc7053118bf5221e3efa6fc3edb5d0ca3cde7054dd0751a332520aa8478b1775d552c5cc24d3c2df9eb333e5ca3aa06c1c2cf8526714f5caff2f55b41976fc20b64f1fc61d5b44f50953582a1825d32130a31abfeafd1987317879e29ac51b93c9659e023fff3ddb5e39dd19cc3ef1d883c78b9e073d08a9197fb3717df238b9831831214b186693be9dd2568bb77272e80df5dfed03e8c467627bedfbd93359a9f79a3aa37e873dc1357b37b43d813ea85267b0dc8b1c4cc51bd985328833beb2679b7fb762555bbea2da936b36f8f1673fd5f606b2b6eb23b72bf947206e8dbfeb40ca6f265a3485c8446e0f0da652860b88328073d2282c14b48a7774e62754a968b60e92205e8fafcdd70a55c3c4d1a4821ff44e6e3681f15ae091262e3a3290a24d8ceae30ebbf9d24287bb8a5d73c608d47d287f9e716cf02b4796a83fb0c05e45b89de9ef8bce834e6d7a0be6e30d2c66cb6e640cb01898454ad361bc0701d8fe56113335ae6adec59300db04691cc4a689034272a8e086a32ce7061b4f79fa8afbb48a6ce4b62bdc44af013d78980457e1fa61eb9204818606f4c3b03c0f33cd2a841ac9bc2b73151a96e31ab99e6ec969b5f2c3edd5f9abc69845e487af992758ba445368da93dae1d44360d52a534a88276b8aaf349841d8a4788c60408618437c442308dbf70efeda2e54e9b9e4fe5f76997c9dcb945a26bd75748c85d19ca8b99264dce50580e8d4dbda401dad7df31e9a7a6a3a83bfbdfb5394abd581ac0824fbcd75d2f5205c0b7c9188e6f26bfd97734d9a20433f6cdba9d14a5f32a4d97a57f4603b21146fd1aebf082e863d463c224ad623c17d8043d3bf083f0322408dd6ead6915ac6a4222ab51480eb6e11a8913348219515170d9df90d72d7363bbda3e327d19f98c0a856f98076380e788e602e8a2ae0a1930786874dc21a2e99abda15f35457cf1dcb440c4b41350d0eda352aad7f57a0adc8a6914da06460635ed21c4c11cd1a8ec778064c9f62efba2927828b23f94b16619a5520731c2c40ab8583c9f2
e73233d74b84f4877ce6b35bb1180300"})
r0 = openat$kvm(0xffffff9c, &(0x7f00000000c0), 0x0, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
ioctl$KVM_SET_USER_MEMORY_REGION(r1, 0x4020ae46, &(0x7f0000000080)={0x0, 0x0, 0x0, 0x2000, &(0x7f0000000000/0x2000)=nil})
ioctl$KVM_SET_REGS(r2, 0x4090ae82, &(0x7f0000000200)={[0x0, 0x6, 0xfffffffffffffffd, 0x0, 0xfffd, 0x1, 0x4002004c4, 0x1000, 0x0, 0x0, 0x0, 0x0, 0x3], 0x25000, 0x2011c0})
ioctl$KVM_RUN(r2, 0xae80, 0x0)
ioctl$KVM_PRE_FAULT_MEMORY(r2, 0xc040aed5, &(0x7f0000000000)={0x0, 0x18000})
ioctl$KVM_SET_PIT2(0xffffffffffffffff, 0x4070aea0, &(0x7f0000000100)={[{0x7ff, 0x93, 0x0, 0xc0, 0xc0, 0x92, 0x85, 0x8, 0x6, 0xa, 0x0, 0x7, 0x8001}, {0x5, 0x2, 0xf9, 0x8, 0x7c, 0xf, 0xd, 0x1, 0x5, 0x3, 0x7, 0xa, 0x7}, {0x7, 0x71b0, 0x3, 0x3, 0xf8, 0x1, 0x8, 0x3, 0x8, 0x82, 0xc, 0xa4, 0x6}], 0xfffffffa})

^ permalink raw reply	[flat|nested] 16+ messages in thread
[parent not found: <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>]
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  [not found] ` <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>
@ 2026-02-20 16:54 ` Sean Christopherson
  2026-02-21  0:01 ` Edgecombe, Rick P
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2026-02-20 16:54 UTC (permalink / raw)
  To: Edgecombe, Rick P; +Cc: kvm, linux-kernel

+lists, because I'm confident there's no host attack.

On Thu, Feb 19, 2026, Edgecombe, Rick P wrote:
> On Wed, 2026-02-18 at 16:22 -0800, Sean Christopherson wrote:
> > In practice, the flaw is benign (other than the new WARN) as it only
> > affects guests that ignore guest.MAXPHYADDR (e.g. on CPUs with 52-bit
> > physical addresses but only 4-level paging) or guests being run by a
> > misbehaving userspace VMM (e.g. a VMM that ignored allow_smaller_maxphyaddr
> > or is pre-faulting bad addresses).
>
> I tried to look at whether this is true from a hurt-the-host perspective.
>
> Did you consider the potential mismatch between the GFN passed to
> kvm_flush_remote_tlbs_range() and the PTEs for different GFNs that actually
> got touched?  For example in recover_huge_pages_range(), if it flushed the
> wrong range then the page table that got freed could still be in the
> intermediate translation caches?

I hadn't thought about this before you mentioned it, but I audited all code
paths, and all paths that lead to kvm_flush_remote_tlbs_range() use a
"sanitized" gfn, i.e. KVM never emits a flush for the gfn reported by the
fault.

Which meshes with a logical analysis as well: KVM only needs to flush when
removing/changing an entry, and so should always derive the to-be-flushed
ranges using the gfn that was used to make the change.  And the "bad" gfn
can never have TLB entries, because KVM never creates mappings.

FWIW, even if KVM screwed up something like recover_huge_pages_range(), it
wouldn't hurt the _host_.  Because from a host safety perspective, KVM x86
only needs to get it right in three paths: kvm_flush_shadow_all(),
__kvm_gmem_invalidate_begin(), and kvm_mmu_notifier_invalidate_range_start().

> I'm not sure how this HV flush stuff actually works in practice, especially
> on those details.  So not raising any red flags.  Just thought maybe worth
> considering.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-20 16:54 ` Sean Christopherson
@ 2026-02-21  0:01 ` Edgecombe, Rick P
  2026-02-21  0:07 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Edgecombe, Rick P @ 2026-02-21  0:01 UTC (permalink / raw)
  To: seanjc@google.com; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Fri, 2026-02-20 at 16:54 +0000, Sean Christopherson wrote:
> > Did you consider the potential mismatch between the GFN passed to
> > kvm_flush_remote_tlbs_range() and the PTEs for different GFNs that
> > actually got touched?  For example in recover_huge_pages_range(), if it
> > flushed the wrong range then the page table that got freed could still
> > be in the intermediate translation caches?
>
> I hadn't thought about this before you mentioned it, but I audited all code
> paths, and all paths that lead to kvm_flush_remote_tlbs_range() use a
> "sanitized" gfn, i.e. KVM never emits a flush for the gfn reported by the
> fault.

Doh, sorry.

> Which meshes with a logical analysis as well: KVM only needs to flush when
> removing/changing an entry, and so should always derive the to-be-flushed
> ranges using the gfn that was used to make the change.
>
> And the "bad" gfn can never have TLB entries, because KVM never creates
> mappings.

Oh.  I was under the impression that the fault gets its GPA bits stripped and
ends up creating the mapping at a different, wrong GPA.  So if some optimized
GFN-targeting flush was pointed at the unstripped GPA then it could miss the
GPA that actually got mapped and made it into the TLB.  Anyway, it seems moot.

> FWIW, even if KVM screwed up something like recover_huge_pages_range(), it
> wouldn't hurt the _host_.  Because from a host safety perspective, KVM x86
> only needs to get it right in three paths: kvm_flush_shadow_all(),
> __kvm_gmem_invalidate_begin(), and
> kvm_mmu_notifier_invalidate_range_start().

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-21  0:01 ` Edgecombe, Rick P
@ 2026-02-21  0:07 ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2026-02-21  0:07 UTC (permalink / raw)
  To: Rick P Edgecombe; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Sat, Feb 21, 2026, Rick P Edgecombe wrote:
> On Fri, 2026-02-20 at 16:54 +0000, Sean Christopherson wrote:
> > Which meshes with a logical analysis as well: KVM only needs to flush when
> > removing/changing an entry, and so should always derive the to-be-flushed
> > ranges using the gfn that was used to make the change.
> >
> > And the "bad" gfn can never have TLB entries, because KVM never creates
> > mappings.
>
> Oh.  I was under the impression that the fault gets its GPA bits stripped
> and ends up creating the mapping at a different, wrong GPA.

It does (by KVM, not by hardware).  The above is just trying to clarify that
we don't have to worry about the GFN from the fault, either.

> So if some optimized GFN-targeting flush was pointed at the unstripped GPA
> then it could miss the GPA that actually got mapped and made it into the
> TLB.  Anyway, it seems moot.

Yeah, we're on the same page.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-19  0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
  2026-02-19  0:23 ` Sean Christopherson
  [not found] ` <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>
@ 2026-02-21  0:08 ` Edgecombe, Rick P
  2026-02-21  0:49 ` Sean Christopherson
  2026-02-23 11:12 ` Huang, Kai
  2026-03-05  7:55 ` Yan Zhao
  4 siblings, 1 reply; 16+ messages in thread
From: Edgecombe, Rick P @ 2026-02-21  0:08 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Zhao, Yan Y,
	yosry.ahmed@linux.dev

On Wed, 2026-02-18 at 16:22 -0800, Sean Christopherson wrote:
> +static void reset_tdp_unmappable_mask(struct kvm_mmu *mmu)
> +{
> +	int max_addr_bit;
> +
> +	switch (mmu->root_role.level) {
> +	case PT64_ROOT_5LEVEL:
> +		max_addr_bit = 52;
> +		break;
> +	case PT64_ROOT_4LEVEL:
> +		max_addr_bit = 48;
> +		break;
> +	case PT32E_ROOT_LEVEL:
> +		max_addr_bit = 32;
> +		break;
> +	default:
> +		WARN_ONCE(1, "Unhandled root level %u\n", mmu->root_role.level);
> +		mmu->unmappable_mask = 0;

Would it be better to set max_addr_bit to 0 and let rsvd_bits() set it below?
Then the unknown case is safer about rejecting things.

> +		return;
> +	}
> +
> +	mmu->unmappable_mask = rsvd_bits(max_addr_bit, 63);
> +}
> +

Gosh, this forced me to expand my understanding of how the guest and host
page levels get glued together.  Hopefully this is not too far off...

In the patch this function is passed both guest_mmu and root_mmu.  So
sometimes it's going to be an L1 GPA address, and sometimes (for AMD nested?)
it's going to be an L2 GVA.  For the GVA case I don't see how PT32_ROOT_LEVEL
can be omitted.  It would hit the warning?

But also the '5' case is weird because as a GVA the max address bits should
be 57 and as a GPA they should be 54.  And the TDP side uses 4 and 5
specifically, so the PT64_ prefix just happens to match.

So I'd think this needs a version for GVA and one for GPA.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-21  0:08 ` Edgecombe, Rick P
@ 2026-02-21  0:49 ` Sean Christopherson
  2026-02-23 23:23 ` Edgecombe, Rick P
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2026-02-21  0:49 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: pbonzini@redhat.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yan Y Zhao, yosry.ahmed@linux.dev

On Sat, Feb 21, 2026, Rick P Edgecombe wrote:
> On Wed, 2026-02-18 at 16:22 -0800, Sean Christopherson wrote:
> > +static void reset_tdp_unmappable_mask(struct kvm_mmu *mmu)
> > +{
> > +	int max_addr_bit;
> > +
> > +	switch (mmu->root_role.level) {
> > +	case PT64_ROOT_5LEVEL:
> > +		max_addr_bit = 52;
> > +		break;
> > +	case PT64_ROOT_4LEVEL:
> > +		max_addr_bit = 48;
> > +		break;
> > +	case PT32E_ROOT_LEVEL:
> > +		max_addr_bit = 32;
> > +		break;
> > +	default:
> > +		WARN_ONCE(1, "Unhandled root level %u\n", mmu->root_role.level);
> > +		mmu->unmappable_mask = 0;
>
> Would it be better to set max_addr_bit to 0 and let rsvd_bits() set it
> below?  Then the unknown case is safer about rejecting things.

No, because speaking from experience, rejecting isn't safer (I had a brain
fart and thought legacy shadow paging was also affected).  There's no danger
to the host (other than the WARN itself), and so safety here is all about the
guest.

Setting unmappable_mask to -1ull is all but guaranteed to kill the guest,
because KVM will reject all faults.  Setting unmappable_mask to 0 is only
problematic if the guest and/or userspace is misbehaving, and even then, the
worst case scenario isn't horrific, all things considered.

> > +		return;
> > +	}
> > +
> > +	mmu->unmappable_mask = rsvd_bits(max_addr_bit, 63);
> > +}
> > +
>
> Gosh, this forced me to expand my understanding of how the guest and host
> page levels get glued together.  Hopefully this is not too far off...
>
> In the patch this function is passed both guest_mmu and root_mmu.  So
> sometimes it's going to be an L1 GPA address, and sometimes (for AMD
> nested?) it's going to be an L2 GVA.  For the GVA case I don't see how
> PT32_ROOT_LEVEL can be omitted.  It would hit the warning?

No, it's always a GPA.  root_mmu translates L1 GPA => L0 GPA and L1 GVA =>
L1 GPA*; guest_mmu translates L2 GPA => L0 GPA; nested_mmu translates L2 GVA
=> L2 GPA.

Note!  The asterisk is that root_mmu is also used when L2 is active if L1 is
NOT using TDP, either because KVM isn't using TDP, or because the L1
hypervisor decided not to.  In those cases, L2 GPA == L1 GPA from KVM's
perspective, because the L1 hypervisor is responsible for shadowing L2 GVA =>
L1 GPA.  And root_mmu can also translate L2 GPA => L0 GPA and L2 GVA => L2
GPA (again, L1 GPA == L2 GPA).

> But also the '5' case is weird because as a GVA the max address bits should
> be 57 and as a GPA they should be 54.

52, i.e. the architectural max MAXPHYADDR.

> And the TDP side uses 4 and 5 specifically, so the PT64_ prefix just
> happens to match.

No, it's not a coincidence.  The "truncation" to 52 bits is an architectural
quirk.  Long ago, people decided 52 bits of PA were enough for anyone, and so
repurposed bits 63:52 for e.g. NX, SUPPRESS_VE, and software-available bits.
I.e. conceptually, 5-level paging allows for 57 bits of addressing, but EPT
and NPT define bits 63:52 to be other things.

> So I'd think this needs a version for GVA and one for GPA.

No, see the last paragraph in the changelog.

Side topic, if you have _any_ idea for better names than guest_mmu vs.
nested_mmu, speak up.  This is like the fifth? time I've had a discussion
about how awful those names are, but we've yet to come up with names that
suck less.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-21  0:49 ` Sean Christopherson
@ 2026-02-23 23:23 ` Edgecombe, Rick P
  2026-02-24  1:49 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Edgecombe, Rick P @ 2026-02-23 23:23 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
	Zhao, Yan Y, yosry.ahmed@linux.dev

On Fri, 2026-02-20 at 16:49 -0800, Sean Christopherson wrote:
> On Sat, Feb 21, 2026, Rick P Edgecombe wrote:
> > On Wed, 2026-02-18 at 16:22 -0800, Sean Christopherson wrote:
> > > +static void reset_tdp_unmappable_mask(struct kvm_mmu *mmu)
> > > +{
> > > +	int max_addr_bit;
> > > +
> > > +	switch (mmu->root_role.level) {
> > > +	case PT64_ROOT_5LEVEL:
> > > +		max_addr_bit = 52;
> > > +		break;
> > > +	case PT64_ROOT_4LEVEL:
> > > +		max_addr_bit = 48;
> > > +		break;
> > > +	case PT32E_ROOT_LEVEL:
> > > +		max_addr_bit = 32;
> > > +		break;
> > > +	default:
> > > +		WARN_ONCE(1, "Unhandled root level %u\n", mmu->root_role.level);
> > > +		mmu->unmappable_mask = 0;
> >
> > Would it be better to set max_addr_bit to 0 and let rsvd_bits() set
> > it below?  Then the unknown case is safer about rejecting things.
>
> No, because speaking from experience, rejecting isn't safer (I had a
> brain fart and thought legacy shadow paging was also affected).
> There's no danger to the host (other than the WARN itself), and so
> safety here is all about the guest.
>
> Setting unmappable_mask to -1ull is all but guaranteed to kill the
> guest, because KVM will reject all faults.  Setting unmappable_mask
> to 0 is only problematic if the guest and/or userspace is
> misbehaving, and even then, the worst case scenario isn't horrific,
> all things considered.

Confused MM code makes me nervous, but fair enough.

> > > +		return;
> > > +	}
> > > +
> > > +	mmu->unmappable_mask = rsvd_bits(max_addr_bit, 63);
> > > +}
> > > +
> >
> > Gosh, this forced me to expand my understanding of how the guest
> > and host page levels get glued together.  Hopefully this is not too
> > far off...
> >
> > In the patch this function is passed both guest_mmu and root_mmu.
> > So sometimes it's going to be an L1 GPA address, and sometimes (for
> > AMD nested?) it's going to be an L2 GVA.  For the GVA case I don't
> > see how PT32_ROOT_LEVEL can be omitted.  It would hit the warning?
>
> No, it's always a GPA.  root_mmu translates L1 GPA => L0 GPA and L1
> GVA => L1 GPA*; guest_mmu translates L2 GPA => L0 GPA; nested_mmu
> translates L2 GVA => L2 GPA.
>
> Note!  The asterisk is that root_mmu is also used when L2 is active
> if L1 is NOT using TDP, either because KVM isn't using TDP, or
> because the L1 hypervisor decided not to.  In those cases, L2 GPA ==
> L1 GPA from KVM's perspective, because the L1 hypervisor is
> responsible for shadowing L2 GVA => L1 GPA.  And root_mmu can also
> translate L2 GPA => L0 GPA and L2 GVA => L2 GPA (again, L1 GPA == L2
> GPA).

I appreciate you taking the time to explain.  Tracing through with the
above, I realize I was under the wrong impression about how nested SVM
worked.

> > But also the '5' case is weird because as a GVA the max address
> > bits should be 57 and as a GPA they should be 54.
>
> 52, i.e. the architectural max MAXPHYADDR.

Oops, yes, I meant 52.  But if it is always a max physical address and not
trying to handle VAs too, why is PT32E_ROOT_LEVEL 32 instead of 36?  That
also sent me down the path of assuming GVAs were in the mix, but now I see
it is used for 32-bit SVM.

> [snip]
>
> > So I'd think this needs a version for GVA and one for GPA.
>
> No, see the last paragraph in the changelog.
>
> Side topic, if you have _any_ idea for better names than guest_mmu
> vs. nested_mmu, speak up.  This is like the fifth? time I've had a
> discussion about how awful those names are, but we've yet to come up
> with names that suck less.

I don't.  As above, I got confused by some wrong assumptions.  The names
seem reasonable.  Short notes on the translation input and output for each
MMU might be nice to have somewhere.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 23:23 ` Edgecombe, Rick P
@ 2026-02-24  1:49 ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2026-02-24  1:49 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
	Yan Y Zhao, yosry.ahmed@linux.dev

On Mon, Feb 23, 2026, Rick P Edgecombe wrote:
> On Fri, 2026-02-20 at 16:49 -0800, Sean Christopherson wrote:
> > > But also the '5' case is weird because as a GVA the max address bits
> > > should be 57 and as a GPA they should be 54.
> >
> > 52, i.e. the architectural max MAXPHYADDR.
>
> Oops, yes, I meant 52.  But if it is always a max physical address and not
> trying to handle VAs too, why is PT32E_ROOT_LEVEL 32 instead of 36?

Setting aside how any nNPT with a 32-bit kernel works for the moment, it
would be 52, not 36.  PT32E_ROOT_LEVEL is PAE, which per the SDM can address
52 bits of physical address space:

  PAE paging translates 32-bit linear addresses to 52-bit physical addresses.

PSE-36, a.k.a. 2-level 32-bit paging with CR4.PSE=1, is the horror that can
address 36 bits of physical address space by abusing reserved bits in the
"offset" portion of a huge 4MiB page.

Somewhat of an aside, KVM always uses 64-bit paging or PAE paging for its MMU
(or EPT, but that's basically 64-bit), and so when running on a 32-bit
kernel, KVM requires a PAE-enabled kernel to enable NPT, because hCR4 isn't
changed on VMRUN, i.e. the paging mode for KVM's MMU is tightly coupled to
the host kernel's paging mode.  Which is one of several reasons why nNPT is
a mess.

	/*
	 * KVM's MMU doesn't support using 2-level paging for itself, and thus
	 * NPT isn't supported if the host is using 2-level paging since host
	 * CR4 is unchanged on VMRUN.
	 */
	if (!IS_ENABLED(CONFIG_X86_64) && !IS_ENABLED(CONFIG_X86_PAE))
		npt_enabled = false;

As for how running a 32-bit PAE nNPT "works", I suspect it simply doesn't
from an architectural perspective.  32-bit KVM-on-KVM works (though I haven't
checked in a few years...) because Linux doesn't allocate kernel memory out
of high memory, i.e. L1 KVM won't feed "bad" addresses to L0 KVM, and
presumably QEMU doesn't manage to either.  I might be forgetting something
though?

If I get bored, or more likely when my curiosity gets the best of me, I'll
see how hardware behaves.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-19  0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
                   ` (2 preceding siblings ...)
  2026-02-21  0:08 ` Edgecombe, Rick P
@ 2026-02-23 11:12 ` Huang, Kai
  2026-02-23 16:54 ` Sean Christopherson
  2026-03-05  7:55 ` Yan Zhao
  4 siblings, 1 reply; 16+ messages in thread
From: Huang, Kai @ 2026-02-23 11:12 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com
Cc: Zhao, Yan Y, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Edgecombe, Rick P, yosry.ahmed@linux.dev

> @@ -3540,6 +3540,14 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
>  	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
>  		return RET_PF_EMULATE;
>
> +	/*
> +	 * Similarly, if KVM can't map the faulting address, don't attempt to
> +	 * install a SPTE because KVM will effectively truncate the address
> +	 * when walking KVM's page tables.
> +	 */
> +	if (unlikely(fault->addr & vcpu->arch.mmu->unmappable_mask))
> +		return RET_PF_EMULATE;
> +
>  	return RET_PF_CONTINUE;
>  }
>
> @@ -4681,6 +4689,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>  		return RET_PF_RETRY;
>  	}
>
> +	if (fault->addr & vcpu->arch.mmu->unmappable_mask) {
> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return -EFAULT;
> +	}
> +

If we set aside the case of shadow paging, do you think we should explicitly
strip the shared bit?

I think the MMU code currently always treats the shared bit as "mappable" (as
long as the real GPA is mappable), so logically it's better to strip the
shared bit first before checking the GPA.  In practice there's no problem,
because only TDX uses the shared bit, and it is within the "mappable" bits.

The odd case is if the fault->addr is an L2 GPA or L2 GVA; the shared bit
(which is a concept of the L1 guest) doesn't apply to it.

Btw, from hardware's point of view, does EPT/NPT silently drop the high
unmappable bits of the GPA, or does it generate some kind of EPT
violation/misconfig?  I tried to confirm from the spec, but I'm not sure.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 11:12 ` Huang, Kai
@ 2026-02-23 16:54   ` Sean Christopherson
  2026-02-23 20:48     ` Huang, Kai
  0 siblings, 1 reply; 16+ messages in thread

From: Sean Christopherson @ 2026-02-23 16:54 UTC (permalink / raw)
To: Kai Huang
Cc: pbonzini@redhat.com, Yan Y Zhao, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, Rick P Edgecombe, yosry.ahmed@linux.dev

On Mon, Feb 23, 2026, Kai Huang wrote:
> > @@ -3540,6 +3540,14 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> >  	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
> >  		return RET_PF_EMULATE;
> >  
> > +	/*
> > +	 * Similarly, if KVM can't map the faulting address, don't attempt to
> > +	 * install a SPTE because KVM will effectively truncate the address
> > +	 * when walking KVM's page tables.
> > +	 */
> > +	if (unlikely(fault->addr & vcpu->arch.mmu->unmappable_mask))
> > +		return RET_PF_EMULATE;
> > +
> >  	return RET_PF_CONTINUE;
> >  }
> >  
> > @@ -4681,6 +4689,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
> >  		return RET_PF_RETRY;
> >  	}
> >  
> > +	if (fault->addr & vcpu->arch.mmu->unmappable_mask) {
> > +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > +		return -EFAULT;
> > +	}
> > +
> 
> Setting aside the case of shadow paging, do you think we should explicitly
> strip the shared bit?
> 
> I think the MMU code currently always treats the shared bit as "mappable"
> (as long as the real GPA is mappable), so logically it's better to strip the
> shared bit first before checking the GPA.  But in practice there's no
> problem because only TDX uses a shared bit and it is within the 'mappable'
> bits.

I don't think so.  Even though the SHARED bit has special semantics, it's
still very much an address bit in the current architecture.

> But the odd thing is that if fault->addr is an L2 GPA or L2 GVA, then the
> shared bit (which is a concept of the L1 guest) doesn't apply to it.
> 
> Btw, from hardware's point of view, does EPT/NPT silently drop the high
> unmappable bits of the GPA, or does it generate some kind of EPT
> violation/misconfig?

EPT violation.  The SDM says:

  With 4-level EPT, bits 51:48 of the guest-physical address must all be zero;
  otherwise, an EPT violation occurs (see Section 30.3.3).

I can't find anything in the APM (shocker, /s) that clarifies the exact NPT
behavior.  It barely even alludes to the use of hCR4.LA57 for controlling the
depth of the walk.  But I'm fairly certain NPT behaves identically.
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 16:54 ` Sean Christopherson
@ 2026-02-23 20:48   ` Huang, Kai
  2026-02-23 21:25     ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread

From: Huang, Kai @ 2026-02-23 20:48 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
    Zhao, Yan Y, Edgecombe, Rick P, yosry.ahmed@linux.dev

On Mon, 2026-02-23 at 08:54 -0800, Sean Christopherson wrote:
> On Mon, Feb 23, 2026, Kai Huang wrote:
> > > @@ -3540,6 +3540,14 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> > >  	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
> > >  		return RET_PF_EMULATE;
> > >  
> > > +	/*
> > > +	 * Similarly, if KVM can't map the faulting address, don't attempt to
> > > +	 * install a SPTE because KVM will effectively truncate the address
> > > +	 * when walking KVM's page tables.
> > > +	 */
> > > +	if (unlikely(fault->addr & vcpu->arch.mmu->unmappable_mask))
> > > +		return RET_PF_EMULATE;
> > > +
> > >  	return RET_PF_CONTINUE;
> > >  }
> > >  
> > > @@ -4681,6 +4689,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
> > >  		return RET_PF_RETRY;
> > >  	}
> > >  
> > > +	if (fault->addr & vcpu->arch.mmu->unmappable_mask) {
> > > +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > 
> > Setting aside the case of shadow paging, do you think we should explicitly
> > strip the shared bit?
> > 
> > I think the MMU code currently always treats the shared bit as "mappable"
> > (as long as the real GPA is mappable), so logically it's better to strip
> > the shared bit first before checking the GPA.  But in practice there's no
> > problem because only TDX uses a shared bit and it is within the 'mappable'
> > bits.
> 
> I don't think so.  Even though the SHARED bit has special semantics, it's
> still very much an address bit in the current architecture.

I guess we can safely assume this is true for Intel.  I'm not sure about AMD,
but AMD doesn't use a shared bit in KVM, so it's safe in practice either way.

> > But the odd thing is that if fault->addr is an L2 GPA or L2 GVA, then the
> > shared bit (which is a concept of the L1 guest) doesn't apply to it.
> > 
> > Btw, from hardware's point of view, does EPT/NPT silently drop the high
> > unmappable bits of the GPA, or does it generate some kind of EPT
> > violation/misconfig?
> 
> EPT violation.  The SDM says:
> 
>   With 4-level EPT, bits 51:48 of the guest-physical address must all be zero;
>   otherwise, an EPT violation occurs (see Section 30.3.3).
> 
> I can't find anything in the APM (shocker, /s) that clarifies the exact NPT
> behavior.  It barely even alludes to the use of hCR4.LA57 for controlling
> the depth of the walk.  But I'm fairly certain NPT behaves identically.

Then in the case of nested EPT (ditto for NPT), shouldn't L0 emulate a VMEXIT
to L1 if fault->addr exceeds the mappable bits?
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 20:48 ` Huang, Kai
@ 2026-02-23 21:25   ` Sean Christopherson
  2026-02-23 21:44     ` Huang, Kai
  0 siblings, 1 reply; 16+ messages in thread

From: Sean Christopherson @ 2026-02-23 21:25 UTC (permalink / raw)
To: Kai Huang
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
    Yan Y Zhao, Rick P Edgecombe, yosry.ahmed@linux.dev

On Mon, Feb 23, 2026, Kai Huang wrote:
> On Mon, 2026-02-23 at 08:54 -0800, Sean Christopherson wrote:
> > On Mon, Feb 23, 2026, Kai Huang wrote:
> > > But the odd thing is that if fault->addr is an L2 GPA or L2 GVA, then
> > > the shared bit (which is a concept of the L1 guest) doesn't apply to it.
> > > 
> > > Btw, from hardware's point of view, does EPT/NPT silently drop the high
> > > unmappable bits of the GPA, or does it generate some kind of EPT
> > > violation/misconfig?
> > 
> > EPT violation.  The SDM says:
> > 
> >   With 4-level EPT, bits 51:48 of the guest-physical address must all be zero;
> >   otherwise, an EPT violation occurs (see Section 30.3.3).
> > 
> > I can't find anything in the APM (shocker, /s) that clarifies the exact
> > NPT behavior.  It barely even alludes to the use of hCR4.LA57 for
> > controlling the depth of the walk.  But I'm fairly certain NPT behaves
> > identically.
> 
> Then in the case of nested EPT (ditto for NPT), shouldn't L0 emulate a
> VMEXIT to L1 if fault->addr exceeds the mappable bits?

Huh.  Yes, for sure.  I was expecting FNAME(walk_addr_generic) to handle that,
but AFAICT it doesn't.  Assuming I'm not missing something, that should be
fixed before landing this patch, otherwise I believe KVM would terminate the
entire VM if L2 accesses memory that L1 can't map.
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 21:25 ` Sean Christopherson
@ 2026-02-23 21:44   ` Huang, Kai
  0 siblings, 0 replies; 16+ messages in thread

From: Huang, Kai @ 2026-02-23 21:44 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
    Zhao, Yan Y, Edgecombe, Rick P, yosry.ahmed@linux.dev

On Mon, 2026-02-23 at 13:25 -0800, Sean Christopherson wrote:
> On Mon, Feb 23, 2026, Kai Huang wrote:
> > Then in the case of nested EPT (ditto for NPT), shouldn't L0 emulate a
> > VMEXIT to L1 if fault->addr exceeds the mappable bits?
> 
> Huh.  Yes, for sure.  I was expecting FNAME(walk_addr_generic) to handle
> that, but AFAICT it doesn't.

AFAICT too.  It goes straight to the page table walk without checking
beforehand.

> Assuming I'm not missing something, that should be fixed before landing
> this patch, otherwise I believe KVM would terminate the entire VM if L2
> accesses memory that L1 can't map.

Yeah, agreed.
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-19  0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
                   ` (3 preceding siblings ...)
  2026-02-23 11:12 ` Huang, Kai
@ 2026-03-05  7:55 ` Yan Zhao
  2026-03-06 22:22   ` Sean Christopherson
  4 siblings, 1 reply; 16+ messages in thread

From: Yan Zhao @ 2026-03-05  7:55 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Rick Edgecombe, Yosry Ahmed

On Wed, Feb 18, 2026 at 04:22:41PM -0800, Sean Christopherson wrote:
> Track the mask of guest physical address bits that can actually be mapped
> by a given MMU instance that utilizes TDP, and either exit to userspace
> with -EFAULT or go straight to emulation without creating an SPTE (for
> emulated MMIO) if KVM can't map the address.  Attempting to create an SPTE
> can cause KVM to drop the unmappable bits, and thus install a bad SPTE.
> E.g. when starting a walk, the TDP MMU will round the GFN based on the
> root level, and drop the upper bits.
> 
> Exit with -EFAULT in the unlikely scenario userspace is misbehaving and
> created a memslot that can't be addressed, e.g. if userspace installed
> memory above the guest.MAXPHYADDR defined in CPUID, as there's nothing KVM
> can do to make forward progress, and there _is_ a memslot for the address.
> For emulated MMIO, KVM can at least kick the bad address out to userspace
> via a normal MMIO exit.
> 
> The flaw has existed for a very long time, and was exposed by commit
> 988da7820206 ("KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults")
> thanks to a syzkaller program that prefaults memory at GPA 0x1000000000000
> and then faults in memory at GPA 0x0 (the extra-large GPA gets wrapped to
> '0').

If the scenario is reversed, i.e. with A/D bits disabled, prefault memory at
GPA 0x0 and then the guest reads memory at GPA 0x1000000000000, would
fast_page_fault() fix up the wrong, wrapped sptep for GPA 0x1000000000000?

Do we need to check fault->addr in fast_page_fault() as well?

> WARNING: arch/x86/kvm/mmu/tdp_mmu.c:1183 at kvm_tdp_mmu_map+0x5c3/0xa30 [kvm], CPU#125: syz.5.22/18468
> CPU: 125 UID: 0 PID: 18468 Comm: syz.5.22 Tainted: G S W 6.19.0-smp--23879af241d6-next #57 NONE
> Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> Hardware name: Google Izumi-EMR/izumi, BIOS 0.20250917.0-0 09/17/2025
> RIP: 0010:kvm_tdp_mmu_map+0x5c3/0xa30 [kvm]
> Call Trace:
>  <TASK>
>  kvm_tdp_page_fault+0x107/0x140 [kvm]
>  kvm_mmu_do_page_fault+0x121/0x200 [kvm]
>  kvm_arch_vcpu_pre_fault_memory+0x18c/0x230 [kvm]
>  kvm_vcpu_pre_fault_memory+0x116/0x1e0 [kvm]
>  kvm_vcpu_ioctl+0x3a5/0x6b0 [kvm]
>  __se_sys_ioctl+0x6d/0xb0
>  do_syscall_64+0x8d/0x900
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>  </TASK>
> 
> In practice, the flaw is benign (other than the new WARN) as it only
> affects guests that ignore guest.MAXPHYADDR (e.g. on CPUs with 52-bit
> physical addresses but only 4-level paging) or guests being run by a
> misbehaving userspace VMM (e.g. a VMM that ignored allow_smaller_maxphyaddr
> or is pre-faulting bad addresses).
> 
> For non-TDP shadow paging, always clear the unmappable mask as the flaw
> only affects GPAs.  For 32-bit paging, 64-bit virtual addresses simply
> don't exist.  Even when software can shove a 64-bit address somewhere,
> e.g. into SYSENTER_EIP, the value is architecturally truncated before it
> reaches the page table walker.  And for 64-bit paging, KVM's use of
> 4-level vs. 5-level paging is tied to the guest's CR4.LA57, i.e. KVM
> won't observe a 57-bit virtual address with a 4-level MMU.
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-03-05  7:55 ` Yan Zhao
@ 2026-03-06 22:22   ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread

From: Sean Christopherson @ 2026-03-06 22:22 UTC (permalink / raw)
To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Rick Edgecombe, Yosry Ahmed

On Thu, Mar 05, 2026, Yan Zhao wrote:
> On Wed, Feb 18, 2026 at 04:22:41PM -0800, Sean Christopherson wrote:
> > The flaw has existed for a very long time, and was exposed by commit
> > 988da7820206 ("KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults")
> > thanks to a syzkaller program that prefaults memory at GPA 0x1000000000000
> > and then faults in memory at GPA 0x0 (the extra-large GPA gets wrapped to
> > '0').
> 
> If the scenario is reversed, i.e. with A/D bits disabled, prefault memory at
> GPA 0x0 and then the guest reads memory at GPA 0x1000000000000, would
> fast_page_fault() fix up the wrong, wrapped sptep for GPA 0x1000000000000?
> 
> Do we need to check fault->addr in fast_page_fault() as well?

Ugh, yeah, good catch!
Thread overview: 16+ messages
-- links below jump to the message on this page --
2026-02-19 0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
2026-02-19 0:23 ` Sean Christopherson
[not found] ` <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>
2026-02-20 16:54 ` Sean Christopherson
2026-02-21 0:01 ` Edgecombe, Rick P
2026-02-21 0:07 ` Sean Christopherson
2026-02-21 0:08 ` Edgecombe, Rick P
2026-02-21 0:49 ` Sean Christopherson
2026-02-23 23:23 ` Edgecombe, Rick P
2026-02-24 1:49 ` Sean Christopherson
2026-02-23 11:12 ` Huang, Kai
2026-02-23 16:54 ` Sean Christopherson
2026-02-23 20:48 ` Huang, Kai
2026-02-23 21:25 ` Sean Christopherson
2026-02-23 21:44 ` Huang, Kai
2026-03-05 7:55 ` Yan Zhao
2026-03-06 22:22 ` Sean Christopherson