* [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
@ 2026-02-19 0:22 Sean Christopherson
2026-02-19 0:23 ` Sean Christopherson
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Sean Christopherson @ 2026-02-19 0:22 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Rick Edgecombe, Yosry Ahmed, Yan Zhao
Track the mask of guest physical address bits that can actually be mapped
by a given MMU instance that utilizes TDP, and either exit to userspace
with -EFAULT or go straight to emulation without creating an SPTE (for
emulated MMIO) if KVM can't map the address. Attempting to create an SPTE
can cause KVM to drop the unmappable bits, and thus install a bad SPTE.
E.g. when starting a walk, the TDP MMU will round the GFN based on the
root level, and drop the upper bits.
Exit with -EFAULT in the unlikely scenario userspace is misbehaving and
created a memslot that can't be addressed, e.g. if userspace installed
memory above the guest.MAXPHYADDR defined in CPUID, as there's nothing KVM
can do to make forward progress, and there _is_ a memslot for the address.
For emulated MMIO, KVM can at least kick the bad address out to userspace
via a normal MMIO exit.
The flaw has existed for a very long time, and was exposed by commit
988da7820206 ("KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults")
thanks to a syzkaller program that prefaults memory at GPA 0x1000000000000
and then faults in memory at GPA 0x0 (the extra-large GPA gets wrapped to
'0').
WARNING: arch/x86/kvm/mmu/tdp_mmu.c:1183 at kvm_tdp_mmu_map+0x5c3/0xa30 [kvm], CPU#125: syz.5.22/18468
CPU: 125 UID: 0 PID: 18468 Comm: syz.5.22 Tainted: G S W 6.19.0-smp--23879af241d6-next #57 NONE
Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
Hardware name: Google Izumi-EMR/izumi, BIOS 0.20250917.0-0 09/17/2025
RIP: 0010:kvm_tdp_mmu_map+0x5c3/0xa30 [kvm]
Call Trace:
<TASK>
kvm_tdp_page_fault+0x107/0x140 [kvm]
kvm_mmu_do_page_fault+0x121/0x200 [kvm]
kvm_arch_vcpu_pre_fault_memory+0x18c/0x230 [kvm]
kvm_vcpu_pre_fault_memory+0x116/0x1e0 [kvm]
kvm_vcpu_ioctl+0x3a5/0x6b0 [kvm]
__se_sys_ioctl+0x6d/0xb0
do_syscall_64+0x8d/0x900
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
In practice, the flaw is benign (other than the new WARN) as it only
affects guests that ignore guest.MAXPHYADDR (e.g. on CPUs with 52-bit
physical addresses but only 4-level paging) or guests being run by a
misbehaving userspace VMM (e.g. a VMM that ignored allow_smaller_maxphyaddr
or is pre-faulting bad addresses).
For non-TDP shadow paging, always clear the unmappable mask, as the flaw
affects only GPA-based walks. For 32-bit paging, 64-bit virtual addresses
simply don't exist. Even when software can shove a 64-bit address
somewhere, e.g. into SYSENTER_EIP, the value is architecturally truncated
before it reaches the page table walker. And for 64-bit paging, KVM's use
of 4-level vs. 5-level paging is tied to the guest's CR4.LA57, i.e. KVM
won't observe a 57-bit virtual address with a 4-level MMU.
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 6 +++++
arch/x86/kvm/mmu/mmu.c | 42 +++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ff07c45e3c73..43b9777b896d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -493,6 +493,12 @@ struct kvm_mmu {
*/
u8 permissions[16];
+ /*
+ * Mask of address bits that KVM can't map with this MMU given the root
+ * level, e.g. 5-level EPT/NPT only consume bits 51:0.
+ */
+ gpa_t unmappable_mask;
+
u64 *pae_root;
u64 *pml4_root;
u64 *pml5_root;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3911ac9bddfd..2dc9a297e6ed 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3540,6 +3540,14 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
return RET_PF_EMULATE;
+ /*
+ * Similarly, if KVM can't map the faulting address, don't attempt to
+ * install a SPTE because KVM will effectively truncate the address
+ * when walking KVM's page tables.
+ */
+ if (unlikely(fault->addr & vcpu->arch.mmu->unmappable_mask))
+ return RET_PF_EMULATE;
+
return RET_PF_CONTINUE;
}
@@ -4681,6 +4689,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
return RET_PF_RETRY;
}
+ if (fault->addr & vcpu->arch.mmu->unmappable_mask) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
/*
* Don't map L1's APIC access page into L2, KVM doesn't support
@@ -5772,6 +5785,29 @@ u8 kvm_mmu_get_max_tdp_level(void)
return tdp_root_level ? tdp_root_level : max_tdp_level;
}
+static void reset_tdp_unmappable_mask(struct kvm_mmu *mmu)
+{
+ int max_addr_bit;
+
+ switch (mmu->root_role.level) {
+ case PT64_ROOT_5LEVEL:
+ max_addr_bit = 52;
+ break;
+ case PT64_ROOT_4LEVEL:
+ max_addr_bit = 48;
+ break;
+ case PT32E_ROOT_LEVEL:
+ max_addr_bit = 32;
+ break;
+ default:
+ WARN_ONCE(1, "Unhandled root level %u\n", mmu->root_role.level);
+ mmu->unmappable_mask = 0;
+ return;
+ }
+
+ mmu->unmappable_mask = rsvd_bits(max_addr_bit, 63);
+}
+
static union kvm_mmu_page_role
kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
union kvm_cpu_role cpu_role)
@@ -5816,6 +5852,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
else
context->gva_to_gpa = paging32_gva_to_gpa;
+ reset_tdp_unmappable_mask(context);
reset_guest_paging_metadata(vcpu, context);
reset_tdp_shadow_zero_bits_mask(context);
}
@@ -5889,6 +5926,8 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
root_role.passthrough = 1;
shadow_mmu_init_context(vcpu, context, cpu_role, root_role);
+ reset_tdp_unmappable_mask(context);
+
kvm_mmu_new_pgd(vcpu, nested_cr3);
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_init_shadow_npt_mmu);
@@ -5939,6 +5978,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
update_permission_bitmask(context, true);
context->pkru_mask = 0;
+ reset_tdp_unmappable_mask(context);
reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
reset_ept_shadow_zero_bits_mask(context, execonly);
}
@@ -5954,6 +5994,8 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
kvm_init_shadow_mmu(vcpu, cpu_role);
+ context->unmappable_mask = 0;
+
context->get_guest_pgd = get_guest_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
--
2.53.0.345.g96ddfc5eaa-goog
^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-19  0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
@ 2026-02-19  0:23 ` Sean Christopherson
  [not found] ` <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2026-02-19  0:23 UTC (permalink / raw)
  To: Paolo Bonzini, kvm, linux-kernel, Rick Edgecombe, Yosry Ahmed, Yan Zhao

On Wed, Feb 18, 2026, Sean Christopherson wrote:
> Track the mask of guest physical address bits that can actually be mapped
> by a given MMU instance that utilizes TDP, and either exit to userspace
> with -EFAULT or go straight to emulation without creating an SPTE (for
> emulated MMIO) if KVM can't map the address.
>
> [...]
>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Here's the full syzkaller reproducer (found on a manual run internally, so
syzbot didn't do the work for us).  FWIW, I don't love this approach, but I
couldn't come up with anything better.
ioctl$KVM_SET_LAPIC(0xffffffffffffffff, 0x4400ae8f, &(0x7f0000000100)={"b46474f815e8d5535f0887c44335cc824dc6121bc72a77f532ff5dad4d643a9cab29d2310e04be14eb26c0af4985fe45e3b3b0680b3ec92725d74b9716e0f7c3119a2c9a0ae65ff4772e2e12733cb013c4308fe40863480747c0a7ddb9361b1578015ca1bb2c1677ebae096f08345476f567443842946ed946434c75916d1db83fe305920de65bfaf9bd940672216846cb16b8ae67cd3affc61375381f91b3b9f1cc5e38cafe5239aee71dcd481fbe1ecd2547ffbaad4469a74697c28fb9beefa6a5d736712a55eb9110c2cf7964062ba8cbc1c038e84f0f5db7fc7053118bf5221e3efa6fc3edb5d0ca3cde7054dd0751a332520aa8478b1775d552c5cc24d3c2df9eb333e5ca3aa06c1c2cf8526714f5caff2f55b41976fc20b64f1fc61d5b44f50953582a1825d32130a31abfeafd1987317879e29ac51b93c9659e023fff3ddb5e39dd19cc3ef1d883c78b9e073d08a9197fb3717df238b9831831214b186693be9dd2568bb77272e80df5dfed03e8c467627bedfbd93359a9f79a3aa37e873dc1357b37b43d813ea85267b0dc8b1c4cc51bd985328833beb2679b7fb762555bbea2da936b36f8f1673fd5f606b2b6eb23b72bf947206e8dbfeb40ca6f265a3485c8446e0f0da652860b88328073d2282c14b48a7774e62754a968b60e92205e8fafcdd70a55c3c4d1a4821ff44e6e3681f15ae091262e3a3290a24d8ceae30ebbf9d24287bb8a5d73c608d47d287f9e716cf02b4796a83fb0c05e45b89de9ef8bce834e6d7a0be6e30d2c66cb6e640cb01898454ad361bc0701d8fe56113335ae6adec59300db04691cc4a689034272a8e086a32ce7061b4f79fa8afbb48a6ce4b62bdc44af013d78980457e1fa61eb9204818606f4c3b03c0f33cd2a841ac9bc2b73151a96e31ab99e6ec969b5f2c3edd5f9abc69845e487af992758ba445368da93dae1d44360d52a534a88276b8aaf349841d8a4788c60408618437c442308dbf70efeda2e54e9b9e4fe5f76997c9dcb945a26bd75748c85d19ca8b99264dce50580e8d4dbda401dad7df31e9a7a6a3a83bfbdfb5394abd581ac0824fbcd75d2f5205c0b7c9188e6f26bfd97734d9a20433f6cdba9d14a5f32a4d97a57f4603b21146fd1aebf082e863d463c224ad623c17d8043d3bf083f0322408dd6ead6915ac6a4222ab51480eb6e11a8913348219515170d9df90d72d7363bbda3e327d19f98c0a856f98076380e788e602e8a2ae0a1930786874dc21a2e99abda15f35457cf1dcb440c4b41350d0eda352aad7f57a0adc8a6914da06460635ed21c4c11cd1a8ec778064c9f62efba2927828b23f94b16619a5520731c2c40ab8583c9f2
e73233d74b84f4877ce6b35bb1180300"})
r0 = openat$kvm(0xffffff9c, &(0x7f00000000c0), 0x0, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
ioctl$KVM_SET_USER_MEMORY_REGION(r1, 0x4020ae46, &(0x7f0000000080)={0x0, 0x0, 0x0, 0x2000, &(0x7f0000000000/0x2000)=nil})
ioctl$KVM_SET_REGS(r2, 0x4090ae82, &(0x7f0000000200)={[0x0, 0x6, 0xfffffffffffffffd, 0x0, 0xfffd, 0x1, 0x4002004c4, 0x1000, 0x0, 0x0, 0x0, 0x0, 0x3], 0x25000, 0x2011c0})
ioctl$KVM_RUN(r2, 0xae80, 0x0)
ioctl$KVM_PRE_FAULT_MEMORY(r2, 0xc040aed5, &(0x7f0000000000)={0x0, 0x18000})
ioctl$KVM_SET_PIT2(0xffffffffffffffff, 0x4070aea0, &(0x7f0000000100)={[{0x7ff, 0x93, 0x0, 0xc0, 0xc0, 0x92, 0x85, 0x8, 0x6, 0xa, 0x0, 0x7, 0x8001}, {0x5, 0x2, 0xf9, 0x8, 0x7c, 0xf, 0xd, 0x1, 0x5, 0x3, 0x7, 0xa, 0x7}, {0x7, 0x71b0, 0x3, 0x3, 0xf8, 0x1, 0x8, 0x3, 0x8, 0x82, 0xc, 0xa4, 0x6}], 0xfffffffa})

^ permalink raw reply	[flat|nested] 16+ messages in thread
[parent not found: <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>]
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  [not found] ` <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>
@ 2026-02-20 16:54 ` Sean Christopherson
  2026-02-21  0:01 ` Edgecombe, Rick P
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2026-02-20 16:54 UTC (permalink / raw)
  To: Edgecombe, Rick P; +Cc: kvm, linux-kernel

+lists, because I'm confident there's no host attack.

On Thu, Feb 19, 2026, Edgecombe, Rick P wrote:
> On Wed, 2026-02-18 at 16:22 -0800, Sean Christopherson wrote:
> > In practice, the flaw is benign (other than the new WARN) as it only
> > affects guests that ignore guest.MAXPHYADDR (e.g. on CPUs with 52-bit
> > physical addresses but only 4-level paging) or guests being run by a
> > misbehaving userspace VMM (e.g. a VMM that ignored allow_smaller_maxphyaddr
> > or is pre-faulting bad addresses).
>
> I tried to look at whether this is true from a hurt-the-host perspective.
>
> Did you consider the potential mismatch between the GFN passed to
> kvm_flush_remote_tlbs_range() and the PTEs for different GFNs that actually
> got touched?  For example in recover_huge_pages_range(), if it flushed the
> wrong range then the page table that got freed could still be in the
> intermediate translation caches?

I hadn't thought about this before you mentioned it, but I audited all code
paths, and all paths that lead to kvm_flush_remote_tlbs_range() use a
"sanitized" gfn, i.e. KVM never emits a flush for the gfn reported by the
fault.

Which meshes with a logical analysis as well: KVM only needs to flush when
removing/changing an entry, and so should always derive the to-be-flushed
ranges using the gfn that was used to make the change.  And the "bad" gfn
can never have TLB entries, because KVM never creates mappings.

FWIW, even if KVM screwed up something like recover_huge_pages_range(), it
wouldn't hurt the _host_.  Because from a host safety perspective, KVM x86
only needs to get it right in three paths: kvm_flush_shadow_all(),
__kvm_gmem_invalidate_begin(), and kvm_mmu_notifier_invalidate_range_start().

> I'm not sure how this HV flush stuff actually works in practice, especially
> on those details.  So not raising any red flags.  Just thought maybe worth
> considering.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-20 16:54 ` Sean Christopherson
@ 2026-02-21  0:01 ` Edgecombe, Rick P
  2026-02-21  0:07 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Edgecombe, Rick P @ 2026-02-21  0:01 UTC (permalink / raw)
  To: seanjc@google.com; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Fri, 2026-02-20 at 16:54 +0000, Sean Christopherson wrote:
> > Did you consider the potential mismatch between the GFN passed to
> > kvm_flush_remote_tlbs_range() and the PTEs for different GFNs that
> > actually got touched?  For example in recover_huge_pages_range(), if it
> > flushed the wrong range then the page table that got freed could still
> > be in the intermediate translation caches?
>
> I hadn't thought about this before you mentioned it, but I audited all code
> paths, and all paths that lead to kvm_flush_remote_tlbs_range() use a
> "sanitized" gfn, i.e. KVM never emits a flush for the gfn reported by the
> fault.

Doh, sorry.

> Which meshes with a logical analysis as well: KVM only needs to flush when
> removing/changing an entry, and so should always derive the to-be-flushed
> ranges using the gfn that was used to make the change.
>
> And the "bad" gfn can never have TLB entries, because KVM never creates
> mappings.

Oh.  I was under the impression that the fault gets its GPA bits stripped and
ends up creating the mapping at a different, wrong GPA.  So if some optimized
GFN-targeting flush was pointed at the unstripped GPA then it could miss the
GPA that actually got mapped and made it into the TLB.  Anyway, it seems moot.

> FWIW, even if KVM screwed up something like recover_huge_pages_range(), it
> wouldn't hurt the _host_.  Because from a host safety perspective, KVM x86
> only needs to get it right in three paths: kvm_flush_shadow_all(),
> __kvm_gmem_invalidate_begin(), and
> kvm_mmu_notifier_invalidate_range_start().

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-21  0:01 ` Edgecombe, Rick P
@ 2026-02-21  0:07 ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2026-02-21  0:07 UTC (permalink / raw)
  To: Rick P Edgecombe; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Sat, Feb 21, 2026, Rick P Edgecombe wrote:
> On Fri, 2026-02-20 at 16:54 +0000, Sean Christopherson wrote:
> > Which meshes with a logical analysis as well: KVM only needs to flush when
> > removing/changing an entry, and so should always derive the to-be-flushed
> > ranges using the gfn that was used to make the change.
> >
> > And the "bad" gfn can never have TLB entries, because KVM never creates
> > mappings.
>
> Oh.  I was under the impression that the fault gets its GPA bits stripped
> and ends up creating the mapping at a different, wrong GPA.

It does (by KVM, not by hardware).  The above is just trying to clarify that
we don't have to worry about the GFN from the fault, either.

> So if some optimized GFN-targeting flush was pointed at the unstripped GPA
> then it could miss the GPA that actually got mapped and made it into the
> TLB.  Anyway, it seems moot.

Yeah, we're on the same page.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-19  0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
  2026-02-19  0:23 ` Sean Christopherson
  [not found] ` <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>
@ 2026-02-21  0:08 ` Edgecombe, Rick P
  2026-02-21  0:49 ` Sean Christopherson
  2026-02-23 11:12 ` Huang, Kai
  2026-03-05  7:55 ` Yan Zhao
  4 siblings, 1 reply; 16+ messages in thread
From: Edgecombe, Rick P @ 2026-02-21  0:08 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Zhao, Yan Y,
	yosry.ahmed@linux.dev

On Wed, 2026-02-18 at 16:22 -0800, Sean Christopherson wrote:
> +static void reset_tdp_unmappable_mask(struct kvm_mmu *mmu)
> +{
> +	int max_addr_bit;
> +
> +	switch (mmu->root_role.level) {
> +	case PT64_ROOT_5LEVEL:
> +		max_addr_bit = 52;
> +		break;
> +	case PT64_ROOT_4LEVEL:
> +		max_addr_bit = 48;
> +		break;
> +	case PT32E_ROOT_LEVEL:
> +		max_addr_bit = 32;
> +		break;
> +	default:
> +		WARN_ONCE(1, "Unhandled root level %u\n", mmu->root_role.level);
> +		mmu->unmappable_mask = 0;

Would it be better to set max_addr_bit to 0 and let rsvd_bits() set it below?
Then the unknown case is safer about rejecting things.

> +		return;
> +	}
> +
> +	mmu->unmappable_mask = rsvd_bits(max_addr_bit, 63);
> +}
> +

Gosh, this forced me to expand my understanding of how the guest and host
page levels get glued together.  Hopefully this is not too far off...

In the patch this function is passed both guest_mmu and root_mmu.  So
sometimes it's going to be an L1 GPA address, and sometimes (for AMD nested?)
it's going to be an L2 GVA.  For the GVA case I don't see how PT32_ROOT_LEVEL
can be omitted.  It would hit the warning?

But also the '5' case is weird because as a GVA the max address bits should
be 57 and as a GPA they should be 54.  And the TDP side uses 4 and 5
specifically, so the PT64_ prefix just happens to match.

So I'd think this needs a version for GVA and one for GPA.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-21  0:08 ` Edgecombe, Rick P
@ 2026-02-21  0:49 ` Sean Christopherson
  2026-02-23 23:23 ` Edgecombe, Rick P
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2026-02-21  0:49 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: pbonzini@redhat.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yan Y Zhao, yosry.ahmed@linux.dev

On Sat, Feb 21, 2026, Rick P Edgecombe wrote:
> On Wed, 2026-02-18 at 16:22 -0800, Sean Christopherson wrote:
> > +static void reset_tdp_unmappable_mask(struct kvm_mmu *mmu)
> > +{
> > +	int max_addr_bit;
> > +
> > +	switch (mmu->root_role.level) {
> > +	case PT64_ROOT_5LEVEL:
> > +		max_addr_bit = 52;
> > +		break;
> > +	case PT64_ROOT_4LEVEL:
> > +		max_addr_bit = 48;
> > +		break;
> > +	case PT32E_ROOT_LEVEL:
> > +		max_addr_bit = 32;
> > +		break;
> > +	default:
> > +		WARN_ONCE(1, "Unhandled root level %u\n", mmu->root_role.level);
> > +		mmu->unmappable_mask = 0;
>
> Would it be better to set max_addr_bit to 0 and let rsvd_bits() set it
> below?  Then the unknown case is safer about rejecting things.

No, because speaking from experience, rejecting isn't safer (I had a brain
fart and thought legacy shadow paging was also affected).  There's no danger
to the host (other than the WARN itself), and so safety here is all about the
guest.

Setting unmappable_mask to -1ull is all but guaranteed to kill the guest,
because KVM will reject all faults.  Setting unmappable_mask to 0 is only
problematic if the guest and/or userspace is misbehaving, and even then, the
worst case scenario isn't horrific, all things considered.

> > +		return;
> > +	}
> > +
> > +	mmu->unmappable_mask = rsvd_bits(max_addr_bit, 63);
> > +}
> > +
>
> Gosh, this forced me to expand my understanding of how the guest and host
> page levels get glued together.  Hopefully this is not too far off...
>
> In the patch this function is passed both guest_mmu and root_mmu.  So
> sometimes it's going to be an L1 GPA address, and sometimes (for AMD
> nested?) it's going to be an L2 GVA.  For the GVA case I don't see how
> PT32_ROOT_LEVEL can be omitted.  It would hit the warning?

No, it's always a GPA.  root_mmu translates L1 GPA => L0 GPA and L1 GVA =>
L1 GPA*; guest_mmu translates L2 GPA => L0 GPA; nested_mmu translates L2 GVA
=> L2 GPA.

Note!  The asterisk is that root_mmu is also used when L2 is active if L1 is
NOT using TDP, either because KVM isn't using TDP, or because the L1
hypervisor decided not to.  In those cases, L2 GPA == L1 GPA from KVM's
perspective, because the L1 hypervisor is responsible for shadowing L2 GVA =>
L1 GPA.  And root_mmu can also translate L2 GPA => L0 GPA and L2 GVA => L2
GPA (again, L1 GPA == L2 GPA).

> But also the '5' case is weird because as a GVA the max address bits should
> be 57 and as a GPA they should be 54.

52, i.e. the architectural max MAXPHYADDR.

> And the TDP side uses 4 and 5 specifically, so the PT64_ prefix just
> happens to match.

No, it's not a coincidence.  The "truncation" to 52 bits is an architectural
quirk.  Long ago, people decided 52 bits of PA were enough for anyone, and so
repurposed bits 63:52 for e.g. NX, SUPPRESS_VE, and software-available bits.
I.e. conceptually, 5-level paging allows for 57 bits of addressing, but EPT
and NPT define bits 63:52 to be other things.

> So I'd think this needs a version for GVA and one for GPA.

No, see the last paragraph in the changelog.

Side topic, if you have _any_ idea for better names than guest_mmu vs.
nested_mmu, speak up.  This is like the fifth? time I've had a discussion
about how awful those names are, but we've yet to come up with names that
suck less.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-21  0:49 ` Sean Christopherson
@ 2026-02-23 23:23 ` Edgecombe, Rick P
  2026-02-24  1:49 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Edgecombe, Rick P @ 2026-02-23 23:23 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
	Zhao, Yan Y, yosry.ahmed@linux.dev

On Fri, 2026-02-20 at 16:49 -0800, Sean Christopherson wrote:
> On Sat, Feb 21, 2026, Rick P Edgecombe wrote:
> > On Wed, 2026-02-18 at 16:22 -0800, Sean Christopherson wrote:
> > > +static void reset_tdp_unmappable_mask(struct kvm_mmu *mmu)
> > > +{
> > > +	int max_addr_bit;
> > > +
> > > +	switch (mmu->root_role.level) {
> > > +	case PT64_ROOT_5LEVEL:
> > > +		max_addr_bit = 52;
> > > +		break;
> > > +	case PT64_ROOT_4LEVEL:
> > > +		max_addr_bit = 48;
> > > +		break;
> > > +	case PT32E_ROOT_LEVEL:
> > > +		max_addr_bit = 32;
> > > +		break;
> > > +	default:
> > > +		WARN_ONCE(1, "Unhandled root level %u\n", mmu->root_role.level);
> > > +		mmu->unmappable_mask = 0;
> >
> > Would it be better to set max_addr_bit to 0 and let rsvd_bits() set
> > it below?  Then the unknown case is safer about rejecting things.
>
> No, because speaking from experience, rejecting isn't safer (I had a
> brain fart and thought legacy shadow paging was also affected).
> There's no danger to the host (other than the WARN itself), and so
> safety here is all about the guest.
>
> Setting unmappable_mask to -1ull is all but guaranteed to kill the
> guest, because KVM will reject all faults.  Setting unmappable_mask
> to 0 is only problematic if the guest and/or userspace is
> misbehaving, and even then, the worst case scenario isn't horrific,
> all things considered.

Confused MM code makes me nervous, but fair enough.

> > > +		return;
> > > +	}
> > > +
> > > +	mmu->unmappable_mask = rsvd_bits(max_addr_bit, 63);
> > > +}
> > > +
> >
> > Gosh, this forced me to expand my understanding of how the guest
> > and host page levels get glued together.  Hopefully this is not too
> > far off...
> >
> > In the patch this function is passed both guest_mmu and root_mmu.
> > So sometimes it's going to be an L1 GPA address, and sometimes (for
> > AMD nested?) it's going to be an L2 GVA.  For the GVA case I don't
> > see how PT32_ROOT_LEVEL can be omitted.  It would hit the warning?
>
> No, it's always a GPA.  root_mmu translates L1 GPA => L0 GPA and L1
> GVA => L1 GPA*; guest_mmu translates L2 GPA => L0 GPA; nested_mmu
> translates L2 GVA => L2 GPA.
>
> Note!  The asterisk is that root_mmu is also used when L2 is active
> if L1 is NOT using TDP, either because KVM isn't using TDP, or
> because the L1 hypervisor decided not to.  In those cases, L2 GPA ==
> L1 GPA from KVM's perspective, because the L1 hypervisor is
> responsible for shadowing L2 GVA => L1 GPA.  And root_mmu can also
> translate L2 GPA => L0 GPA and L2 GVA => L2 GPA (again, L1 GPA == L2
> GPA).

I appreciate you taking the time to explain.  Tracing through with the
above, I realize I was under the wrong impression about how nested SVM
worked.

> > But also the '5' case is weird because as a GVA the max address
> > bits should be 57 and as a GPA they should be 54.
>
> 52, i.e. the architectural max MAXPHYADDR.

Oops, yes, I meant 52.  But if it is always a max physical address and not
trying to handle VAs too, why is PT32E_ROOT_LEVEL 32 instead of 36?  That
also sent me down the path of assuming GVAs were in the mix, but now I see
it is used for 32-bit SVM.

> [snip]
>
> > So I'd think this needs a version for GVA and one for GPA.
>
> No, see the last paragraph in the changelog.
>
> Side topic, if you have _any_ idea for better names than guest_mmu
> vs. nested_mmu, speak up.  This is like the fifth? time I've had a
> discussion about how awful those names are, but we've yet to come up
> with names that suck less.

I don't.  As above, I got confused by some wrong assumptions.  The names
seem reasonable.  Short notes on the translation input and output for each
MMU might be nice to have somewhere.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 23:23 ` Edgecombe, Rick P
@ 2026-02-24  1:49 ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2026-02-24  1:49 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
	Yan Y Zhao, yosry.ahmed@linux.dev

On Mon, Feb 23, 2026, Rick P Edgecombe wrote:
> On Fri, 2026-02-20 at 16:49 -0800, Sean Christopherson wrote:
> > > But also the '5' case is weird because as a GVA the max address bits
> > > should be 57 and as a GPA they should be 54.
> >
> > 52, i.e. the architectural max MAXPHYADDR.
>
> Oops, yes, I meant 52.  But if it is always a max physical address and not
> trying to handle VAs too, why is PT32E_ROOT_LEVEL 32 instead of 36?

Setting aside how any nNPT with a 32-bit kernel works for the moment, it
would be 52, not 36.  PT32E_ROOT_LEVEL is PAE, which per the SDM can address
52 bits of physical address space:

  PAE paging translates 32-bit linear addresses to 52-bit physical addresses.

PSE-36, a.k.a. 2-level 32-bit paging with CR4.PSE=1, is the horror that can
address 36 bits of physical address space by abusing reserved bits in the
"offset" portion of a huge 4MiB page.

Somewhat of an aside, KVM always uses 64-bit paging or PAE paging for its MMU
(or EPT, but that's basically 64-bit), and so when running on a 32-bit
kernel, KVM requires a PAE-enabled kernel to enable NPT, because hCR4 isn't
changed on VMRUN, i.e. the paging mode for KVM's MMU is tightly coupled to
the host kernel's paging mode.  Which is one of several reasons why nNPT is
a mess.

	/*
	 * KVM's MMU doesn't support using 2-level paging for itself, and thus
	 * NPT isn't supported if the host is using 2-level paging since host
	 * CR4 is unchanged on VMRUN.
	 */
	if (!IS_ENABLED(CONFIG_X86_64) && !IS_ENABLED(CONFIG_X86_PAE))
		npt_enabled = false;

As for how running a 32-bit PAE nNPT "works", I suspect it simply doesn't
from an architectural perspective.  32-bit KVM-on-KVM works (though I haven't
checked in a few years...) because Linux doesn't allocate kernel memory out
of high memory, i.e. L1 KVM won't feed "bad" addresses to L0 KVM, and
presumably QEMU doesn't manage to either.  I might be forgetting something
though?

If I get bored, or more likely when my curiosity gets the best of me, I'll
see how hardware behaves.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-19  0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
                   ` (2 preceding siblings ...)
  2026-02-21  0:08 ` Edgecombe, Rick P
@ 2026-02-23 11:12 ` Huang, Kai
  2026-02-23 16:54 ` Sean Christopherson
  2026-03-05  7:55 ` Yan Zhao
  4 siblings, 1 reply; 16+ messages in thread
From: Huang, Kai @ 2026-02-23 11:12 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com
Cc: Zhao, Yan Y, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Edgecombe, Rick P, yosry.ahmed@linux.dev

> @@ -3540,6 +3540,14 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
>  	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
>  		return RET_PF_EMULATE;
>
> +	/*
> +	 * Similarly, if KVM can't map the faulting address, don't attempt to
> +	 * install a SPTE because KVM will effectively truncate the address
> +	 * when walking KVM's page tables.
> +	 */
> +	if (unlikely(fault->addr & vcpu->arch.mmu->unmappable_mask))
> +		return RET_PF_EMULATE;
> +
>  	return RET_PF_CONTINUE;
>  }
>
> @@ -4681,6 +4689,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>  		return RET_PF_RETRY;
>  	}
>
> +	if (fault->addr & vcpu->arch.mmu->unmappable_mask) {
> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return -EFAULT;
> +	}
> +

If we set aside the case of shadow paging, do you think we should explicitly
strip the shared bit?

I think the MMU code currently always treats the shared bit as "mappable" (as
long as the real GPA is mappable), so logically it's better to strip the
shared bit first before checking the GPA.  In practice there's no problem,
because only TDX uses the shared bit, and it is within the "mappable" bits.

The odd case is if the fault->addr is an L2 GPA or L2 GVA; the shared bit
(which is a concept of the L1 guest) doesn't apply to it.

Btw, from hardware's point of view, does EPT/NPT silently drop the high
unmappable bits of the GPA, or does it generate some kind of EPT
violation/misconfig?  I tried to confirm from the spec, but I'm not sure.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 11:12 ` Huang, Kai
@ 2026-02-23 16:54   ` Sean Christopherson
  2026-02-23 20:48     ` Huang, Kai
  0 siblings, 1 reply; 16+ messages in thread

From: Sean Christopherson @ 2026-02-23 16:54 UTC (permalink / raw)
To: Kai Huang
Cc: pbonzini@redhat.com, Yan Y Zhao, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, Rick P Edgecombe, yosry.ahmed@linux.dev

On Mon, Feb 23, 2026, Kai Huang wrote:
> > @@ -3540,6 +3540,14 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> >  	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
> >  		return RET_PF_EMULATE;
> >  
> > +	/*
> > +	 * Similarly, if KVM can't map the faulting address, don't attempt to
> > +	 * install a SPTE because KVM will effectively truncate the address
> > +	 * when walking KVM's page tables.
> > +	 */
> > +	if (unlikely(fault->addr & vcpu->arch.mmu->unmappable_mask))
> > +		return RET_PF_EMULATE;
> > +
> >  	return RET_PF_CONTINUE;
> >  }
> >  
> > @@ -4681,6 +4689,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
> >  		return RET_PF_RETRY;
> >  	}
> >  
> > +	if (fault->addr & vcpu->arch.mmu->unmappable_mask) {
> > +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > +		return -EFAULT;
> > +	}
> > +
> 
> Setting aside the case of shadow paging, do you think we should explicitly
> strip the shared bit?
> 
> I think the MMU code currently always treats the shared bit as "mappable"
> (as long as the real GPA is mappable), so logically it's better to strip the
> shared bit first before checking the GPA.  But in practice there's no
> problem because only TDX uses a shared bit and it is within the 'mappable'
> bits.

I don't think so.  Even though the SHARED bit has special semantics, it's
still very much an address bit in the current architecture.

> But the odd thing is that if fault->addr is an L2 GPA or L2 GVA, then the
> shared bit (which is a concept of the L1 guest) doesn't apply to it.
> 
> Btw, from hardware's point of view, does EPT/NPT silently drop the high
> unmappable bits of the GPA, or does it generate some kind of EPT
> violation/misconfig?

EPT violation.  The SDM says:

  With 4-level EPT, bits 51:48 of the guest-physical address must all be zero;
  otherwise, an EPT violation occurs (see Section 30.3.3).

I can't find anything in the APM (shocker, /s) that clarifies the exact NPT
behavior.  It barely even alludes to the use of hCR4.LA57 for controlling the
depth of the walk.  But I'm fairly certain NPT behaves identically.
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 16:54 ` Sean Christopherson
@ 2026-02-23 20:48   ` Huang, Kai
  2026-02-23 21:25     ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread

From: Huang, Kai @ 2026-02-23 20:48 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
    Zhao, Yan Y, Edgecombe, Rick P, yosry.ahmed@linux.dev

On Mon, 2026-02-23 at 08:54 -0800, Sean Christopherson wrote:
> On Mon, Feb 23, 2026, Kai Huang wrote:
> > > @@ -3540,6 +3540,14 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> > >  	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
> > >  		return RET_PF_EMULATE;
> > >  
> > > +	/*
> > > +	 * Similarly, if KVM can't map the faulting address, don't attempt to
> > > +	 * install a SPTE because KVM will effectively truncate the address
> > > +	 * when walking KVM's page tables.
> > > +	 */
> > > +	if (unlikely(fault->addr & vcpu->arch.mmu->unmappable_mask))
> > > +		return RET_PF_EMULATE;
> > > +
> > >  	return RET_PF_CONTINUE;
> > >  }
> > >  
> > > @@ -4681,6 +4689,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
> > >  		return RET_PF_RETRY;
> > >  	}
> > >  
> > > +	if (fault->addr & vcpu->arch.mmu->unmappable_mask) {
> > > +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > 
> > Setting aside the case of shadow paging, do you think we should explicitly
> > strip the shared bit?
> > 
> > I think the MMU code currently always treats the shared bit as "mappable"
> > (as long as the real GPA is mappable), so logically it's better to strip
> > the shared bit first before checking the GPA.  But in practice there's no
> > problem because only TDX uses a shared bit and it is within the 'mappable'
> > bits.
> 
> I don't think so.  Even though the SHARED bit has special semantics, it's
> still very much an address bit in the current architecture.

I guess we can safely assume this is true for Intel.  I'm not sure about AMD,
but AMD doesn't use a shared bit in KVM, so it's safe in practice either way.

> > But the odd thing is that if fault->addr is an L2 GPA or L2 GVA, then the
> > shared bit (which is a concept of the L1 guest) doesn't apply to it.
> > 
> > Btw, from hardware's point of view, does EPT/NPT silently drop the high
> > unmappable bits of the GPA, or does it generate some kind of EPT
> > violation/misconfig?
> 
> EPT violation.  The SDM says:
> 
>   With 4-level EPT, bits 51:48 of the guest-physical address must all be zero;
>   otherwise, an EPT violation occurs (see Section 30.3.3).
> 
> I can't find anything in the APM (shocker, /s) that clarifies the exact NPT
> behavior.  It barely even alludes to the use of hCR4.LA57 for controlling
> the depth of the walk.  But I'm fairly certain NPT behaves identically.

Then in the case of nested EPT (ditto for NPT), shouldn't L0 emulate a VMEXIT
to L1 if fault->addr exceeds the mappable bits?
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 20:48 ` Huang, Kai
@ 2026-02-23 21:25   ` Sean Christopherson
  2026-02-23 21:44     ` Huang, Kai
  0 siblings, 1 reply; 16+ messages in thread

From: Sean Christopherson @ 2026-02-23 21:25 UTC (permalink / raw)
To: Kai Huang
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
    Yan Y Zhao, Rick P Edgecombe, yosry.ahmed@linux.dev

On Mon, Feb 23, 2026, Kai Huang wrote:
> On Mon, 2026-02-23 at 08:54 -0800, Sean Christopherson wrote:
> > On Mon, Feb 23, 2026, Kai Huang wrote:
> > > But the odd thing is that if fault->addr is an L2 GPA or L2 GVA, then
> > > the shared bit (which is a concept of the L1 guest) doesn't apply to it.
> > > 
> > > Btw, from hardware's point of view, does EPT/NPT silently drop the high
> > > unmappable bits of the GPA, or does it generate some kind of EPT
> > > violation/misconfig?
> > 
> > EPT violation.  The SDM says:
> > 
> >   With 4-level EPT, bits 51:48 of the guest-physical address must all be zero;
> >   otherwise, an EPT violation occurs (see Section 30.3.3).
> > 
> > I can't find anything in the APM (shocker, /s) that clarifies the exact
> > NPT behavior.  It barely even alludes to the use of hCR4.LA57 for
> > controlling the depth of the walk.  But I'm fairly certain NPT behaves
> > identically.
> 
> Then in the case of nested EPT (ditto for NPT), shouldn't L0 emulate a
> VMEXIT to L1 if fault->addr exceeds the mappable bits?

Huh.  Yes, for sure.  I was expecting FNAME(walk_addr_generic) to handle that,
but AFAICT it doesn't.  Assuming I'm not missing something, that should be
fixed before landing this patch, otherwise I believe KVM would terminate the
entire VM if L2 accesses memory that L1 can't map.
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-23 21:25 ` Sean Christopherson
@ 2026-02-23 21:44   ` Huang, Kai
  0 siblings, 0 replies; 16+ messages in thread

From: Huang, Kai @ 2026-02-23 21:44 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
    Zhao, Yan Y, Edgecombe, Rick P, yosry.ahmed@linux.dev

On Mon, 2026-02-23 at 13:25 -0800, Sean Christopherson wrote:
> On Mon, Feb 23, 2026, Kai Huang wrote:
> > Then in the case of nested EPT (ditto for NPT), shouldn't L0 emulate a
> > VMEXIT to L1 if fault->addr exceeds the mappable bits?
> 
> Huh.  Yes, for sure.  I was expecting FNAME(walk_addr_generic) to handle
> that, but AFAICT it doesn't.

AFAICT too.  It goes straight to the page table walk without checking
beforehand.

> Assuming I'm not missing something, that should be fixed before landing
> this patch, otherwise I believe KVM would terminate the entire VM if L2
> accesses memory that L1 can't map.

Yeah, agreed.
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-02-19  0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
                   ` (3 preceding siblings ...)
  2026-02-23 11:12 ` Huang, Kai
@ 2026-03-05  7:55 ` Yan Zhao
  2026-03-06 22:22   ` Sean Christopherson
  4 siblings, 1 reply; 16+ messages in thread

From: Yan Zhao @ 2026-03-05  7:55 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Rick Edgecombe, Yosry Ahmed

On Wed, Feb 18, 2026 at 04:22:41PM -0800, Sean Christopherson wrote:
> Track the mask of guest physical address bits that can actually be mapped
> by a given MMU instance that utilizes TDP, and either exit to userspace
> with -EFAULT or go straight to emulation without creating an SPTE (for
> emulated MMIO) if KVM can't map the address.  Attempting to create an SPTE
> can cause KVM to drop the unmappable bits, and thus install a bad SPTE.
> E.g. when starting a walk, the TDP MMU will round the GFN based on the
> root level, and drop the upper bits.
> 
> Exit with -EFAULT in the unlikely scenario userspace is misbehaving and
> created a memslot that can't be addressed, e.g. if userspace installed
> memory above the guest.MAXPHYADDR defined in CPUID, as there's nothing KVM
> can do to make forward progress, and there _is_ a memslot for the address.
> For emulated MMIO, KVM can at least kick the bad address out to userspace
> via a normal MMIO exit.
> 
> The flaw has existed for a very long time, and was exposed by commit
> 988da7820206 ("KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults")
> thanks to a syzkaller program that prefaults memory at GPA 0x1000000000000
> and then faults in memory at GPA 0x0 (the extra-large GPA gets wrapped to
> '0').

If the scenario is reversed, i.e. with A/D bits disabled, prefault memory at
GPA 0x0 and then the guest reads memory at GPA 0x1000000000000, would
fast_page_fault() fix up the wrong, wrapped sptep for GPA 0x1000000000000?

Do we need to check fault->addr in fast_page_fault() as well?

> WARNING: arch/x86/kvm/mmu/tdp_mmu.c:1183 at kvm_tdp_mmu_map+0x5c3/0xa30 [kvm], CPU#125: syz.5.22/18468
> CPU: 125 UID: 0 PID: 18468 Comm: syz.5.22 Tainted: G S W 6.19.0-smp--23879af241d6-next #57 NONE
> Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> Hardware name: Google Izumi-EMR/izumi, BIOS 0.20250917.0-0 09/17/2025
> RIP: 0010:kvm_tdp_mmu_map+0x5c3/0xa30 [kvm]
> Call Trace:
>  <TASK>
>  kvm_tdp_page_fault+0x107/0x140 [kvm]
>  kvm_mmu_do_page_fault+0x121/0x200 [kvm]
>  kvm_arch_vcpu_pre_fault_memory+0x18c/0x230 [kvm]
>  kvm_vcpu_pre_fault_memory+0x116/0x1e0 [kvm]
>  kvm_vcpu_ioctl+0x3a5/0x6b0 [kvm]
>  __se_sys_ioctl+0x6d/0xb0
>  do_syscall_64+0x8d/0x900
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>  </TASK>
> 
> In practice, the flaw is benign (other than the new WARN) as it only
> affects guests that ignore guest.MAXPHYADDR (e.g. on CPUs with 52-bit
> physical addresses but only 4-level paging) or guests being run by a
> misbehaving userspace VMM (e.g. a VMM that ignored allow_smaller_maxphyaddr
> or is pre-faulting bad addresses).
> 
> For non-TDP shadow paging, always clear the unmappable mask as the flaw
> only affects GPAs.  For 32-bit paging, 64-bit virtual addresses simply
> don't exist.  Even when software can shove a 64-bit address somewhere,
> e.g. into SYSENTER_EIP, the value is architecturally truncated before it
> reaches the page table walker.  And for 64-bit paging, KVM's use of
> 4-level vs. 5-level paging is tied to the guest's CR4.LA57, i.e. KVM
> won't observe a 57-bit virtual address with a 4-level MMU.
* Re: [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable
  2026-03-05  7:55 ` Yan Zhao
@ 2026-03-06 22:22   ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread

From: Sean Christopherson @ 2026-03-06 22:22 UTC (permalink / raw)
To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Rick Edgecombe, Yosry Ahmed

On Thu, Mar 05, 2026, Yan Zhao wrote:
> On Wed, Feb 18, 2026 at 04:22:41PM -0800, Sean Christopherson wrote:
> > The flaw has existed for a very long time, and was exposed by commit
> > 988da7820206 ("KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults")
> > thanks to a syzkaller program that prefaults memory at GPA 0x1000000000000
> > and then faults in memory at GPA 0x0 (the extra-large GPA gets wrapped to
> > '0').
> 
> If the scenario is reversed, i.e. with A/D bits disabled, prefault memory at
> GPA 0x0 and then the guest reads memory at GPA 0x1000000000000, would
> fast_page_fault() fix up the wrong, wrapped sptep for GPA 0x1000000000000?
> 
> Do we need to check fault->addr in fast_page_fault() as well?

Ugh, yeah, good catch!
Thread overview: 16+ messages
-- links below jump to the message on this page --
2026-02-19 0:22 [PATCH] KVM: x86/mmu: Don't create SPTEs for addresses that aren't mappable Sean Christopherson
2026-02-19 0:23 ` Sean Christopherson
[not found] ` <c06466c636da3fc1dc14dc09260981a2554c7cc2.camel@intel.com>
2026-02-20 16:54 ` Sean Christopherson
2026-02-21 0:01 ` Edgecombe, Rick P
2026-02-21 0:07 ` Sean Christopherson
2026-02-21 0:08 ` Edgecombe, Rick P
2026-02-21 0:49 ` Sean Christopherson
2026-02-23 23:23 ` Edgecombe, Rick P
2026-02-24 1:49 ` Sean Christopherson
2026-02-23 11:12 ` Huang, Kai
2026-02-23 16:54 ` Sean Christopherson
2026-02-23 20:48 ` Huang, Kai
2026-02-23 21:25 ` Sean Christopherson
2026-02-23 21:44 ` Huang, Kai
2026-03-05 7:55 ` Yan Zhao
2026-03-06 22:22 ` Sean Christopherson